Difference between revisions of "MSymond1 Week 9"
(→Sanity Check: Fixed B-H p value) |
(→ANOVA: copied the protocol from the assignment) |
||
Line 3: | Line 3: | ||
==Methods & Results== | ==Methods & Results== | ||
===ANOVA=== | ===ANOVA=== | ||
− | + | # A new worksheet was created, named "(STRAIN)_ANOVA" as appropriate. For example, you might call yours "wt_ANOVA" or "dCIN5_ANOVA". | |
+ | # All data was copied from the "Master_Sheet" worksheet and pasted it in the new worksheet. | ||
+ | # At the top of the first column to the right of your data, five column headers were created of the form (STRAIN)_AvgLogFC_(TIME) where STRAIN was your strain designation and (TIME) is 15, 30, 60, 90, and 120. | ||
+ | # In the cell below the (STRAIN)_AvgLogFC_t15 header, the syntax <code>=AVERAGE(</code> was typed | ||
+ | # Then all the data in row 2 associated with t15 was highlighted, the closing paren key was pressed (shift 0),and press the "enter" key. | ||
+ | # This cell contained the average of the log fold change data from the first gene at t=15 minutes. | ||
+ | # This cell was clicked on and position the cursor was positioned at the bottom right corner. The cursor changed to a thin black plus sign (not a chubby white one). When it did, it was double clicked, and the formula magically copied to the entire column of 6188 other genes. | ||
+ | # steps (4) through (8) were repeated with the t30, t60, t90, and the t120 data. | ||
+ | # Now in the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, the column header (STRAIN)_ss_HO was created. | ||
+ | # In the first cell below this header, <code>=SUMSQ(</code> was typed | ||
+ | # All the LogFC data in row 2 was highlighted (but not the AvgLogFC), the closing paren key was pressed (shift 0),and then the "enter" key. | ||
+ | # In the next empty column to the right of (STRAIN)_ss_HO, the column headers (STRAIN)_ss_(TIME) as in (3) were created. | ||
+ | # A note of how many data points you have at each time point for your strain was made. For most of the strains, it was be 4, but for the wild type it was be "4" or "5". | ||
+ | # In the first cell below the header (STRAIN)_ss_t15, <code>=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2</code> was typed and then enter. | ||
+ | #* The <code>COUNTA</code> function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values). | ||
+ | #* The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15. | ||
+ | #* The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value. | ||
+ | #* Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column. | ||
+ | # This computation was repeated for the t30 through t120 data points. | ||
+ | # In the first column to the right of (STRAIN)_ss_t120, the column header (STRAIN)_SS_full was created. | ||
+ | # In the first row below this header, <code>=sum(<range of cells containing "ss" for each timepoint>)</code> was typed and entered. | ||
+ | # In the next two columns to the right, the headers (STRAIN)_Fstat and (STRAIN)_p-value were created. | ||
+ | # it was Recalled the number of data points from (13): call that total n. | ||
+ | # In the first cell of the (STRAIN)_Fstat column, it was typed <code>=((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full></code> and entered. | ||
+ | #* the n was not typed, but instead the number from (13). Also noted that "5" is the number of timepoints.<!-- and the dSWI4 strain has 4 timepoints (it is missing t15).--> | ||
+ | #* the phrase (STRAIN)_ss_HO was replaced with the cell designation. | ||
+ | #* the phrase <(STRAIN)_SS_full> was replaced with the cell designation. | ||
+ | #* this was Copy to the whole column. | ||
+ | # In the first cell below the (STRAIN)_p-value header, it was typed <code>=FDIST(<(STRAIN)_Fstat>,5,n-5)</code> replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. <!--(Again, note that the number of timepoints is actually "4" for the dSWI4 strain)--> it was copied to the whole column. | ||
+ | # Before it was moved on to the next step, it was performed a quick sanity check to see if all of these computations were done correctly. | ||
+ | #* Cell A1 was clicked on and the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set. | ||
+ | #* The drop-down arrow on your (STRAIN)_p-value column was clicked on. It was selected, "Number Filters". In the window that appears, a criterion was set that will filter your data so that the p value has to be less than 0.05. | ||
+ | #* Excel only displayed the rows that corresponded to data meeting that filtering criterion. A number appeared in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly. | ||
+ | #* any filters were undone that have been applied before making any additional calculations. | ||
+ | |||
===Bonferroni and p-value correction=== | ===Bonferroni and p-value correction=== | ||
To calculate the Bonferroni p-value, the original p values were all multiplied by 6189. Then, in another column, the Bonferroni p values were all either changed to 1 if they were greater than 1, or they were reported as their Bonferroni p-values if they were less than 1. | To calculate the Bonferroni p-value, the original p values were all multiplied by 6189. Then, in another column, the Bonferroni p values were all either changed to 1 if they were greater than 1, or they were reported as their Bonferroni p-values if they were less than 1. |
Revision as of 20:42, 2 April 2024
Contents
Purpose
This lab was conducted in Microsoft Excel to analyze a microarray dataset. Each group in class was given a different dataset for a different strain of data. The strain used in the present study was the wild type. The ANOVA tests done on the dataset determined which genes had a p value of less than .05, and they were then further analyzed further to discover that an ANOVA test alone is not enough to determine if there is a statistically significant change in the genes over time.
Methods & Results
ANOVA
- A new worksheet was created, named "(STRAIN)_ANOVA" as appropriate. For example, you might call yours "wt_ANOVA" or "dCIN5_ANOVA".
- All data was copied from the "Master_Sheet" worksheet and pasted it in the new worksheet.
- At the top of the first column to the right of your data, five column headers were created of the form (STRAIN)_AvgLogFC_(TIME) where STRAIN was your strain designation and (TIME) is 15, 30, 60, 90, and 120.
- In the cell below the (STRAIN)_AvgLogFC_t15 header, the syntax
=AVERAGE(
was typed - Then all the data in row 2 associated with t15 was highlighted, the closing paren key was pressed (shift 0),and press the "enter" key.
- This cell contained the average of the log fold change data from the first gene at t=15 minutes.
- This cell was clicked on and position the cursor was positioned at the bottom right corner. The cursor changed to a thin black plus sign (not a chubby white one). When it did, it was double clicked, and the formula magically copied to the entire column of 6188 other genes.
- steps (4) through (8) were repeated with the t30, t60, t90, and the t120 data.
- Now in the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, the column header (STRAIN)_ss_HO was created.
- In the first cell below this header,
=SUMSQ(
was typed - All the LogFC data in row 2 was highlighted (but not the AvgLogFC), the closing paren key was pressed (shift 0),and then the "enter" key.
- In the next empty column to the right of (STRAIN)_ss_HO, the column headers (STRAIN)_ss_(TIME) as in (3) were created.
- A note of how many data points you have at each time point for your strain was made. For most of the strains, it was be 4, but for the wild type it was be "4" or "5".
- In the first cell below the header (STRAIN)_ss_t15,
=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2
was typed and then enter.- The
COUNTA
function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values). - The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
- The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
- Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
- The
- This computation was repeated for the t30 through t120 data points.
- In the first column to the right of (STRAIN)_ss_t120, the column header (STRAIN)_SS_full was created.
- In the first row below this header,
=sum(<range of cells containing "ss" for each timepoint>)
was typed and entered. - In the next two columns to the right, the headers (STRAIN)_Fstat and (STRAIN)_p-value were created.
- it was Recalled the number of data points from (13): call that total n.
- In the first cell of the (STRAIN)_Fstat column, it was typed
=((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full>
and entered.- the n was not typed, but instead the number from (13). Also noted that "5" is the number of timepoints.
- the phrase (STRAIN)_ss_HO was replaced with the cell designation.
- the phrase <(STRAIN)_SS_full> was replaced with the cell designation.
- this was Copy to the whole column.
- In the first cell below the (STRAIN)_p-value header, it was typed
=FDIST(<(STRAIN)_Fstat>,5,n-5)
replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. it was copied to the whole column. - Before it was moved on to the next step, it was performed a quick sanity check to see if all of these computations were done correctly.
- Cell A1 was clicked on and the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- The drop-down arrow on your (STRAIN)_p-value column was clicked on. It was selected, "Number Filters". In the window that appears, a criterion was set that will filter your data so that the p value has to be less than 0.05.
- Excel only displayed the rows that corresponded to data meeting that filtering criterion. A number appeared in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly.
- any filters were undone that have been applied before making any additional calculations.
Bonferroni and p-value correction
To calculate the Bonferroni p-value, the original p values were all multiplied by 6189. Then, in another column, the Bonferroni p values were all either changed to 1 if they were greater than 1, or they were reported as their Bonferroni p-values if they were less than 1.
Benjamini & Hochberg p-value correction
A new worksheet was created in Excel with all of the raw p-values, the p-values were ranked from least to greatest in this new worksheet. To calculate the new p-values, the p-values were multiplied by 6189 again, and then they were divided by their rank of the p-values. The same process was repeated from the Bonferroni calculations in which they were only reported if they were less than 1.
Sanity Check
- NSR1
- Unadjusted p-value: 2.86939E-10
- Bonferroni p-value: 1.77586E-06
- B-H p-value: 8.87931530097403E-07
- Average log
- 15: 3.279225
- 30: 3.621
- 60: 3.526525
- 90: -2.04985
- 120: -0.60622
- According to NSR1's unadjusted p-value, and its Bonferroni p-value, it does undergo change in expression during the experiment, but not according to the B-H p-value.
- MSN1
- Unadjusted p-value: 0.563798852
- Bonferroni p-value: 3489.351093
- B-H p-value: 0.679258535
- Average log
- 15: 0.1076
- 30: -0.46192
- 60: -0.47075
- 90: 0.16805
- 120: -0.18418
- According to all 3 p-values, it does not undergo change in expression during the experiment.
Data & Files
media:Symonds_Miller_BIOL367_S24_microarray-data_wt.xlsx.zip media:Symonds_BIOL367_S24_sample_p-value_slide.pdf
Conclusion
This experiment gives many implications to the viability of using p-values from ANOVA tests. Several of the p values that came out as less than .05 from the ANOVA test did not have a p-value of less than .05 for the Bonferroni test or the B-H test. Not to mention that the p-value being less than .05 is an arbitrary comparison to begin with. With a sample size this large, it is very important to understand that there will be many times the p values seem to indicate a significant result when there in fact is not one. The Bonferroni tests and B-H tests also yielded far less p values that were less than .05. It is very interesting that this has such large implications for the field of biology and most of the sciences, but this has never been taught to the experimenter before in this study.
Acknowledgements
I worked with my homework partner, Katie Miller during class as our professor, Dr. Dahlquist showed us the protocol for how to run the tests on the database in Excel. I did not communicate with Katie or Dr. Dahlquist outside of class for the assignment. Except for what is noted above, this individual journal entry was completed by me and not copied from another source. Msymond1 (talk) 21:21, 20 March 2024 (PDT)
References
LMU BioDB 2024. (2024). Week 9. Retrieved March 20, 2024, from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_9
User Page
Assignment Pages
Individual Journal Pages
- MSymond1 Week 1
- MSymond1 Week 2
- MSymond1 KMill104 Week 3
- NeMO_Week4
- MSymond1 Week 5
- MSymond1 Week 6
- MSymond1 Week 8
- MSymond1 Week 9
- MSymond1 Week 10
- MSymond1 Week 12
- MSymond1 Week 13
- MSymond1 Week 15