Hivanson Week 9
Purpose
The purpose of this procedure is to run an ANOVA test on the up- or down-regulation of all yeast genes in the S. cerevisiae genome. We will view different variations of p-values to understand how certain cutoffs can be more or less stringent, and learn when each type should be used, as well as what level of significance is appropriate for different uses.
Methods/Results
Strain name: ∆CIN5 strain
Filename: HI_BIOL367_S24_microarray-data_dCIN5.xcls
Number of replicates per strain: 4
Timepoints: 15 minutes, 30 minutes, 60 minutes, 90 minutes, 120 minutes
Statistical Analysis Part 1: ANOVA
- I created a new worksheet and named it "dCIN5_ANOVA"
- I copied all data from the "Master_Sheet" worksheet and pasted it into dCIN5_ANOVA.
- I created five column headers of the form dCIN5_AvgLogFC_(TIME), where (TIME) is t15, t30, t60, t90, or t120.
- In the cell below the dCIN5_AvgLogFC_t15 header, I typed
=AVERAGE(
- Then I highlighted all the data in row 2 associated with t15, pressed the closing paren key, and pressed the "enter" key.
- I extended this formula down for all genes.
- I repeated this averaging process with the t30, t60, t90, and the t120 data.
- In the first empty column to the right of the dCIN5_AvgLogFC_t120 calculation, I created the column header dCIN5_ss_HO.
- In the first cell below this header, I typed
=SUMSQ(
- I highlighted all the LogFC data in row 2 (stopping before the AvgLogFC columns), pressed the closing paren key, and pressed the "enter" key.
- In the next empty columns to the right of dCIN5_ss_HO, I created the column headers dCIN5_ss_(TIME), with the same (TIME) values used for the dCIN5_AvgLogFC_(TIME) headers.
- In the first cell below the header dCIN5_ss_t15, I typed
=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2
and hit enter.
- I extended this formula down for all genes.
- I repeated this computation for the t30 through t120 data points.
- In the first column to the right of dCIN5_ss_t120, I created the column header dCIN5_SS_full.
- In the first cell below this header, I typed
=sum(<range of cells containing "ss" for each timepoint>)
and hit enter.
- In the next two columns to the right, I created the headers dCIN5_Fstat and dCIN5_p-value.
- In the first cell of the dCIN5_Fstat column, I typed
=((20-5)/5)*(<dCIN5_ss_HO>-<dCIN5_SS_full>)/<dCIN5_SS_full>
and hit enter.
- I replaced the phrase <dCIN5_ss_HO> with the cell designation.
- I replaced the phrase <dCIN5_SS_full> with the cell designation.
- I copied this to the whole column.
- In the first cell below the dCIN5_p-value header, I typed
=FDIST(<dCIN5_Fstat>,5,20-5)
- I performed a quick sanity check to see if all of these computations were done correctly.
- I filtered the dCIN5_p-value column so that the p value has to be less than 0.05.
- Before further calculation, I undid this filter.
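The spreadsheet columns above can be cross-checked for a single gene with a minimal Python sketch. It assumes blank cells have already been dropped and that every timepoint has at least one value; the function name `anova_f_stat` and the example numbers are hypothetical, not taken from the workbook. It reproduces only the F statistic, since the p value step (FDIST) needs an F distribution that the Python standard library does not provide.

```python
def anova_f_stat(groups, n_total=20, n_groups=5):
    """F statistic for one gene, mirroring the worksheet columns.

    groups: one list of log fold changes per timepoint (hypothetical input layout).
    n_total and n_groups match the 20 and 5 used in the Fstat formula.
    """
    all_values = [x for g in groups for x in g]
    # ss_HO column: sum of squares of all log fold changes (SUMSQ)
    ss_h0 = sum(x * x for x in all_values)
    # ss_(TIME) columns summed into SS_full: SUMSQ - COUNTA * mean^2 per timepoint
    ss_full = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        ss_full += sum(x * x for x in g) - len(g) * mean ** 2
    return ((n_total - n_groups) / n_groups) * (ss_h0 - ss_full) / ss_full


# Hypothetical log fold changes: 5 timepoints x 4 replicates
example = [
    [1.2, 0.9, 1.1, 1.4],
    [0.8, 1.0, 0.7, 1.1],
    [0.2, -0.1, 0.3, 0.0],
    [-0.9, -1.1, -0.7, -1.0],
    [-0.3, -0.2, -0.5, -0.1],
]
print(anova_f_stat(example))
```

A large F statistic means the timepoint averages explain much more of the variation than a model with no change at all, which is what FDIST then converts into a p value.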
Calculating the Bonferroni p value Correction
- I labeled the next two columns to the right with the same label, dCIN5_Bonferroni_p-value.
- I typed the equation
=<dCIN5_p-value>*6189
and copied it to all genes.
- I replaced any corrected p value greater than 1 with the number 1 by typing the following formula into the first cell below the second dCIN5_Bonferroni_p-value header:
=IF(<dCIN5_Bonferroni_p-value>>1,1,<dCIN5_Bonferroni_p-value>)
and copied it to all genes.
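The two Bonferroni columns amount to one operation: multiply by the number of tests, then cap at 1. A minimal Python sketch, where the function name `bonferroni` is my own and 6189 is the gene count used in the formula above:

```python
def bonferroni(p_value, n_tests=6189):
    """Multiply the unadjusted p value by the number of tests and cap at 1."""
    return min(p_value * n_tests, 1.0)


# IMD3's unadjusted p value as reported later in this entry
print(bonferroni(0.111670609))  # 1.0
```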
Calculating the Benjamini & Hochberg p value Correction
- I inserted a new worksheet named "b-h_ANOVA".
- I copied the "MasterIndex", "ID", and "Standard Name" columns from Master_Sheet_dCIN5 and pasted them into the first three columns of the new worksheet.
- I copied my unadjusted p values from the dCIN5_ANOVA worksheet and pasted them into Column D using "paste values."
- I selected all of columns A, B, C, and D and sorted them by ascending values on Column D.
- I typed the header "Rank" in cell E1 and created a series of numbers in ascending order from 1 to 6189 in this column.
- To calculate the Benjamini and Hochberg p value correction, I typed dCIN5_B-H_p-value in cell F1. I typed the following formula in cell F2:
=(D2*6189)/E2
and pressed enter. I copied that equation to the entire column.
- I typed "dCIN5_B-H_p-value" into cell G1.
- I typed the following formula into cell G2:
=IF(F2>1,1,F2)
and pressed enter. I copied that equation to the entire column.
- I selected columns A through G.
- I sorted them by my Column A MasterIndex in ascending order.
- I copied column G and used paste values to paste it into the next column on the right of my dCIN5_ANOVA sheet.
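The sort, rank, and re-sort steps above can be sketched in Python as follows. This mirrors the worksheet exactly (p × number of genes / rank, capped at 1); it is not the full Benjamini and Hochberg step-up procedure found in statistics packages, which additionally forces the adjusted values to be non-decreasing with rank. The function name `bh_adjust_worksheet` and the example p values are hypothetical.

```python
def bh_adjust_worksheet(p_values):
    """Worksheet-style B-H correction, returned in the original gene order."""
    m = len(p_values)
    # Sort gene indices by ascending p value (the "sort on Column D" step)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        # Column F: p * m / rank; Column G: replace anything above 1 with 1
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    # Writing by original index plays the role of re-sorting on MasterIndex
    return adjusted


print(bh_adjust_worksheet([0.04, 0.01, 0.5, 0.02]))
```

Writing the results back by original index avoids the second sort, which is the step where spreadsheet rows are easiest to misalign.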
Sanity Check: Number of genes significantly changed
- In the dCIN5_ANOVA worksheet, I filtered the unadjusted p value column to display only genes with a p value below each cutoff in turn: 0.05, 0.01, 0.001, and 0.0001.
- I used
=SUBTOTAL(3,A:A)
to count the visible rows, then subtracted 1 (for the header row) to get the number of genes that fit the filter.
- For the percentage, I used
=(100*(<subtotal>-1))/6189
Results are as follows:
- How many genes have p < 0.05? and what is the percentage (out of 6189)?
- 2290 genes; 37.0%
- How many genes have p < 0.01? and what is the percentage (out of 6189)?
- 1380 genes; 22.3%
- How many genes have p < 0.001? and what is the percentage (out of 6189)?
- 691 genes; 11.2%
- How many genes have p < 0.0001? and what is the percentage (out of 6189)?
- 358 genes; 5.8%
- I repeated the above steps for the Bonferroni-corrected p value of less than 0.05, and the Benjamini and Hochberg-corrected p value of less than 0.05. Results are as follows:
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
- 151 genes; 2.4%
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?
- 1453 genes; 23.5%
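The filter-and-count steps can be double-checked with a short Python sketch. The helper names `count_below` and `percent_of_genes` are hypothetical; the counts are the ones reported above, and the printed percentages should match the reported ones when rounded to one decimal place.

```python
def count_below(p_values, cutoff):
    """Number of genes whose p value is below the cutoff (what the filter shows)."""
    return sum(1 for p in p_values if p < cutoff)


def percent_of_genes(count, total=6189):
    """Percentage of all genes, as in =(100*(<subtotal>-1))/6189."""
    return 100 * count / total


# Counts reported in this entry, keyed by a label for the p value type and cutoff
reported = {
    "unadjusted p < 0.05": 2290,
    "unadjusted p < 0.01": 1380,
    "unadjusted p < 0.001": 691,
    "unadjusted p < 0.0001": 358,
    "Bonferroni p < 0.05": 151,
    "B-H p < 0.05": 1453,
}
for label, count in reported.items():
    print(f"{label}: {count} genes; {percent_of_genes(count):.1f}%")
```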
Find NSR1 in your dataset. What are its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the timepoints in the experiment?
- Unadjusted p value: 6.37625E-08
- Bonferroni-corrected p value: 0.000394626
- B-H-corrected p value: 2.19237E-05
- Average Log fold change @ 15 minutes: 4.070025
- Average Log fold change @ 30 minutes: 3.611475
- Average Log fold change @ 60 minutes: 4.2985
- Average Log fold change @ 90 minutes: -2.900925
- Average Log fold change @ 120 minutes: -0.9315
- NSR1's average log fold change is positive from 15 through 60 minutes (increased expression) and negative at 90 and 120 minutes (decreased expression). Because the unadjusted, Bonferroni-corrected, and B-H-corrected p values are all below 0.05, NSR1's expression changes significantly at one or more timepoints. The ANOVA does not identify which timepoints differ, so further testing would be needed to confirm that.
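As a consistency check on the NSR1 values above, the corrected p values can be recomputed from the unadjusted one using the worksheet formulas. The rank below is inferred from the reported numbers, not read from the worksheet, so it should be treated as an estimate (the block assumes the amsmath package for `aligned`).

```latex
\[
\begin{aligned}
p_{\text{Bonferroni}} &= p \times 6189 = 6.37625 \times 10^{-8} \times 6189 \approx 3.94626 \times 10^{-4} \\
\text{rank} &= \frac{p \times 6189}{p_{\text{B-H}}} \approx \frac{3.94626 \times 10^{-4}}{2.19237 \times 10^{-5}} \approx 18
\end{aligned}
\]
```

If that inference holds, NSR1 has roughly the 18th smallest unadjusted p value of the 6189 genes.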
What are IMD3's unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is IMD3's average Log fold change at each of the timepoints in the experiment?
- Unadjusted p value: 0.111670609
- Bonferroni-corrected p value: 1
- B-H-corrected p value: 0.232000469
- Average Log fold change @ 15 minutes: 1.638433333
- Average Log fold change @ 30 minutes: -0.100766667
- Average Log fold change @ 60 minutes: 1.659233333
- Average Log fold change @ 90 minutes: -0.608333333
- Average Log fold change @ 120 minutes: -0.168133333
- IMD3 does not significantly increase or decrease, as all p values are above 0.05.
Data & Files
Excel microarray data for ∆CIN5
Conclusion
The week 9 procedure walked us through how to run an ANOVA in Excel using DNA microarray data on the effects of cold stress on all genes in the S. cerevisiae genome. The results were several types of p values (unadjusted, Bonferroni, and Benjamini & Hochberg) with different levels of stringency. Based on the percentage of genes in the ∆CIN5 strain that were significant at each cutoff, stringency was lowest for unadjusted p < 0.05 (37.0%), followed by Benjamini & Hochberg p < 0.05 (23.5%), unadjusted p < 0.01 (22.3%), unadjusted p < 0.001 (11.2%), and unadjusted p < 0.0001 (5.8%), and highest for Bonferroni p < 0.05 (2.4%). This exploration allowed us to better understand how p values are used in statistics, and how different corrections to the p value may be applied. This was the first time that I learned about the trick of double-clicking the lower right corner of a cell to copy an equation to all cells in a column, which will be useful for biological databases and beyond. I also learned how to filter data in an Excel sheet. Both of these skills are fairly basic, but I was never formally taught Excel and had not looked into either function. Overall, I learned a lot about both Excel and statistics through this exercise.
Acknowledgments
I worked with my homework partner Natalija Stojanovic in class on 3/13/2024 and 3/18/2024 under the guidance of Dr. Kam Dahlquist. The procedure was adapted from LMU BioDB Week 9.
Except for what is noted above, this individual journal entry was completed by me and not copied from another source.
Hivanson (talk) 23:40, 20 March 2024 (PDT)
References
- Dahlquist, K. (2018, June 22). Global transcriptional response of wild type and transcription factor deletion strains of Saccharomyces cerevisiae to the environmental stress of cold shock and subsequent recovery. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83656
- LMU BioDB 2024. (2024). Week 9. Retrieved March 20, 2024, from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_9