Dmadere Week 8

From LMU BioDB 2019

Purpose

The purpose of this experiment was to analyze microarray data from Saccharomyces cerevisiae to determine how gene expression changed as the yeast underwent and recovered from cold shock.

Method/Results

Statistical Analysis Part 1: ANOVA

  1. Created a new worksheet named “dCIN5_ANOVA”.
  2. Copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for dCIN5 and pasted them into the new worksheet. Copied the columns containing the data for dCIN5 and pasted them into the new worksheet.
  3. At the top of the first column to the right of the data, created five column headers of the form dCIN5_AvgLogFC_(TIME), where (TIME) is 15, 30, 60, 90, and 120.
  4. In the cell below the dCIN5_AvgLogFC_t15 header, typed =AVERAGE(
  5. Highlighted all the data in row 2 associated with t15, pressed the closing paren key (shift 0), and pressed the "enter" key.
  6. This cell contains the average of the log fold change data for the first gene at t=15 minutes.
  7. Clicked on this cell and positioned the cursor at the bottom right corner. When the cursor changed to a thin black plus sign (not a chubby white one), double-clicked, and the formula was copied down the entire column to the 6188 other genes.
  8. Repeated steps (4) through (7) with the t30, t60, t90, and t120 data.
  9. In the first empty column to the right of the dCIN5_AvgLogFC_t120 calculation, created the column header dCIN5_ss_HO.
  10. In the first cell below the header, typed =SUMSQ(
  11. Highlighted all the LogFC data in row 2 (but not the AvgLogFC), pressed the closing paren key (shift 0), and pressed the "enter" key.
  12. In the next empty column to the right of dCIN5_ss_HO, created the column headers dCIN5_ss_(TIME) as in (3).
  13. Made a note of how many data points there were at each time point for dCIN5: counting carefully, there are 4. Also made a note of the total number of data points; for dCIN5, there were 20.
  14. Below the header dCIN5_ss_t15, typed =SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 and hit enter.
    • The COUNTA function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values).
    • The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
    • The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
    • Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
  15. Repeated this computation for the t30 through t120 data points. Used the data range for each time point, typed the right number of data points, took the average from the appropriate cell for each time point, and copied the formula to the whole column for each computation.
  16. In the first column to the right of dCIN5_ss_t120, created the column header dCIN5_SS_full.
  17. In the first row below this header, typed =SUM(<range of cells containing "ss" for each timepoint>) and pressed enter.
  18. In the next two columns to the right, created the headers dCIN5_Fstat and dCIN5_p-value.
  19. Recalled the number of data points from (13): called that total n.
  20. In the first cell of the dCIN5_Fstat column, typed =((n-5)/5)*(<dCIN5_ss_HO>-<dCIN5_SS_full>)/<dCIN5_SS_full> and pressed enter. (A sketch of this statistic appears after this list.)
  21. Did not actually type the n but instead used the number from (13). Also note that "5" is the number of timepoints.
    • Replaced the phrase dCIN5_ss_HO with the cell designation.
    • Replaced the phrase <dCIN5_SS_full> with the cell designation.
    • Copied to the whole column.
  22. In the first cell below the dCIN5_p-value header, typed =FDIST(<dCIN5_Fstat>,5,n-5), replacing the phrase <dCIN5_Fstat> with the cell designation and the "n" as in (13) with the total number of data points. Copied to the whole column.
  23. Before moving on to the next step, performed a quick sanity check to see if all of these computations were done correctly.
    • Clicked on cell A1 and clicked on the Data tab. Selected the Filter icon (it looks like a funnel). Little drop-down arrows appeared at the top of each column. This enabled us to filter the data according to criteria we set.
    • Clicked on the drop-down arrow on the dCIN5_p-value column, selected "Number Filters", and set a criterion that filtered the data so that the p value has to be less than 0.05.
    • Excel then displayed only the rows that meet that filtering criterion. A number appeared in the lower left-hand corner of the window giving the number of rows that meet the criterion. We checked our results with each other to make sure that the computations were performed correctly.
    • Undid any filters that had been applied before making any additional calculations.
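
For reference, the F statistic computed in steps (14) through (22) compares a full model (a separate mean log fold change at each of the 5 timepoints) against the null model that every mean is zero. Below is a minimal Python sketch of the same per-gene computation; it is not part of the assignment protocol, the dictionary replicates is a hypothetical stand-in for one gene's log fold change data, and scipy's f.sf gives the same right-tail probability as Excel's FDIST.

 # Minimal sketch of the per-gene ANOVA from this section (assumes scipy).
 from scipy.stats import f as f_dist
 
 def gene_anova(replicates):
     """replicates: dict mapping timepoint -> list of logFC values for one gene."""
     T = len(replicates)                           # number of timepoints (5 here)
     n = sum(len(v) for v in replicates.values())  # total data points (20 for dCIN5)
     ss_h0 = sum(x * x for v in replicates.values() for x in v)  # =SUMSQ(all logFC)
     ss_full = sum(sum(x * x for x in v) - len(v) * (sum(v) / len(v)) ** 2
                   for v in replicates.values())   # =SUMSQ(...) - COUNTA(...)*Avg^2
     f_stat = ((n - T) / T) * (ss_h0 - ss_full) / ss_full
     p_value = f_dist.sf(f_stat, T, n - T)         # right tail, like =FDIST(F,5,n-5)
     return f_stat, p_value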

Calculated the Bonferroni p value Correction

  1. Performed adjustments to the p value to correct for the multiple testing problem. Labeled the next two columns to the right with the same label, dCIN5_Bonferroni_p-value.
  2. Typed the equation =<dCIN5_p-value>*6189. Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column. (A sketch of this correction appears after this list.)
  3. Replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dCIN5_Bonferroni_p-value header: =IF(dCIN5_Bonferroni_p-value>1,1,dCIN5_Bonferroni_p-value), where "dCIN5_Bonferroni_p-value" referred to the cell in which the first Bonferroni p value computation was made. Used the Step (7) trick to copy the formula throughout the column.
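
The Bonferroni correction in the two steps above amounts to multiplying each unadjusted p value by the 6189 tests and capping the result at 1. A minimal Python sketch of the same calculation:

 # Sketch of the Bonferroni correction performed in the worksheet.
 def bonferroni(p_values, m=6189):
     return [min(p * m, 1.0) for p in p_values]
 
 # Example with NSR1's unadjusted p value from this dataset:
 # bonferroni([6.38e-8]) -> [0.000394...], matching the 0.00039 reported below.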

Calculated the Benjamini & Hochberg p value Correction

  1. Inserted a new worksheet named "dCIN5_ANOVA_B-H".
  2. Copied and pasted the "MasterIndex", "ID", and "Standard Name" columns from the previous worksheet into the first three columns of the new worksheet.
  3. Copied the unadjusted p values from the ANOVA worksheet and pasted them into Column D using Paste special > Paste values.
  4. Selected all of columns A, B, C, and D. Sorted by ascending values on Column D: clicked the A-to-Z sort button on the toolbar and, in the window that appeared, sorted by Column D, smallest to largest.
  5. Typed the header "Rank" in cell E1. Created a series of numbers in ascending order from 1 to 6189 in this column; this is the p value rank, smallest to largest. Typed "1" into cell E2 and "2" into cell E3. Selected both cells E2 and E3, then double-clicked on the plus sign in the lower right-hand corner of the selection to fill the column with the series from 1 to 6189.
  6. Calculated the Benjamini and Hochberg p value correction. Typed dCIN5_B-H_p-value in cell F1. Typed the following formula in cell F2: =(D2*6189)/E2 and pressed enter. Copied that equation to the entire column. (A sketch of this correction appears after this list.)
  7. Typed "dCIN5 _B-H_p-value" into cell G1.
  8. Typed the following formula into cell G2: =IF(F2>1,1,F2) and pressed enter. Copied that equation to the entire column.
  9. Selected columns A through G. Sorted them by MasterIndex in Column A in ascending order.
  10. Copied column G and used Paste special > Paste values to paste it into the next column on the right of the ANOVA sheet.
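
The worksheet recipe above computes each adjusted value as p × 6189 / rank, capped at 1. A minimal Python sketch of the same calculation (note that the textbook Benjamini-Hochberg step-up procedure additionally enforces that adjusted values never decrease with rank, a refinement this spreadsheet recipe omits):

 # Sketch of the rank-based B-H computation from the worksheet.
 def benjamini_hochberg(p_values):
     m = len(p_values)
     order = sorted(range(m), key=lambda i: p_values[i])  # ascending, like Column D
     adjusted = [0.0] * m
     for rank, i in enumerate(order, start=1):            # rank = Column E
         adjusted[i] = min(p_values[i] * m / rank, 1.0)   # =(D2*6189)/E2, capped at 1
     return adjusted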

Zipped and uploaded the .xlsx file just created to the wiki.

Sanity Check: Number of genes significantly changed

  1. Went to our dCIN5_ANOVA worksheet.
  2. Selected row 1 and chose the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled us to filter the data according to criteria we set.
  3. We clicked on the drop-down arrow for the unadjusted p value and set a criterion that filtered the data so that the p value is less than 0.05. These results are also reported in the slide.
    • How many genes have p < 0.05, and what is the percentage (out of 6189)? 2290 (37.0%)
    • How many genes have p < 0.01, and what is the percentage (out of 6189)? 1380 (22.3%)
    • How many genes have p < 0.001, and what is the percentage (out of 6189)? 691 (11.2%)
    • How many genes have p < 0.0001, and what is the percentage (out of 6189)? 358 (5.8%)
  4. We created a new worksheet in our workbook to record the answers to these questions so that we could write a formula in Excel to automatically calculate the percentages.
  5. When we used a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
  6. We have just performed 6189 hypothesis tests. Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or about 309 times. Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed; however, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, we filtered the data to determine the following (the counting arithmetic is sketched after this list):
    • How many genes are p < 0.05 for the Bonferroni-corrected p value, and what is the percentage (out of 6189)? 151 (2.4%)
    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value, and what is the percentage (out of 6189)? 1453 (23.5%)
  7. In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  8. We uploaded our slide to the wiki, including our data on the percentage of genes.
    • Since the wild type data is being analyzed by one of the groups in the class, it will be sufficient for this week to supply just the data for our strain. We will do the comparison with wild type at a later date.
    • Found NSR1 in our dataset and recorded the p value for each field below.
      • Unadjusted: 6.38E-8
      • Bonferroni-corrected: 0.00039
      • B-H-corrected: 2.192E-5
      • Average log fold change: t15: 4.07, t30: 3.61, t60: 4.30, t90: -2.90, t120: -0.931
      • NSR1 changes expression due to cold shock in this experiment.
    • Favorite Gene: CMR2
      • Unadjusted: 0.117
      • Bonferroni-corrected: 723.7 (capped at 1 in the corrected column)
      • B-H-corrected: 0.240
      • Average log fold change: t15: -0.477, t30: -0.377, t60: -0.834, t90: 0.096, t120: -0.244
      • CMR2 does not show a significant change in gene expression due to cold shock in this experiment (unadjusted p = 0.117 > 0.05).
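
The tallies and percentages above can be reproduced programmatically. A minimal Python sketch, where p_values is a hypothetical stand-in for the unadjusted dCIN5_p-value column:

 # Sketch of the sanity-check arithmetic: genes under each cutoff, the
 # percentage out of 6189, and the hits expected by chance alone.
 TOTAL_GENES = 6189
 
 def summarize(p_values, cutoffs=(0.05, 0.01, 0.001, 0.0001)):
     for cutoff in cutoffs:
         hits = sum(p < cutoff for p in p_values)
         print(f"p < {cutoff}: {hits} genes ({100 * hits / TOTAL_GENES:.1f}%), "
               f"~{cutoff * TOTAL_GENES:.0f} expected by chance")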

Clustering and GO Term Enrichment with STEM (Part 2)

Prepared the microarray data file for loading into STEM.

    • Inserted a new worksheet into the Excel workbook and named it "dCIN5_stem".
    • Selected all of the data from the "dCIN5_ANOVA" worksheet and used Paste special > Paste values to paste it into the "dCIN5_stem" worksheet.
    • The leftmost column had the column header "Master_Index"; renamed this column to "SPOT". Column B was named "ID"; renamed this column to "Gene Symbol". Deleted the column named "Standard_Name".
    • Filtered the data on the B-H corrected p value to be > 0.05 (that's greater than in this case).
    • Once the data was filtered, selected all of the displayed rows (except for the header row) and deleted them by right-clicking and choosing "Delete Row" from the context menu. Undid the filter. This ensured that we clustered only the genes with a "significant" change in expression and not the noise.
    • Deleted all of the data columns EXCEPT for the Average Log Fold change columns for each timepoint (for example, wt_AvgLogFC_t15, etc.).
    • Renamed the data columns with just the time and units (for example, 15m, 30m, etc.).
    • Saved the work, then used Save As to save this spreadsheet as Text (Tab-delimited) (*.txt). Clicked OK through the warnings and closed the file. (A scripted version of this preparation is sketched below.)
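
The same filtering and renaming can be scripted. A minimal pandas sketch, assuming the workbook, sheet, and column names used above (in particular, the B-H column pasted into the ANOVA sheet is assumed to be headed dCIN5_B-H_p-value):

 # Sketch of the STEM input preparation done by hand in Excel above.
 import pandas as pd
 
 df = pd.read_excel("DM_dCIN5.xlsx", sheet_name="dCIN5_ANOVA")
 df = df[df["dCIN5_B-H_p-value"] <= 0.05]  # drop genes whose B-H p value is > 0.05
 keep = {"MasterIndex": "SPOT", "ID": "Gene Symbol",
         "dCIN5_AvgLogFC_t15": "15m", "dCIN5_AvgLogFC_t30": "30m",
         "dCIN5_AvgLogFC_t60": "60m", "dCIN5_AvgLogFC_t90": "90m",
         "dCIN5_AvgLogFC_t120": "120m"}
 df = df[list(keep)].rename(columns=keep)
 df.to_csv("dCIN5_stem.txt", sep="\t", index=False)  # tab-delimited file for STEM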

Downloaded and extracted the STEM software from the STEM web site.

    • Clicked on the download link and downloaded the stem.zip file to the Desktop.
    • Unzipped the file; this created a folder called stem.
    • Downloaded the Gene Ontology and yeast GO annotations and placed them in this folder.
    • Downloaded "gene_ontology.obo".
    • Downloaded "gene_association.sgd.gz".
    • Inside the folder, double-clicked on the stem.jar to launch the STEM program.

Ran STEM

    • In section 1 (Expression Data Info) of the main STEM interface window, clicked on the Browse... button to navigate to and select the file.
    • Clicked on the radio button No normalization/add 0.
    • Checked the box next to Spot IDs included in the data file.
    • In section 2 (Gene Info) of the main STEM interface window, left the default selection for the three drop-down menu selections for Gene Annotation Source, Cross Reference Source, and Gene Location Source as "User provided".
    • Clicked the "Browse..." button to the right of the "Gene Annotation File" item. Browse to your "stem" folder and select the file "gene_association.sgd.gz" and click Open.
    • In section 3 (Options) of the main STEM interface window, made sure that the Clustering Method said "STEM Clustering Method" and did not change the defaults for Maximum Number of Model Profiles or Maximum Unit Change in Model Profiles between Time Points.
    • In section 4 (Execute) clicked on the yellow Execute button to run STEM.
      • Re-opened the file and opened the Find/Replace dialog. Searched for #DIV/0! but left the replace field empty, then clicked "Replace all" to remove the #DIV/0! errors. Saved the file and ran STEM again. (This cleanup is sketched below.)
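
A minimal Python sketch of the same cleanup, stripping the error tokens from the tab-delimited file before re-running STEM:

 # Sketch: remove Excel's #DIV/0! tokens from the STEM input file.
 with open("dCIN5_stem.txt") as src:
     text = src.read()
 with open("dCIN5_stem.txt", "w") as dst:
     dst.write(text.replace("#DIV/0!", ""))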
(Figure: STEM clustering results, screenshot STEM Results.PNG)

Data & Files

DM_dCIN5 Sanity Check

DM_dCIN5 text

DM_dCIN5.xlsx

Conclusion

Several statistical analyses were performed to determine whether the gene expression changes in strain dCIN5 were significantly different from zero, based on the calculated p-values. The analysis so far shows that many of the genes in dCIN5 changed their expression after cold shock, as seen in their average log fold changes over time. Although the analysis is not yet finished, the p-values calculated so far have shown that 2290 of the genes exhibited an unadjusted p-value less than 0.05, indicating a significant difference from zero for at least one time point.

Acknowledgements

  • I worked with my homework group Ivy, Mihir, and Emma this week in class to complete this assignment. We talked about the assignment in class and texted about it as well if anyone needed help.
  • "Except for what is noted above, this individual journal entry was completed by me and not copied from another source."
  • Dmadere (talk) 00:09, 24 October 2019 (PDT)
