==Electronic Notebook==

==Microarray Data Analysis==
#I downloaded the starting Excel spreadsheet from Box.  
# I changed the filename so that it contained my initials.
# I began by recording in my wiki the strain comparison and individual dataset that I analyzed, the filename, and the number of replicates for each strain and each time point in my data: This week I analyzed the wild type dataset (file name: wildtype_analysis_ET.xslx). <span style="color:red">This dataset has 5 time points: 15 min (4 replicates), 30 min (5 replicates), 60 min (4 replicates), 90 min (5 replicates), and 120 min (5 replicates).</span>
# I deleted the data columns of the strains that I was not analyzing (all except wild type) so that my file contained only the data for the strain I was working with.
# I replaced cells that had "NA" in them (which indicates missing data) with an empty cell by using the keyboard shortcut Control+F to open the "Find" dialog box and selecting the "Replace" tab.
#Type "NA" in the Search field and don't type anything in the "Replace" field. Click the button "Replace all" (4109 replacements for NA were made).
+
# I typed "NA" in the search field and didn't type anything in the "Replace" field. I clicked the button "Replace all" <span style="color:red">4109 replacements for NA were made</span>
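
For anyone reproducing this cleanup outside of Excel, here is a minimal sketch of the same step in Python/pandas. It assumes only the workbook named above and the "Master_Sheet" worksheet referenced in Part 1; the explicit engine argument is a guess to cope with the workbook's unusual extension.

<pre>
import pandas as pd

# Load the wild-type workbook, treating the literal string "NA" as missing
# data: the script equivalent of the Find/Replace step above.
df = pd.read_excel("wildtype_analysis_ET.xslx", sheet_name="Master_Sheet",
                   na_values=["NA"], engine="openpyxl")

# The count of missing cells should include the 4109 cells that held "NA".
print(df.isna().sum().sum())
</pre>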
 
  
 
===Part 1: Statistical Analysis===
# I created a new worksheet, naming it "wt_ANOVA".
# I copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for wt and pasted them into my new worksheet.  I then copied the columns containing the data for my strain and pasted them into my new worksheet.
# At the top of the first column to the right of my data, I created five column headers of the form wt_AvgLogFC_(TIME) where (TIME) is 15, 30, etc.
# In the cell below the wt_AvgLogFC_t15 header, I typed <code>=AVERAGE(</code>
# I then highlighted all the data in row 2 associated with wt and t15, pressed the closing paren key (shift 0), and pressed the "enter" key: <code>=AVERAGE(D2:G2)</code>
# This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
# I clicked on this cell and positioned my cursor at the bottom right corner. I saw my cursor change to a thin black plus sign (not a chubby white one). When it did, I double-clicked, and the formula was copied to the entire column of 6188 other genes.
# I repeated steps (4) through (8) with the t30, t60, t90, and the t120 data.
# Now in the first empty column to the right of the wt_AvgLogFC_t120 calculation, I created the column header wt_ss_HO.
# In the first cell below this header, I typed <code>=SUMSQ(</code>
# I highlighted all the LogFC data in row 2 for my wt (but not the AvgLogFC), pressed the closing parenthesis key (shift 0), and pressed the "enter" key: <code>=SUMSQ(D2:Z2)</code>
# In the next empty column to the right of wt_ss_HO, I created the column headers wt_ss_(TIME) as in (3).
# I made a note of how many data points I had at each time point for my strain. For wild type it was "4" or "5". The total number of data points for wt was 23.
# In the first cell below the header wt_ss_t15, I typed <code>=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2</code> and hit enter: <code>=SUMSQ(D2:G2)-COUNTA(D2:G2)*AA2^2</code> (see the note after this list for the identity this formula uses).
#* The <code>COUNTA</code> function counted the number of cells in the specified range that had data in them (i.e., did not count cells with missing values).
#* The phrase <range of cells for logFC_t15> was replaced by the data range associated with t15 (D2:G2).
#* The phrase <AvgLogFC_t15> was replaced by the cell number in which I computed the AvgLogFC for t15, and the "^2" squares that value.
#* Upon completion of this single computation, I used the Step (7) trick to copy the formula throughout the column.
# I repeated this computation for the t30 through t120 data points.  Again, I made sure to use the data for each time point, the right number of data points, and the average from the appropriate cell for each time point, and I copied the formula to the whole column for each computation.
# In the first column to the right of wt_ss_t120, I created the column header wt_SS_full.
# In the first row below this header, I typed <code>=sum(<range of cells containing "ss" for each timepoint>)</code> and hit enter: <code>=sum(AG2:AK2)</code>
# In the next two columns to the right, I created the headers wt_Fstat and wt_p-value.
# I recalled the number of data points from (13) and called that total n. (n=23 for wt)
# In the first cell of the wt_Fstat column, I typed <code>=((n-5)/5)*(<wt_ss_HO>-<wt_SS_full>)/<wt_SS_full></code> and hit enter: <code>=((23-5)/5)*(AF2-AL2)/AL2</code> (the statistic behind this formula is explained in the note after this list).
#* I didn't actually type the n but instead used the number from (13). That is, 23.
#* I replaced the phrase wt_ss_HO with the cell designation.
#* I replaced the phrase <wt_SS_full> with the cell designation.
#* I copied the formula to the whole column.
# In the first cell below the wt_p-value header, I typed <code>=FDIST(<wt_Fstat>,5,23-5)</code> and copied it to the whole column: <code>=FDIST(AM2,5,23-5)</code>
# Before I moved on to the next step, I performed a quick sanity check to see if I did all of these computations correctly.
#* I clicked on cell A1 and clicked on the Data tab.  I selected the Filter icon (which looks like a funnel). Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
#* I clicked on the drop-down arrow on my wt_p-value column. I selected "Number Filters". In the window that appeared, I set a criterion that filtered my data so that the p value had to be less than 0.05.
#* Excel now only displayed the rows that corresponded to data meeting that filtering criterion.  A number appeared in the lower left hand corner of the window giving me the number of rows that met that criterion.
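
For reference, the formulas above have a compact statistical reading. The step 14 pattern <code>=SUMSQ(range)-COUNTA(range)*<Avg>^2</code> is the usual shortcut for a sum of squared deviations about the mean:

<math>\sum_{i=1}^{n_t}\left(x_i-\bar{x}_t\right)^2=\sum_{i=1}^{n_t}x_i^2-n_t\,\bar{x}_t^2</math>

and the step 20 formula is an F statistic with 5 and n-5 degrees of freedom, comparing the null model in which every timepoint mean is zero (wt_ss_HO) against the full model with a separate mean per timepoint (wt_SS_full):

<math>F=\frac{\left(SS_{H_0}-SS_{\mathrm{full}}\right)/5}{SS_{\mathrm{full}}/(n-5)}=\frac{n-5}{5}\cdot\frac{SS_{H_0}-SS_{\mathrm{full}}}{SS_{\mathrm{full}}}</math>

<code>FDIST</code> then returns the right-tail probability of that F value; for n = 23 this uses (5, 18) degrees of freedom.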
  
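The same per-gene computation as a Python sketch. The replicate values here are toy numbers; only the replicate counts (4, 5, 4, 5, 5) and the formulas come from this notebook.

<pre>
import numpy as np
from scipy import stats

# Toy log fold change replicates for one gene, grouped by timepoint. Missing
# ("NA") values would simply be left out of a group, which is what the
# COUNTA term in the spreadsheet accounts for.
groups = [
    np.array([3.1, 3.4, 3.2, 3.5]),            # t15, 4 replicates
    np.array([3.7, 3.5, 3.6, 3.8, 3.5]),       # t30, 5 replicates
    np.array([3.4, 3.6, 3.5, 3.3]),            # t60, 4 replicates
    np.array([-2.0, -2.1, -1.9, -2.2, -2.0]),  # t90, 5 replicates
    np.array([-0.6, -0.5, -0.7, -0.6, -0.6]),  # t120, 5 replicates
]
x = np.concatenate(groups)
n = x.size                                     # 23 data points, as in step 13

avg = [g.mean() for g in groups]               # the wt_AvgLogFC_(TIME) values
ss_h0 = np.sum(x ** 2)                         # wt_ss_HO: =SUMSQ(all data)
ss_full = sum(((g - g.mean()) ** 2).sum() for g in groups)  # wt_SS_full

f_stat = ((n - 5) / 5) * (ss_h0 - ss_full) / ss_full        # wt_Fstat
p_value = stats.f.sf(f_stat, 5, n - 5)         # right tail, like Excel FDIST
print(f_stat, p_value)
</pre>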
===Calculating the Bonferroni p value Correction===
# Now I performed adjustments to the p value to correct for the [https://xkcd.com/882/ multiple testing problem].  I labeled the next two columns to the right with the same label, wt_Bonferroni_p-value.
# I typed the equation <code>=<wt_p-value>*6189</code>. Upon completion of this single computation, I used the Step (10) trick to copy the formula throughout the column: <code>=AN2*6189</code> (a script version of this correction appears after this list).
# I replaced any corrected p value that was greater than 1 by the number 1 by typing the following formula into the first cell below the second wt_Bonferroni_p-value header: <code>=IF(wt_Bonferroni_p-value>1,1,wt_Bonferroni_p-value)</code>, where "wt_Bonferroni_p-value" referred to the cell in which the first Bonferroni p value computation was made.  I used the Step (10) trick to copy the formula throughout the column: <code>=IF(AO2>1,1,AO2)</code>
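
A minimal script sketch of this correction in Python/numpy; the toy array stands in for the 6189 unadjusted p values in the wt_p-value column.

<pre>
import numpy as np

# Toy stand-in for the unadjusted ANOVA p values.
p = np.array([1e-6, 0.004, 0.2])
m = 6189                             # number of hypothesis tests (genes)

# Multiply by the number of tests, then cap at 1, as in the IF() step above.
bonferroni = np.minimum(p * m, 1.0)
print(bonferroni)
</pre>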
  
===Calculating the Benjamini & Hochberg p value Correction===
# I inserted a new worksheet named "wt_ANOVA_B-H".
# Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first two columns of the new worksheet.  
+
# I copied and pasted the "MasterIndex", "ID", and "Standard Name" columns from my previous worksheet into the first two columns of the new worksheet.  
# For the following, I used Paste special > Paste values.  I copied my unadjusted p values from my ANOVA worksheet and pasted them into Column D.
# I selected all of columns A, B, C, and D. I sorted by ascending values on Column D. I clicked the sort button from A to Z on the toolbar, and in the window that appeared, I sorted by column D, smallest to largest.
# Type the header "Rank" in cell E1.  We will create a series of numbers in ascending order from 1 to 6189 in this column.  This is the p value rank, smallest to largest.  Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
+
# I typed the header "Rank" in cell E1.  I created a series of numbers in ascending order from 1 to 6189 in this column.  This was the p value rank, smallest to largest.  I typed "1" into cell E2 and "2" into cell E3. I selected both cells E2 and E3. I then double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 6189.
# I then calculated the Benjamini and Hochberg p value correction. I typed wt_B-H_p-value in cell F1. In cell F2, I typed the following formula: <code>=(D2*6189)/E2</code> and pressed enter. I copied that equation to the entire column (a script version of this correction appears after this list).
# Type "wt_B-H_p-value" into cell G1.  
+
# I typed "wt_B-H_p-value" into cell G1.  
# I typed the following formula into cell G2: <code>=IF(F2>1,1,F2)</code> and pressed enter. I copied that equation to the entire column.
# I selected columns A through G.  I then sorted them by my MasterIndex in Column A in ascending order.
# I copied column G and used Paste special > Paste values to paste it into the next column on the right of my ANOVA sheet.
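
The same correction as a Python/numpy sketch, mirroring the sort/rank/divide steps above; the toy array stands in for the 6189 unadjusted p values in MasterIndex order.

<pre>
import numpy as np

p = np.array([1e-6, 0.03, 0.2, 0.004])   # toy unadjusted p values
m = len(p)                               # 6189 in the actual dataset

order = np.argsort(p)                    # step 4: sort ascending
rank = np.empty(len(p), dtype=int)
rank[order] = np.arange(1, len(p) + 1)   # step 5: rank 1..n, smallest first

# Steps 6-8: (p * m) / rank, capped at 1. Indexing by original position means
# the result is already back in MasterIndex order (steps 9-10).
bh = np.minimum(p * m / rank, 1.0)
print(bh)
</pre>

Note that this mirrors the spreadsheet steps exactly; the usual Benjamini-Hochberg adjustment additionally enforces monotonicity across ranks, which the protocol above does not do.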
====File====
[[Media:Wt_Microarraydata_ET.zip]]
  
 
===Sanity Check: Number of genes significantly changed===
Before moving on to further analysis of the data, I wanted to perform a more extensive sanity check to make sure that I performed my data analysis correctly.  I determined the number of genes that were significantly changed at various p value cut-offs.
  
* I went to my wt_ANOVA worksheet.
* I selected row 1 (the row with my column headers) and selected the menu item Data > Filter > Autofilter (the funnel icon on the Data tab).  Little drop-down arrows appeared at the top of each column.  This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow for the unadjusted p value.  I set a criterion that filtered my data so that the p value had to be less than 0.05 (a script version of these counts appears after this list).
** '''''How many genes have p < 0.05?  and what is the percentage (out of 6189)?''''' <span style="color:red">2528; 40.85%</span>
** '''''How many genes have p < 0.01? and what is the percentage (out of 6189)?''''' <span style="color:red">1652; 26.70%</span>
** '''''How many genes have p < 0.001? and what is the percentage (out of 6189)?''''' <span style="color:red">919; 14.85%</span>
** '''''How many genes have p < 0.0001? and what is the percentage (out of 6189)?''''' <span style="color:red">496; 8.01%</span>
* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* I have just performed 6189 hypothesis tests.  Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or 309 times.  Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed.  However, we don't know ''which'' ones.  To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values.  The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent.  To see this relationship, I filtered my data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?''''' <span style="color:red">248; 4.01%</span>
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?''''' <span style="color:red">1822; 29.44%</span>
* In summary, the p value cut-off should not be thought of as some magical number at which data become "significant".  Instead, it is a moveable confidence level.  If we want to be very confident of our data, we use a small p value cut-off.  If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
* We will compare the numbers we get between the wild type strain and the other strains studied, organized as a table.  I used this [[Media:BIOL367_F17_sample_p-value_slide.pptx | sample PowerPoint slide]] to see how my table should be formatted and uploaded my slide to the wiki.
** Note that since the wild type data is being analyzed by one of the groups in the class, it will be sufficient for this week to supply just the data for your strain.  We will do the comparison with wild type at a later date.
* Comparing results with known data: the expression of the gene ''NSR1'' (ID: YGR159C) is known to be induced by cold shock. '''''Find ''NSR1'' in your dataset.''''' <span style="color:red">Unadjusted p-value: 2.86901E-10, Bonferroni-corrected p-value: 1.77563E-06, and B-H-corrected p-value: 7.52066E-10. Average log fold change at t15=3.279, t30=3.621, t60=3.527, t90=-2.050, t120=-0.6062.</span> Note that the average log fold change is what we called "wt_AvgLogFC_(TIME)" in step 3 of the ANOVA analysis. Does ''NSR1'' change expression due to cold shock in this experiment? <span style="color:red">Since the log fold change is a log2 value, a log fold change greater than or equal to 1, or less than or equal to -1, indicates at least a 2-fold difference in expression (meaning the gene is changing expression). Because the log fold change meets these criteria at t15 (3.279), t30 (3.621), t60 (3.527), and t90 (-2.050), the gene definitely changes expression. Because ''NSR1'' has a Bonferroni-corrected p value < 0.05, it is among the most significantly changed genes in the dataset, since this criterion is the most selective (therefore, we have a lot of confidence that expression of this gene is changing).</span>
* For fun, I found my favorite gene, ''ADA2'' (from my web page), in the dataset. <span style="color:red">Unadjusted p-value: 0.008083323, Bonferroni-corrected p-value: 1, and B-H-corrected p-value: 1. Average log fold change at t15=-1.063078929, t30=-0.44188769, t60=-0.999519175, t90=-1.265472205, t120=-0.479548402. My favorite gene changes expression due to cold shock because t15=-1.063078929 and t90=-1.265472205 (both less than -1). Unlike ''NSR1'', which has a Bonferroni-corrected p value < 0.05, the unadjusted p value of ''ADA2'' is ~0.008, which becomes greater than 0.05 in both the Bonferroni and B-H corrections (less confidence that expression of this gene is changing). So, comparing ''NSR1'' to ''ADA2'', ''NSR1'' shows more significant changes in expression.</span>
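
A sketch of these look-ups in Python/pandas, using a two-row stand-in for the wt_ANOVA worksheet. Only the two p values shown are from this notebook; the column names are assumptions based on the headers above.

<pre>
import pandas as pd

# Two-row stand-in for the wt_ANOVA worksheet (6189 rows in the real sheet).
anova = pd.DataFrame({
    "Standard Name": ["NSR1", "ADA2"],
    "wt_p-value": [2.86901e-10, 0.008083323],
})
n_genes = 6189

# Count genes passing each unadjusted p value cut-off, as with the filters.
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    count = int((anova["wt_p-value"] < cutoff).sum())
    print(f"p < {cutoff}: {count} genes ({100 * count / n_genes:.2f}%)")

# Look up a single gene, e.g. NSR1:
print(anova[anova["Standard Name"] == "NSR1"])
</pre>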
====PowerPoint Slide====
[[Media:WT_p-value_slide_ET.pptx]]
  
 
===Summary Paragraph===
In this week's assignment, microarray data for wild type ''Saccharomyces cerevisiae'' (with ratios that had previously been normalized using a statistics package) were analyzed. The data included the log fold change of the red/green ratio at different time points: t15, t30, and t60 (cold shock at 13°C), and t90 and t120 (cold shock at 13°C followed by 30 or 60 minutes of recovery at 30°C). Multiple measurements were taken at each time point, and statistical analysis was performed on the data to determine the effects of cold shock on gene expression. The analysis included calculation of the unadjusted, Bonferroni-corrected, and B-H-corrected p values. Changes in gene expression in response to cold shock were recorded for 40.9%, 26.7%, 14.9%, and 8.0% of genes at cut-offs of p < 0.05, p < 0.01, p < 0.001, and p < 0.0001, respectively. Based on these data, fewer than half of the genes examined are changing expression, even at the loosest cut-off (40.9% at p < 0.05); fewer still are said to be changing expression at p < 0.01, and only 8.0% of genes are said to be changing expression at p < 0.0001. The Bonferroni-corrected (most stringent) and B-H-corrected (less stringent) p values were calculated to determine how significantly some genes were affected over others (4.01% and 29.44% of genes, respectively, displayed a change in expression at p < 0.05 after correction). These p values measure our confidence in a change in gene expression: a smaller p value cut-off corresponds to higher confidence, while a larger cut-off corresponds to lower confidence. When comparing the ''NSR1'' data to ''ADA2'' (my favorite gene), I concluded that ''ADA2'' was not affected much by cold shock, given its small log fold changes across time points (which remained negative throughout) and its corrected p values of 1.
  
 
==Acknowledgements==
#I worked with my homework partner [[User:cazinge|Eddie Azinge]] outside of class. We met face-to-face to compare our data.
#I thank [[User:Kdahlquist|Dr. Dahlquist]] for teaching and assisting us with the data analysis.
#I used Microsoft Excel for the statistical analysis.
#I copied and modified the instructions from the [[Week 8]] assignment page.
#I analyzed Dr. Dahlquist's microarray data. The Excel file was downloaded from the [[Week 8]] assignment page.
#While I worked with the people noted above, this individual journal entry was completed by me and not copied from another source.
 
[[User:Emmatyrnauer|Emmatyrnauer]] ([[User talk:Emmatyrnauer|talk]]) 15:52, 23 October 2017 (PDT)


==References==
LMU BioDB 2017. (2017). Week 8. Retrieved October 23, 2017, from https://xmlpipedb.cs.lmu.edu/biodb/fall2017/index.php/Week_8
