Emilysimso Week 8
Part One
Directions
- Taken from Sample Minds Part 1
To scale and center the data (between chip normalization) I performed the following operations:
- Inserted a new Worksheet into my Excel file, and name it "scaled_centered".
- Went back to the "compiled_raw_data" worksheet, Select All and Copy. Went to my new "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1) and Paste.
- Inserted two rows in between the top row of headers and the first data row.
- In cell A2, typed "Average" and in cell A3, typed "StdDev".
- Computed the Average log ratio for each chip (each column of data). In cell B2, typed the following equation:
=AVERAGE(B4:B5224)
- and pressed "Enter".
- Computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, typed the following equation:
=STDEV(B4:B5224)
- and pressed "Enter".
Copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns.
Copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. Edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.
- In cell N4, typed the following equation:
=(B4-B$2)/B$3
Copied and pasted this equation into the entire column.
Copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header.
I then performed statistical analysis on the ratio
- Inserted a new worksheet and named it "statistics".
- Went back to the "scaling_centering" worksheet and copied the first column ("ID").
- Pasted the data into the first column of my new "statistics" worksheet.
- Went back to the "scaling_centering" worksheet and copied the columns that were designated "_scaled_centered".
- Went back to my new worksheet and clicked on the B1 cell. Selected "Paste Special" from the Edit menu. A window opened: clicked on the radio button for "Values" and clicked OK.
- Deleted Rows 2 and 3 where it said "Average" and "StDev" so that my data rows with gene IDs were immediately below the header row 1.
- Created a new column on the right of your worksheet. Typed the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
- Computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(B2:E2)
- into cell N2. Copied this equation and pasted it into the rest of the column.
Created the equation for patients B and C and pasted it into their respective columns.
Typed the header "Avg_LogFC_all" into the first cell in the next empty column. Created the equation that computed the average of the three previous averages I calculated and pasted it into this entire column.
Inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step. Labeled the column "Tstat". This computed a T statistic that told me whether the scaled and centered average log ratio were significantly different than 0 (no change). Entered the equation:
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
(NOTE: in this case the number of replicates is 3) Copied the equation and pasted it into all rows in that column.
Labeled the top cell in the next column "Pvalue". In the cell below the label, entered the equation:
=TDIST(ABS(R2),degrees of freedom,2)
The number of degrees of freedom is the number of replicates minus one, so in my case there were 2 degrees of freedom. Copied the equation and pasted it into all rows in that column.
Calculated the Bonferroni p value Correction
Labeled the next two columns to the right with the same label, Bonferroni_Pvalue. Typed the equation
=S2*5221
Upon completion of this single computation, used the trick to copy the formula throughout the column.
Replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header:
=IF(T2>1,1,T2)
Used the trick to copy the formula throughout the column.
Calculated the Benjamini & Hochberg p value Correction
- Inserted a new worksheet named "B-H_Pvalue".
- Copied and pasted the "ID" column from my previous worksheet into the first column of the new worksheet.
- Inserted a new column on the very left and named it "MasterIndex".
- Typed a "1" in cell A2 and a "2" in cell A3.
- Selected both cells. Hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. Double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
- For the following, used Paste special > Paste values. Copied my unadjusted p values from my previous worksheet and pasted it into Column C.
- Selected all of columns A, B, and C. Sorted by ascending values on Column C. Clicked the sort button from A to Z on the toolbar, in the window that appears, sorted by column C, smallest to largest.
- Typed the header "Rank" in cell D1. Typed "1" into cell D2 and "2" into cell D3. Selected both cells D2 and D3. Double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
- Typed B-H_Pvalue in cell E1. Typed the following formula in cell E2:
=(C2*5221)/D2
- and pressed enter. Copied that equation to the entire column.
- Typed "B-H_Pvalue" into cell F1.
- Typed the following formula into cell F2:
=IF(E2>1,1,E2)
- and pressed enter. Copied that equation to the entire column.
- Selected columns A through F. Sorted them by your MasterIndex in Column A in ascending order.
- Copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
Prepared file for GenMAPP
- Inserted a new worksheet and named it "forGenMAPP".
- Went to the "statistics" worksheet and hit Select All and Copy.
- Went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
- Selected Columns B through Q (all the fold changes). Selected the menu item Format > Cells. Under the number tab, selected 2 decimal places. Clicked OK.
- Selected all the columns containing p values. Selected the menu item Format > Cells. Under the number tab, selected 4 decimal places. Clicked OK.
- Deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
- Inserted a column to the right of the "ID" column. Typed the header "SystemCode" into the top cell of this column. Filled the entire column (each cell) with the letter "N".
- Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
- Uploaded both the .xls and .txt files that I created to my journal page in the class wiki.
Sanity Check: Number of genes significantly changed
- Opened my spreadsheet and went to the "forGenMAPP" tab.
- Clicked on cell A1 and selected the menu item Data > Filter > Autofilter.
- Clickd on the drop-down arrow on your "Pvalue" column. Selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue has to be less than 0.05.
- How many genes have p value < 0.05? and what is the percentage (out of 5221)?
- What about p < 0.01? and what is the percentage (out of 5221)?
- What about p < 0.001? and what is the percentage (out of 5221)?**What about p < 0.0001? and what is the percentage (out of 5221)?
- I performed 5221 T tests for significance. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, I filtered my data to determine the following:
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? (and %)
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? (and %)
- What about an average log fold change of > 0.25 and p < 0.05? (and %)
- Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)
- What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
Sanity Check: Compare individual genes with known data
- Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?
Notes
Opened up the spreadsheet for the first time
- Taken from Sample Minds Part 1
Scaled and centered the data
- It was important to always reference back to the same rows using a dollar sign symbol attached to B2 and B3 (becoming B2$ and B2$) because we want to always compare the other call to these numerical values.
Performed statistical analysis on the data (averages, StdDev, Tstat, Pvalue, Bonferroni Pvalue)
File after work day on October 15, 2015 - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151015.xls
Updated Files
- Xcel file after work day on October 20, 2015 - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151020.xls
- .txt. file after work day on October 20, 2015 - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151020.txt
Questions
- 99 genes have a P value < 0.05
- 19 genes have a P value < 0.01
- 1 gene has a P value < 0.001
- No genes have a P value < 0.0001
- These values do not match with known values given in instructions from Part 1
Created a new spreadsheet
- File:Copy of Merrell Compiled Raw Data Vibrio ES 20151021.xlsx
- File:Copy of Merrell Compiled Raw Data Vibrio ES 20151021.txt
Responses to Questions
- 361 genes have a P value < 0.05
- 94 genes have a P value < 0.01
- 12 genes have a P value < 0.001
- 1 gene has a P value < 0.0001
- No genes have a Bonferroni-corrected P value < 0.05 - 0%
- No genes have a B-H corrected P value < 0.05 - 0%
- 149 genes have a Pvalue <0.05 and a Avg_LogFC_all > 0
- 212 genes have a Pvalue <0.05 and a Avg_LogFC_all < 0
- No genes have a Pvalue <0.05 and a Avg_LogFC_all >.25 and <0.05
- 6 genes have a Pvalue <0.05 and a Avg_LogFC_all >.25 and <0.05
Part Two
- Used the 2009 database
- There were 772 errors in this data
- Here is the file EX.txt - File:Copy of Merrell Compiled Raw Data Vibrio ES 20151020.EX.txt