Jwoodlee Week 8
Revision as of 23:19, 25 October 2015
Electronic Lab Notebook
I downloaded the raw data from the Merrell et al. experiment, found here.
I then followed the procedure on that page:
Part 1 Procedure
Procedure taken from [1]. Reworded to reflect what I did.
I scaled and centered the data (between-chip normalization) by performing the following operations: I inserted a new worksheet into the Excel file and named it "scaled_centered". I went back to the "compiled_raw_data" worksheet, selected all (Command-A), and copied (Command-C). I then went to the "scaled_centered" worksheet, clicked on the upper left-hand cell, and pasted.
Inserted two rows in between the top row of headers and the first data row.
In cell A2, typed "Average" and in cell A3, typed "StdDev".
I then computed the average for the first column. In cell B2, I typed the following equation:
=AVERAGE(B4:B5224)
and pressed "Enter".
I then computed the standard deviation for the first column. In cell B3, I typed the following equation:
=STDEV(B4:B5224)
and pressed "Enter".
I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns to calculate for the rest of the columns.
I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. I then edited the names of the columns so that they now read: A1_scaled_centered, A2_scaled_centered, etc. In cell N4, I typed the following equation: =(B4-B$2)/B$3 We wanted the data in cell B4 to have the average subtracted from it (cell B2) and be divided by the standard deviation (cell B3). The dollar signs make sure that the proper cells are always used. I then copied and pasted this equation into the entire column across all rows.
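The scale-and-center step above can be sketched in Python. This is a hedged illustration, not part of the original workbook: the `chip_A1` values are made up, and only the arithmetic mirrors the spreadsheet formula =(B4-B$2)/B$3.

```python
# Sketch of between-chip normalization: subtract the column average and
# divide by the column standard deviation, as in =(B4-B$2)/B$3.
# The chip_A1 values are made-up log ratios, not the Merrell data.
from statistics import mean, stdev

def scale_center(column):
    """Return z-scores for one chip's column of log ratios."""
    avg = mean(column)        # Excel's AVERAGE (cell B2)
    sd = stdev(column)        # Excel's STDEV (cell B3), sample standard deviation
    return [(x - avg) / sd for x in column]

chip_A1 = [0.12, -0.40, 0.55, -0.03, 0.21]
scaled = scale_center(chip_A1)
```

After this step each column has mean 0 and standard deviation 1, which is what makes the chips comparable to one another.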
In order to perform statistical analysis on the data, I inserted a new worksheet called "statistics". Went back to the "scaled_centered" worksheet and copied the ID column and pasted the data from that column into the first column of the statistics worksheet.
I went back to the "scaled_centered" worksheet and copied the columns designated "scaled_centered".
I went back to my new worksheet and clicked on the B1 cell. I then selected "Paste Special" from the Edit menu so I could get the values from the previous worksheet without the equations. I then deleted rows 2 and 3 that were there previously for the average and standard deviation. I then added 3 new headers to blank columns to the right: "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C".
I then computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(B2:E2)
into cell N2. I then pasted this equation into the rest of the column.
I then created the equations for patients B and C and pasted them into their respective columns. The equations are: =AVERAGE(F2:I2) for patient B
and =AVERAGE(J2:M2) for patient C.
I then got the average of the averages and put them into the next column on the right with the header: "Avg_LogFC_all". The equation is: =AVERAGE(N2:P2)
I inserted a new column next to the "Avg_LogFC_all" column and labeled the column "Tstat". I then entered the equation:
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))
where 3 is the number of replicates. I spread this equation out to the whole column.
I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: =TDIST(ABS(R2),2,2)
The number of degrees of freedom is the number of replicates minus one, so in this case there were 2 degrees of freedom. I then pasted the equation throughout the whole column.
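The Tstat and Pvalue columns can be sketched in Python. This is a hedged illustration: the `gene` values are invented, and the p value function below is the closed form of the two-tailed t distribution that holds only for this specific df = 2 case, standing in for Excel's TDIST(ABS(t), 2, 2).

```python
# Sketch of =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and =TDIST(ABS(R2),2,2).
# For df = 2 the two-tailed p value has the closed form 1 - |t|/sqrt(t^2 + 2).
from math import sqrt
from statistics import mean, stdev

def t_statistic(values):
    """One-sample t statistic against 0 for one gene's replicate averages."""
    return mean(values) / (stdev(values) / sqrt(len(values)))

def p_two_tailed_df2(t):
    """Two-tailed p value with 2 degrees of freedom (3 replicates minus 1)."""
    t = abs(t)
    return 1 - t / sqrt(t * t + 2)

gene = [0.5, 0.6, 0.7]   # hypothetical Avg_LogFC_A, _B, _C for one gene
t = t_statistic(gene)
p = p_two_tailed_df2(t)
```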
Now I had to adjust the pvalues to account for the multiple testing problem. I labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
I typed the equation =S2*5221
, and copied it to the entire first Bonferroni_Pvalue column.
In the second Bonferroni_Pvalue column I replaced any corrected p value greater than 1 with the number 1 by typing =IF(T2>1,1,T2)
into the first cell below the second Bonferroni_Pvalue header. I then copied that equation into the entire column.
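The two Bonferroni columns can be sketched as a single Python function; a hedged illustration of =S2*5221 followed by =IF(T2>1,1,T2), with made-up example p values.

```python
# Bonferroni correction as done in the worksheet: multiply each
# unadjusted p value by the number of tests (5221 genes), then cap at 1.
N_GENES = 5221

def bonferroni(p, n_tests=N_GENES):
    """Return min(p * n_tests, 1), mirroring the two spreadsheet columns."""
    return min(p * n_tests, 1.0)

small = bonferroni(0.00001)   # 0.05221 before the cap, so reported as-is
large = bonferroni(0.01)      # 52.21 before the cap, so reported as 1
```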
I inserted a new worksheet named "B-H_Pvalue" where the Benjamini and Hochberg Pvalue correction would be calculated.
I copy and pasted the ID column from the previous worksheet into this new one.
I inserted a new column on the very left and named it "MasterIndex".
I typed a "1" in cell A2 and a "2" in cell A3 and used this pattern to fill out the rest of the column with numbers 1-5221.
I pasted only the values (not the equations) of my unadjusted p values into column C of my new worksheet.
I then selected all of columns A, B, and C, and sorted by ascending values on Column C.
I added the header "Rank" in cell D1, where a series of numbers in ascending order from 1-5221 will reside.
This is the p value rank, smallest to largest.
I used the same trick as before to populate this column with ascending numbers.
I typed B-H_Pvalue in cell E1, and typed the following formula in cell E2: =(C2*5221)/D2
and pressed enter and applied that to the whole column.
I then typed B-H_Pvalue into cell F1.
Then I added =IF(E2>1,1,E2)
into cell F2 and pressed enter and applied it to the whole column.
I then sorted columns A through F in order by the MasterIndex.
I copied column F and used paste special to paste it into the next column on the right of the "statistics" sheet.
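The whole B-H worksheet (sort, rank, =(C2*5221)/D2, cap at 1, re-sort by MasterIndex) can be sketched in Python. A hedged note: this follows the worksheet formula literally and, like the worksheet, does not enforce the monotonicity step that a full Benjamini-Hochberg adjustment includes; the example p values are invented.

```python
# Sketch of the B-H_Pvalue calculation: rank the p values in ascending
# order, compute (p * m)/rank for each, cap at 1, and return the results
# in the original (MasterIndex) order.
def benjamini_hochberg(pvalues):
    m = len(pvalues)                                   # 5221 in the worksheet
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):          # the "Rank" column D
        adjusted[i] = min(pvalues[i] * m / rank, 1.0)  # =(C2*m)/D2, capped
    return adjusted

adj = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
```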
Now it was time to prepare the file for GenMAPP. I inserted a new worksheet and named it "forGenMAPP". I then pasted the values from the "statistics" worksheet into the "forGenMAPP" worksheet. To prepare the data for GenMAPP, I selected columns B through Q (all the fold changes), selected the menu item Format > Cells, and under the Number tab selected 2 decimal places. I then selected all the columns containing p values, selected Format > Cells again, and under the Number tab selected 4 decimal places. I deleted the left-most Bonferroni p value column, preserving the one that shows the result of the "if" statement. I inserted a column to the right of the "ID" column, typed the header "SystemCode" into the top cell of this column, and filled every cell of the column with the letter "N". Finally, I selected File > Save As and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. This text file is what I used later when working with GenMAPP.
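The export format can be sketched with Python's csv module. A hedged illustration: the abbreviated headers and the two rows are made up; only the tab delimiter and the "N" SystemCode column inserted after ID reflect the actual procedure.

```python
# Sketch of the tab-delimited export with a SystemCode column of "N"
# inserted after the ID column, as GenMAPP expects.
import csv
import io

header = ["ID", "Avg_LogFC_all", "Pvalue"]   # abbreviated, hypothetical headers
rows = [["gene1", "1.65", "0.0001"],         # made-up example rows
        ["gene2", "-0.38", "0.0456"]]

out = io.StringIO()                          # stand-in for the .txt file
writer = csv.writer(out, delimiter="\t")
writer.writerow([header[0], "SystemCode"] + header[1:])
for row in rows:
    writer.writerow([row[0], "N"] + row[1:])
text = out.getvalue()
```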
Sanity Checks:
Before I moved on to the GenMAPP/MAPPFinder analysis, I needed to verify that I did the analysis correctly.
I found out the number of genes that were significantly changed at various p value cut-offs and also compared the data analysis with the published results of Merrell et al. (2002).
I opened my spreadsheet and went to the "forGenMAPP" tab.
I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. I clicked on the drop-down arrow on the "Pvalue" column, selected "Custom", and in the window that appeared, set a criterion that will filter the data so that the Pvalue has to be less than 0.05.
- How many genes have p value < 0.05? and what is the percentage (out of 5221)?
- I found that 948 genes had a p value of less than 0.05, which is about 18%.
- What about p < 0.01? and what is the percentage (out of 5221)?
- 235, 4.5%
- What about p < 0.001? and what is the percentage (out of 5221)?
- 24, 0.46%
- What about p < 0.0001? and what is the percentage (out of 5221)?
- 2, 0.04%
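The filtering counts above can be sketched in Python. This is a hedged illustration: the `demo` p values are invented, and only the counting logic mirrors the Autofilter step.

```python
# Sketch of the sanity check: count genes below a p value cut-off and
# report the percentage out of the total number of genes.
def count_below(pvalues, cutoff, total=5221):
    n = sum(1 for p in pvalues if p < cutoff)
    return n, 100.0 * n / total

demo = [0.003, 0.02, 0.2, 0.6]   # made-up p values, not the real worksheet
n, pct = count_below(demo, 0.05, total=len(demo))
```

With the real worksheet, the same count at p < 0.05 gives the 948 genes (about 18% of 5221) reported above.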
When we use a p value cut-off of p < 0.05, we are saying that we would have seen a gene expression change that deviates this far from zero less than 5% of the time by chance. We have just performed 5221 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect this magnitude of gene expression change in about 5% of our t tests, or 261 times, just by chance. (Test your understanding: http://xkcd.com/882/.) Since more than 261 genes pass this cut-off, we know that some genes are significantly changed; however, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and the Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, filter the data to determine the following:
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?
The "Avg_LogFC_all" column tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? (and %)
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? (and %)
- What about an average log fold change of > 0.25 and p < 0.05? (and %)
- Or an average log fold change of < -0.25 and p < 0.05? (and %)
(These are more realistic values for the fold change cut-offs because 0.25 represents about a 20% fold change, which is about the level of detection of this technology.)
In summary, the p value cut-off should not be thought of as some magical number at which data become "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, we use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off. For the GenMAPP analysis below, we will use a fold change cut-off of greater than 0.25 or less than -0.25 and an unadjusted p value cut-off of p < 0.05, because we want to include several hundred genes in our analysis.
- What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
Sanity Check: Compare individual genes with known data
Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?
File:ModifiedDataMerrellJwoodlee.gmf
File:ModifiedDataMerrellJwoodlee-Criterion1-GO filters.xls
File:ModifiedDataMerrellJwoodlee-Criterion1-GO.txt
File:ModifiedDataMerrellJwoodlee.gex
File:ModifiedDataMerrellJwoodlee.EX.txt
File:ModifiedDataMerrellJwoodleeText.txt
File:ModifiedDataMerrellJwoodlee.xls
File:Metal ion binding Jwoodlee.mapp
BIOL 367, Fall 2015, User Page, Team Page