Malverso Week 8
From LMU BioDB 2015
Contents
- 1 Electronic Lab Notebook
- 2 Part One
- 2.1 Patient A
- 2.2 Patient B
- 2.3 Patient C
- 2.4 Normalizing the Log Ratios
- 2.5 Performing Statistical Analysis on the Ratios
- 2.6 Calculating the Bonferroni p value Correction
- 2.7 Calculating the Benjamini & Hochberg p value Correction
- 2.8 Preparing the file for GenMAPP
- 2.9 Sanity Check: Number of genes significantly changed
- 2.10 Sanity Check: Compare Individual Genes with Known Data
- 3 Part Two
- 4 Files:
- 5 Team Page
- 6 Assignments
Electronic Lab Notebook
Part One
- The data from the Merrell et al. (2002) paper was accessed from the Stanford Microarray Database.
- The Log2 of R/G Normalized Ratio (Median) has been copied from the raw data files downloaded from the Stanford Microarray Database.
Patient A
- Sample 1: 24047.xls (A1)
- Sample 2: 24048.xls (A2)
- Sample 3: 24213.xls (A3)
- Sample 4: 24202.xls (A4)
Patient B
- Sample 5: 24049.xls (B1)
- Sample 6: 24050.xls (B2)
- Sample 7: 24203.xls (B3)
- Sample 8: 24204.xls (B4)
Patient C
- Sample 9: 24053.xls (C1)
- Sample 10: 24054.xls (C2)
- Sample 11: 24205.xls (C3)
- Sample 12: 24206.xls (C4)
- I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved it with my initials and the date.
Normalizing the Log Ratios
- To scale and center the data I:
- Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
- Going back to the "compiled_raw_data" worksheet, I clicked to select all and copy. I then went to the "scaled_centered" worksheet, clicked on the upper left-hand cell (cell A1), and pasted the values.
- I inserted two rows in between the top row of headers and the first data row.
- In cell A2, I typed "Average" and in cell A3, "StdDev".
- I then computed the Average log ratio for each chip (each column of data).
- In cell B2, I typed =AVERAGE(B4:B5224) and then pressed Enter.
- I then computed the Standard Deviation of the log ratios on each chip (each column of data).
- In cell B3, I typed =STDEV(B4:B5224) and then pressed Enter.
- I then clicked on B2 and dragged it across all of the columns to copy the equation across all the data. I repeated this with B3 as well. Excel automatically changed the equation to match the cell designations for those columns.
- I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that there was a second set of headers above blank columns of cells. I edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.
- In cell N4, I typed =(B4-B$2)/B$3 so that the data in cell B4 has the average subtracted from it (cell B2) and is divided by the standard deviation (cell B3). I used the dollar signs in front of the "2" and "3" to tell Excel to always reference that row in the equation, even though I would paste it for the entire column of 5221 genes.
- Why is this important? Without the dollar signs, Excel would subtract the cell two rows above each value and divide by the cell directly above it, instead of always using the average and standard deviation in rows 2 and 3.
- I copied and pasted this equation into the entire column by clicking on the original cell with my equation and positioning my cursor at the bottom right corner. When the cursor changed to a thin black plus sign (not a chubby white one), I double-clicked, and the formula copied to the entire column of genes.
- I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, making sure to adjust the equations to pertain to their respective columns.
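The scaling and centering computed by these worksheet formulas can be sketched in Python. This is a minimal sketch: the five values below are hypothetical stand-ins for one chip's column of 5221 log ratios.

```python
import math

# Hypothetical log ratios standing in for one chip's column of 5221 values.
log_ratios = [0.8, -1.2, 0.3, 2.1, -0.5]

n = len(log_ratios)
avg = sum(log_ratios) / n  # =AVERAGE(B4:B5224)
# Excel's STDEV is the sample standard deviation (divisor n - 1).
stdev = math.sqrt(sum((x - avg) ** 2 for x in log_ratios) / (n - 1))  # =STDEV(B4:B5224)

# =(B4-B$2)/B$3 applied to every row of the column.
scaled_centered = [(x - avg) / stdev for x in log_ratios]
print(scaled_centered)
```

After this transform each column has an average of 0 and a sample standard deviation of 1, which is what makes the different chips comparable to each other.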
Performing Statistical Analysis on the Ratios
- This is performed on the scaled and centered data produced in the previous step.
- I inserted a new worksheet and named it "statistics".
- Going back to the "scaled_centered" worksheet, I copied the first column ("ID") and pasted the data into the first column of the new "statistics" worksheet.
- I also went back to the "scaled_centered" worksheet, copied the columns designated "_scaled_centered", clicked on the B1 cell, and selected "Paste Special" > "Values" from the Edit menu. This pasted the numerical results into my new worksheet instead of the equations.
- Next I deleted Rows 2 and 3 where it says "Average" and "StdDev" so that my data rows with gene IDs are immediately below the header row 1.
- I went to a new column on the right of my worksheet and typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three columns.
- I computed the average log fold change for the replicates for each patient by typing =AVERAGE(B2:E2) into cell N2. I copied this equation and pasted it into the rest of the column.
- Next I created the equation for patients B and C and pasted it into their respective columns.
- I computed the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages and pasted it into this entire column.
- I inserted a new column next to the "Avg_LogFC_all" column and labeled it "Tstat". This will compute a T statistic that tells us whether the scaled and centered average log ratio is significantly different than 0 (no change).
- I entered the equation =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)), copied it, and pasted it into all rows in that column.
- Next I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: =TDIST(ABS(R2),degrees of freedom,2)
- The number of degrees of freedom is the number of replicates minus one, so in this case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.
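The T statistic and p value steps above can be sketched in Python. With 2 degrees of freedom, the two-tailed p value has a closed form, 1 - |t|/sqrt(t² + 2), equivalent to Excel's TDIST(ABS(t),2,2); the three averages below are hypothetical stand-ins for one gene's Avg_LogFC_A, _B, and _C values.

```python
import math

def tstat(values):
    """One-sample T statistic against a mean of 0, as in the
    =AVERAGE(...)/(STDEV(...)/SQRT(3)) worksheet formula."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean / (math.sqrt(var) / math.sqrt(n))

def pvalue_df2(t):
    """Two-tailed p value for 2 degrees of freedom, via the closed-form
    CDF of the t distribution with nu = 2 (matches TDIST(ABS(t),2,2))."""
    a = abs(t)
    return 1 - a / math.sqrt(a * a + 2)

# Hypothetical per-patient averages for one gene.
avgs = [0.9, 1.1, 0.7]
t = tstat(avgs)
print(t, pvalue_df2(t))
```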
Calculating the Bonferroni p value Correction
- I performed adjustments to the p value to correct for the multiple testing problem. I labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
- I typed the equation =S2*5221, and used the trick to copy the formula throughout the column.
- I replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: =IF(T2>1,1,T2). Using the trick, I copied the formula throughout the column.
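Taken together, the two Bonferroni columns compute min(p × 5221, 1). A sketch with hypothetical p values:

```python
# Hypothetical unadjusted p values standing in for the Pvalue column.
pvals = [0.00001, 0.0005, 0.03, 0.2]
n_tests = 5221  # number of genes, i.e. number of T tests performed

# First column: =S2*5221; second column: =IF(T2>1,1,T2).
bonferroni = [min(p * n_tests, 1) for p in pvals]
print(bonferroni)
```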
Calculating the Benjamini & Hochberg p value Correction
- I inserted a new worksheet named "B-H_Pvalue".
- I copied and pasted the "ID" column from the previous worksheet into the first column of the new worksheet.
- I then inserted a new column on the very left and named it "MasterIndex". This creates a numerical index of genes so that I could sort them back into the same order later.
- I typed a "1" in cell A2 and a "2" in cell A3.
- Selecting both cells, I hovered my mouse over the bottom-right corner of the selection until it made a thin black + sign. I double-clicked on the + sign to fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
- For the following, I used Paste special > Paste values. I copied the unadjusted p values from the previous worksheet and pasted it into Column C.
- I selected all of columns A, B, and C and sorted ascending on Column C by clicking the A-to-Z sort button on the toolbar and, in the window that appeared, choosing to sort by Column C, smallest to largest.
- I typed the header "Rank" in cell D1, then typed "1" into cell D2 and "2" into cell D3. I selected both cells D2 and D3 and double-clicked on the plus sign at the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
- I now calculated the Benjamini and Hochberg p value correction.
- I typed "B-H_Pvalue" in cell E1 and the following formula in cell E2: =(C2*5221)/D2. I pressed Enter and then copied that equation to the entire column.
- I typed "B-H_Pvalue" into cell F1.
- Then I typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter, and copied that equation to the entire column.
- I selected columns A through F and sorted them by MasterIndex in Column A in ascending order.
- I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
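The MasterIndex / sort / Rank / re-sort procedure above amounts to computing p × 5221 / rank (capped at 1) for each gene and then restoring the original row order. A sketch with four hypothetical p values; note this mirrors the worksheet formulas exactly, without the step-up monotonicity adjustment some Benjamini-Hochberg implementations add.

```python
# Hypothetical unadjusted p values, one per gene, in MasterIndex order.
pvals = [0.03, 0.0002, 0.2, 0.0005]
n = len(pvals)  # 5221 in the actual worksheet

# Sort ascending (the Column C sort), assign ranks 1..n (Column D),
# apply =(C2*n)/D2 capped at 1 (Columns E and F), then store each result
# back at its original MasterIndex position.
order = sorted(range(n), key=lambda i: pvals[i])
bh = [0.0] * n
for rank, i in enumerate(order, start=1):
    bh[i] = min(pvals[i] * n / rank, 1)
print(bh)
```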
Preparing the file for GenMAPP
- I inserted a new worksheet and named it "forGenMAPP" and copied everything from the "statistics" worksheet and pasted it into cell A1, making sure that the values were pasted.
- I selected Columns B through Q (all the fold changes). I then selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places, and clicked OK.
- I next selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places and clicked OK.
- I deleted the left-most Bonferroni p value column, preserving the one that shows the result of the "if" statement.
- I inserted a column to the right of the "ID" column and typed the header "SystemCode" into the top cell. I filled the entire column (each cell) with the letter "N".
- I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. I clicked through the various warnings.
- I then uploaded both the .xls and .txt files to this journal page.
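The tab-delimited export can be sketched in Python. The file name, the column subset, and the two example rows (values reported later in this notebook for VC0028 and VC0941) are illustrative, not the full 5221-row worksheet.

```python
import csv

# Illustrative rows; the real worksheet has one row per gene.
rows = [
    {"ID": "VC0028", "Avg_LogFC_all": 1.65, "Pvalue": 0.0474},
    {"ID": "VC0941", "Avg_LogFC_all": 0.88, "Pvalue": 0.0001},
]

# Write a tab-delimited file with a SystemCode column filled with "N",
# fold changes to 2 decimal places, and p values to 4 decimal places.
with open("forGenMAPP.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["ID", "SystemCode", "Avg_LogFC_all", "Pvalue"])
    for r in rows:
        writer.writerow(
            [r["ID"], "N", f"{r['Avg_LogFC_all']:.2f}", f"{r['Pvalue']:.4f}"]
        )
```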
Sanity Check: Number of genes significantly changed
- Before I moved on to the GenMAPP/MAPPFinder analysis, I wanted to perform a sanity check to make sure that I performed my data analysis correctly by comparing my results to the published results of Merrell et al. (2002).
- I opened my spreadsheet and went to the "forGenMAPP" tab.
- I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This let me filter the data according to criteria.
- I clicked on the drop-down arrow on the "Pvalue" column and selected "Custom". In the window that appeared, I set a criterion to filter the data so that the Pvalue had to be less than 0.05.
- How many genes have p value < 0.05? and what is the percentage (out of 5221)? 948 genes. This is 18.16% of the genes.
- What about p < 0.01? and what is the percentage (out of 5221)? 235 genes. This is 4.50% of the genes.
- What about p < 0.001? and what is the percentage (out of 5221)? 24 genes. This is 0.46% of the genes.
- What about p < 0.0001? and what is the percentage (out of 5221)? 2 genes. This is 0.04% of the genes.
- When I used a p value cut-off of p < 0.05, what I am saying is that I would expect to see a gene expression change that deviates this far from zero by chance less than 5% of the time.
- I have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that I would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. Since I have more than 261 genes that pass this cut off, I know that some genes are significantly changed. However, I don't know which ones. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent.
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)? 0 genes. This is 0% of the genes.
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)? 0 genes. This is 0% of the genes.
- In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If I want to be very confident of my data, I would use a small p value cut-off. If I am OK with being less confident about a gene expression change and want to include more genes in the analysis, I can use a larger p value cut-off.
- The "Avg_LogFC_all" tells the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
- I kept the (unadjusted) "Pvalue" filter at p < 0.05, and filtered the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero.
- How many are there? (and %) 352 genes. This is 6.7% of the genes.
- I kept the (unadjusted) "Pvalue" filter at p < 0.05 and filtered the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero.
- How many are there? (and %) 596 genes. This is 11.34% of the genes
- What about an average log fold change of > 0.25 and p < 0.05? (and %) 339 genes. This is 6.49% of the genes.
- Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.) 579 genes. This is 11.09% of the genes.
- For the GenMAPP analysis below, I used the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut off of p < 0.05 for my analysis because I want to include several hundred genes in my analysis.
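The combined filter used for the GenMAPP analysis, p < 0.05 with an average log fold change of more than 0.25 in either direction, can be sketched with hypothetical (p value, fold change) pairs:

```python
# Hypothetical (Pvalue, Avg_LogFC_all) pairs for a few genes.
genes = [(0.012, 0.8), (0.30, -1.2), (0.04, -0.31), (0.049, 0.10)]

# Keep genes with p < 0.05 and an average log fold change greater than
# 0.25 or less than -0.25.
passing = [(p, fc) for p, fc in genes if p < 0.05 and abs(fc) > 0.25]
print(len(passing))  # -> 2
```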
- What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method? Merrell et al. (2002) used a "two-class SAM analysis" with significance determined by the level of expression, which must have shown at least a twofold change. Merrell et al. (2002) looked at the actual change in expression to decide what was significant, while I used the p value, which is the probability that changes in expression were due to chance. Merrell et al. (2002) used a more stringent method: they found 237 genes that were significantly changed, while we found 918 using the criteria of p < 0.05 and an average log fold change of > 0.25 or < -0.25.
Sanity Check: Compare Individual Genes with Known Data
- Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. I looked these genes up in my spreadsheet.
- What are their fold changes and p values? Are they significantly changed in my analysis?
- VC0028 (two entries)
- fold change: 1.65 pvalue:.0474 This is significantly changed because the pvalue is less than .05 and the fold change is greater than .25.
- fold change: 1.27 pvalue:.0692 This is not significantly changed because the pvalue is greater than .05.
- VC0941 (two entries)
- fold change:-.28 pvalue:.1636 This is not significantly changed because the pvalue is greater than .05.
- fold change:.88 pvalue:.0001 This is significantly changed according to my analysis.
- VC0869 (five entries)
- fold change:2.12 pvalue:.02 This is significantly changed.
- fold change:1.50 pvalue:.0174 This is significantly changed.
- fold change:1.59 pvalue:.0463 This is significantly changed.
- fold change:1.95 pvalue:.0227 This is significantly changed.
- fold change:2.20 pvalue:.002 This is significantly changed.
- VC0051 (two entries)
- fold change: 1.89 pvalue:.016 This is significantly changed.
- fold change: 1.92 pvalue:.0139 This is significantly changed.
- VC0647 (three entries)
- fold change: -1.11 pvalue:.0003 This is significantly changed.
- fold change:-0.94 pvalue:.0125 This is significantly changed.
- fold change:-1.05 pvalue:.0051 This is significantly changed because the pvalue is less than .05.
- VC0468
- fold change:.17 pvalue:.3350 This is not significantly changed because the pvalue is greater than .05.
- VC2350
- fold change: -2.40 pvalue:.0130 This is significantly changed.
- VCA0583
- fold change:1.06 pvalue:.1011 This is not significantly changed because the pvalue is greater than .05.
Part Two
- I used the 2010 database.
- There were 121 errors.
- Kristin used the 2009 database and 772 errors were detected.
- My database is newer, so it makes sense that the number of errors decreased from the previous year; the change in the database from 2009 to 2010 was presumably an improvement. The database I used probably has more entries and fewer bugs.
- We were an increased pair, which I labeled red.
- I labeled decreased with green.
Gene Ontology Results
- branched chain family amino acid metabolic process
- branched chain family amino acid biosynthetic process
- IMP metabolic process
- IMP biosynthetic process
- purine ribonucleoside monophosphate biosynthetic process
- purine ribonucleoside monophosphate metabolic process
- purine nucleoside monophosphate metabolic process
- purine nucleoside monophosphate biosynthetic process
- 'de novo' IMP biosynthetic process
- arginine metabolic process
- Our results were completely different. There must have been significant findings in 2009 that uncovered new gene changes between the database versions.
Files:
File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).xls
File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).txt
File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).gex
File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).EX.txt
Team Page
Assignments
- Week 1
- Week 2
- Week 3
- Week 4
- Week 5
- Week 6
- Week 7
- Week 8
- Week 9
- Week 10
- Week 11
- Week 12
- Week 13
- Week 14
- Week 15
Individual Journal Entries
- Malverso User Page (Week 1)
- Week 2
- Week 3
- Week 4
- Week 5
- Week 6
- Week 7
- Week 8
- Week 9
- Week 10
- Week 11
- Week 12
- Week 13
- Week 14
- Week 15