Difference between revisions of "Malverso Week 8"

From LMU BioDB 2015
Jump to: navigation, search
(inputted protocol for normalizing the log ratios)
(added protocol for performing statistical analysis(...))
Line 39: Line 39:
 
*I copy and pasted this equation into the entire column by clicking on the original cell with my equation and position my cursor at the bottom right corner. When the curser changed to a thin black plus sign (not a chubby white one) I double clicked, and the formula magically copied to the entire column of genes.
 
*I copy and pasted this equation into the entire column by clicking on the original cell with my equation and position my cursor at the bottom right corner. When the curser changed to a thin black plus sign (not a chubby white one) I double clicked, and the formula magically copied to the entire column of genes.
 
*I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, making sure to adjust the equations to pertain to their respective columns.   
 
*I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, making sure to adjust the equations to pertain to their respective columns.   
 
+
===Performing Statistical Analysis on the Ratios===
 
+
*This is performed on the scaled and centered data produced in the previous step.
 +
*I inserted a new worksheet and named it "statistics".
 +
*Going back to the "scaling_centering" worksheet, I copied the first column ("ID") and pasted the data into the first column of the new "statistics" worksheet.
 +
* I also went back to the "scaling_centering" worksheet and copied the columns designated "_scaled_centered" and copied them by clicking on the B1 cell and selecting "Paste Special" > “Values” from the Edit menu. This pasted the numerical result into my new worksheet instead of the equation.
 +
*Next I deleted Rows 2 and 3 where it says "Average" and "StDev" so that my data rows with gene IDs are immediately below the header row 1.
 +
*I went to a new column on the right of my worksheet and typed the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
 +
*I computed the average log fold change for the replicates for each patient by typing =AVERAGE(B2:E2) into cell N2. I copied this equation and pasted it into the rest of the column.
 +
*Next I created the equation for patients B and C and pasted it into their respective columns.
 +
*I computed the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages and pasted it into this entire column.
 +
* I inserted a new column next to the "Avg_LogFC_all" column and labeled it "Tstat". This will compute a T statistic that tells us whether the scaled and centered average log ratio is significantly different than 0 (no change).
 +
*I entered the equation: ¬= AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))and copied it and then pasted it into all rows in that column.
 +
*Next I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: = TDIST(ABS(R2),degrees of freedom,2)
 +
*The number of degrees of freedom is the number of replicates minus one, so in this case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.
  
  

Revision as of 06:08, 26 October 2015

Electronic Lab Notebook

Part One

  • The data from the Merrell et al. (2002) paper was accessed from the Stanford Microarray Database.
  • The Log2 of R/G Normalized Ratio (Median) has been copied from the raw data files downloaded from the Stanford Microarray Database.

Patient A

  • Sample 1: 24047.xls (A1)
  • Sample 2: 24048.xls (A2)
  • Sample 3: 24213.xls (A3)
  • Sample 4: 24202.xls (A4)

Patient B

  • Sample 5: 24049.xls (B1)
  • Sample 6: 24050.xls (B2)
  • Sample 7: 24203.xls (B3)
  • Sample 8: 24204.xls (B4)

Patient C

  • Sample 9: 24053.xls (C1)
  • Sample 10: 24054.xls (C2)
  • Sample 11: 24205.xls (C3)
  • Sample 12: 24206.xls (C4)
  • I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved it with my initials and the date.

Normalizing the Log Ratios

  • To scale and center the data I:
    • Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
    • Going back to the "compiled_raw_data" worksheet, I clicked to select all and copy. I then went to the "scaled_centered" worksheet, click on the upper, left-hand cell (cell A1) and pasted the values.
    • I inserted two rows in between the top row of headers and the first data row.
    • In cell A2, I typed "Average" and in cell A3, "StdDev".
  • I then computed the Average log ratio for each chip (each column of data).
    • In cell B2, I typed =AVERAGE(B4:B5224)and then pressed Enter.
  • I then computed the Standard Deviation of the log ratios on each chip (each column of data).
    • In cell B3 I typed = STDEV(B4:B5224)and then pressed enter.
    • I then clicked on B2 and dragged it across all of the columns to copy the equation across all the data. I repeated this with B3 as well. Excel automatically changed the equation to match the cell designations for those columns.
  • I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that there was a second set of headers above blank columns of cells. I Edited the names of the columns so that they read: A1_scaled_centered, A2_scaled_centered, etc.
  • In cell N4, I typed =(B4-B$2)/B$3 so that the data in cell B4 has the average subtracted from it (cell B2) and is divided by the standard deviation (cell B3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference that row in the equation, even though I will pasted it for the entire column of 5221 genes.
  • Why is this important? This is important because if we didn’t use the dollar signs, Excel would assume that each cell should be subtracted by the cell two above it and divided by the cell directly above it instead of always by the average and standard deviation.
  • I copy and pasted this equation into the entire column by clicking on the original cell with my equation and position my cursor at the bottom right corner. When the curser changed to a thin black plus sign (not a chubby white one) I double clicked, and the formula magically copied to the entire column of genes.
  • I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header, making sure to adjust the equations to pertain to their respective columns.

Performing Statistical Analysis on the Ratios

  • This is performed on the scaled and centered data produced in the previous step.
  • I inserted a new worksheet and named it "statistics".
  • Going back to the "scaling_centering" worksheet, I copied the first column ("ID") and pasted the data into the first column of the new "statistics" worksheet.
  • I also went back to the "scaling_centering" worksheet and copied the columns designated "_scaled_centered" and copied them by clicking on the B1 cell and selecting "Paste Special" > “Values” from the Edit menu. This pasted the numerical result into my new worksheet instead of the equation.
  • Next I deleted Rows 2 and 3 where it says "Average" and "StDev" so that my data rows with gene IDs are immediately below the header row 1.
  • I went to a new column on the right of my worksheet and typed the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
  • I computed the average log fold change for the replicates for each patient by typing =AVERAGE(B2:E2) into cell N2. I copied this equation and pasted it into the rest of the column.
  • Next I created the equation for patients B and C and pasted it into their respective columns.
  • I computed the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages and pasted it into this entire column.
  • I inserted a new column next to the "Avg_LogFC_all" column and labeled it "Tstat". This will compute a T statistic that tells us whether the scaled and centered average log ratio is significantly different than 0 (no change).
  • I entered the equation: ¬= AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))and copied it and then pasted it into all rows in that column.
  • Next I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: = TDIST(ABS(R2),degrees of freedom,2)
  • The number of degrees of freedom is the number of replicates minus one, so in this case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.


Files:

File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).xls

File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).txt

File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).gex

File:Merrell Compiled Raw Data Vibrio MA 20151015 (2).EX.txt

Sanity Check: Number of Genes Significantly Changed

  • Number of genes with a p value less than .05 = 948. This is 18.16% of the genes.
  • Number of genes with a p value less than .01 = 235. This is 4.50% of the genes.
  • Number of genes with a p value less than .001 = 24. This is 0.46% of the genes.
  • Number of genes with a p value less than .0001 = 2. This is 0.04% of the genes.
  • Number of genes with a Bonferroni p value less than .05 = 0. This is 0% of the genes.
  • Number of genes with a B-H p value less than .05 = 0. This is 0% of the genes.
  • Avg_LogFC_all > 0 = 352. 6.7%
  • < 0 = 596. 11.34%
  • > .25 = 339. 17.83%
  • <-.25 = 579. 11.09%
  • "two-class SAM analysis was conducted" "with statistically significant changes in the level of expression - at least a twofold change" They used the actual level of expression change to figure out what was significant while we used p value (which is the probability that changes in expression are due to chance). Merrell et al. (2002) used a more stringent method because they found 237 genes that were significantly changed while we found 918. (-.25 < pvalue > .25)

Sanity Check: Compare Individual Genes with Known Data

  • VC0028 fold change: 1.65 pvalue:.0474 and 1.27 .0692
  • first is sig changed, second is not
  • VC0941 -.28 .1636 and .88 .0001
  • VC0869 Fold changes: 2.12, 1.50, 1.59, 1.95, 2.20

P value: 0.02, 0.0174, 0.0463, 0.0227, 0.002 All are significantly changed

  • VC0051 Fold change: 1.89, 1.92

P value: 0.016, 0.0139 Both are significantly changed

  • VC0647 fold change: -1.11 pvalue:.0003 and fold change:-0.94 pvalue:.0125 and fold change:-1.05 pvalue:.0051
  • VC0468 -.17 and pvalue:.3350
  • VC2350 fold change: -2.40 pvalue:.0130
  • VCA0583 fold change: 1.06 pvalue: .1011


Part Two

  • I used the 2010 database.
  • There were 121 errors.
  • Kristin used the 2009 database and 772 errors were detected.
  • My database is newer and therefore it makes sense that the number of errors has decreased from the previous year, because it makes sense that the change in the database from 2009 - 2010 was an improvement. The database I used probably has more entries and less bugs.
  • We were an increased pair, which I labeled red.
  • I labeled decreased with green.

Gene Ontology Results

  1. branched chain family amino acid metabolic process
  2. branched chain family amino acid biosynthetic process
  3. IMP metabolic process
  4. IMP biosynthetic process
  5. purine ribonucleoside monophosphate biosynthetic process
  6. purine ribonucleoside monophosphate metabolic process
  7. purine nucleoside monophosphate metabolic process
  8. purine nucleoside monophosphate biosynthetic process
  9. 'de novo' IMP biosynthetic process
  10. arginine metabolic process
  • Our results were completely different. There must have been some significant findings in the year 2009 that uncovered significant gene changes.


Team Page

Heavy Metal HaterZ

Assignments

Individual Journal Entries

Shared Journal Entries