Difference between revisions of "Blitvak Week 8"
From LMU BioDB 2015
(added first part of statistics steps) |
(added some more to stat section) |
||
Line 16: | Line 16: | ||
*The equation for each column was checked to ensure that it was correct (ex. for ''A2_scaled_centered'', the equation should be <code>=(C4-C$2)/C$3</code>) | *The equation for each column was checked to ensure that it was correct (ex. for ''A2_scaled_centered'', the equation should be <code>=(C4-C$2)/C$3</code>) | ||
===Performing statistical analysis on the ratios=== | ===Performing statistical analysis on the ratios=== | ||
+ | ====Initial statistical analysis==== | ||
*A new worksheet was inserted with the name ''statistics'' | *A new worksheet was inserted with the name ''statistics'' | ||
*The first ''ID'' column in the ''scaling_centering'' worksheet was copied and pasted into the first column of ''statistics'' | *The first ''ID'' column in the ''scaling_centering'' worksheet was copied and pasted into the first column of ''statistics'' | ||
Line 29: | Line 30: | ||
*The top cell in the next column was labeled with ''Pvalue''; the equation <code>=TDIST(ABS(R2),2,2)</code> was entered in the cell below the label and copied and pasted into all of the rows in that column. The first "2" in that equation is the degrees of freedom (there are 2 degrees of freedom since the number of replicates, which is 3, minus 1 is 2) | *The top cell in the next column was labeled with ''Pvalue''; the equation <code>=TDIST(ABS(R2),2,2)</code> was entered in the cell below the label and copied and pasted into all of the rows in that column. The first "2" in that equation is the degrees of freedom (there are 2 degrees of freedom since the number of replicates, which is 3, minus 1 is 2) | ||
====Calculating the Bonferroni p-value Correction==== | ====Calculating the Bonferroni p-value Correction==== | ||
+ | *Adjustments to the p-value were performed with the purpose of correcting for the multiple testing problem. The next two columns, to the right, in ''statistics'' were both labeled with ''Bonferroni_Pvalue'' | ||
+ | *The equation <code>=S2*5221</code> was typed into the first cell under the first ''Bonferroni_Pvalue'' header; the formula was copied down the entire column | ||
+ | *Any corrected p-value that is greater than 1 was replaced with the number 1 by typing <code>=IF(T2>1,1,T2)</code> into the first cell below the second ''Bonferroni_Pvalue'' header; the formula was copied throughout the entire column | ||
+ | ====Calculating the Benjamini & Hochberg p-value Correction==== | ||
+ | *A new worksheet named ''B-H_Pvalue'' was inserted | ||
+ | *The ''ID'' column from the previous worksheet was copied and pasted into the first column of the new worksheet | ||
+ | *A new column was inserted to the very left and labeled as ''MasterIndex'' (purpose is to create a numerical index of genes) | ||
+ | **In the ''MasterIndex'' column, a "1" was typed into cell A2 and a "2" was typed into cell A3 | ||
+ | **Both cells were selected and the bottom-right corner (where the cursor becomes a thin black plus sign) was double-clicked. This filled the entire column with the numbers 1 to 5221 (# of genes) | ||
+ | |||
Revision as of 04:43, 24 October 2015
Contents
Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)
- Merrell_Compiled_Raw_Data_Vibrio.xls was downloaded, saved to the desktop, and renamed with some additional information (initials and the date)
Normalizing the log ratios for the set of slides in the experiment
The following operations were performed in order to scale and center the microarray data:
- The renamed Excel file was opened and a new Worksheet was inserted with the name scaled_centered
- Everything on the compiled_raw_data worksheet was selected and copied over to scaled_centered (formatting was the same, starting from the left-hand cell, A1)
- Two new rows were inserted between the top row of headers and the first data row in scaled_centered
- In cell A2, Average was typed in; in A3, StdDev was typed in
- The Average log ratio for each chip was computed by typing
=AVERAGE(B4:B5224)
into cell B2 and pressing enter - The Standard Deviation of the log ratios on each chip was computed by typing
=STDEV(B4:B5224)
into cell B3 and pressing enter - The equations in B2 and B3 were copied and pasted into the empty cells in the rest of the columns (A2 to C4)
- The column headings for all of the data columns were copied and pasted to the right of the last data column; this new set of headers was edited so that they read: A1_scaled_centered, A2_scaled_centered, etc.
- The equation
=(B4-B$2)/B$3
was typed into cell N4; the dollar sign symbols were used in front of the "2" and "3" in order to ensure that Excel will not change the reference to that row when that same equation is pasted down the entire column of 5221 genes (this is important because the average and standard deviation is the same for the entire row, and therefore, the reference must stay the same). This equation is the scaling and centering equation. - The scaling and centering equation was copied and pasted down the entire A1_scaled_centered by clicking the original cell with the equation and double-clicking the bottom right corner of the cell (cursor should change to a black plus sign prior to double-clicking)
- The scaling and centering equation was put in each of the data columns with the _scaled_centered header (was copied and pasted down the entire columns)
- The equation for each column was checked to ensure that it was correct (ex. for A2_scaled_centered, the equation should be
=(C4-C$2)/C$3
)
Performing statistical analysis on the ratios
Initial statistical analysis
- A new worksheet was inserted with the name statistics
- The first ID column in the scaling_centering worksheet was copied and pasted into the first column of statistics
- The columns that are designated with _scaled_centered were copied and pasted into the new worksheet (starting from B1); for the pasting, "Paste Special" was required in order to just paste the numerical results into the new worksheet ("Values" was selected from the "Paste Special" window)
- Rows 2 and 3 (corresponding to Average and StDev) were deleted
- The headers Avg_LogFC_A, Avg_LogFC_B, and Avg_LogFC_C were typed into the next three empty columns to the right (immediately adjacent to the last _scaled_centered column)
- The average log fold change for the replicates for each patient was computed by typing
=AVERAGE(B2:E2)
into cell N2. The equation was copied and pasted down the entire column. - The equation for the average log fold change was created for patients B and C; this equation was copied and pasted down their respective columns (for patient B the equation was
=AVERAGE(F2:I2)
, for patient C the equation was=AVERAGE(J2:M2)
) - In the first cell that corresponds to the next empty column, the header Avg_LogFC_all was typed
- An equation that can compute the average of the three previously calculated averages was created (
=AVERAGE(N2:P2)
); this equation was pasted into this entire column (Avg_LogFC_all) - A new column was inserted next to Avg_LogFC_all. This column was given the label/header of Tstat (purpose of this column is to compute a T statistic that will inform whether
- The equation
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))
was entered into Tstat and pasted into all of the rows within that column - The top cell in the next column was labeled with Pvalue; the equation
=TDIST(ABS(R2),2,2)
was entered in the cell below the label and copied and pasted into all of the rows in that column. The first "2" in that equation is the degrees of freedom (there are 2 degrees of freedom since the number of replicates, which is 3, minus 1 is 2)
Calculating the Bonferroni p-value Correction
- Adjustments to the p-value were performed with the purpose of correcting for the multiple testing problem. The next two columns, to the right, in statistics were both labeled with Bonferroni_Pvalue
- The equation
=S2*5221
was typed into the first cell under the first Bonferroni_Pvalue header; the formula was copied down the entire column - Any corrected p-value that is greater than 1 was replaced with the number 1 by typing
=IF(T2>1,1,T2)
into the first cell below the second Bonferroni_Pvalue header; the formula was copied throughout the entire column
Calculating the Benjamini & Hochberg p-value Correction
- A new worksheet named B-H_Pvalue was inserted
- The ID column from the previous worksheet was copied and pasted into the first column of the new worksheet
- A new column was inserted to the very left and labeled as MasterIndex (purpose is to create a numerical index of genes)
- In the MasterIndex column, a "1" was typed into cell A2 and a "2" was typed into cell A3
- Both cells were selected and the bottom-right corner (where the cursor becomes a thin black plus sign) was double-clicked. This filled the entire column with the numbers 1 to 5221 (# of genes)
Working with: 2009 Vibrio Cholera database
- 772 errors were detected in the raw data by genMAPP using the 2009 database
- 121 errors were detected in the raw data by genMAPP using the 2010 database by my partner, Anindita V.
Results of 10/20 Work Session
Results of 10/22 Work Session
- Top 10 Ranked GO Terms
- macromolecule metabolic process
- localization
- transporter activity
- cellular macromolecule metabolic process
- transport
- establishment of localization
- cell projection organization
- cellular biopolymer metabolic process
- macromolecule biosynthetic process
- biopolymer metabolic process
- Analysis of the .EX.txt file produced by GenMAPP, via Excel, revealed that the 772 errors were: Gene not found in OrderedLocusNames or any related system; this suggests that the 2009 database did not
increased expression
Brandon Litvak
BIOL 367, Fall 2015
Weekly Assignments | Individual Journal Pages | Shared Journal Pages |
---|---|---|
|
|
|