Difference between revisions of "Blitvak Week 8"

From LMU BioDB 2015
(added first bit of part 2 analysis)
(added more for part 2 of microarray analysis)
 
*[[Media:Merrell_Compiled_Raw_Data_Vibrio_BL_20151015.xls|Modified Excel File/Data]]  
 

Revision as of 18:15, 25 October 2015

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

Normalizing the log ratios for the set of slides in the experiment

The following operations were performed in order to scale and center the microarray data:

  • The renamed Excel file was opened and a new Worksheet was inserted with the name scaled_centered
  • Everything on the compiled_raw_data worksheet was selected and copied over to scaled_centered (formatting was the same, starting from the left-hand cell, A1)
  • Two new rows were inserted between the top row of headers and the first data row in scaled_centered
  • In cell A2, Average was typed in; in A3, StdDev was typed in
  • The Average log ratio for each chip was computed by typing =AVERAGE(B4:B5224) into cell B2 and pressing enter
  • The Standard Deviation of the log ratios on each chip was computed by typing =STDEV(B4:B5224) into cell B3 and pressing enter
  • The equations in B2 and B3 were copied and pasted into the empty cells in rows 2 and 3 of the remaining data columns (chips A2 to C4)
  • The column headings for all of the data columns were copied and pasted to the right of the last data column; this new set of headers was edited so that they read: A1_scaled_centered, A2_scaled_centered, etc.
  • The equation =(B4-B$2)/B$3 was typed into cell N4; the dollar signs in front of the "2" and "3" ensure that Excel does not change the row references when the equation is pasted down the entire column of 5221 genes (this is important because the average and standard deviation are the same for every gene in a column, and therefore the references to rows 2 and 3 must stay fixed). This equation is the scaling and centering equation.
  • The scaling and centering equation was copied and pasted down the entire A1_scaled_centered column by clicking the original cell with the equation and double-clicking the bottom right corner of the cell (the cursor should change to a black plus sign prior to double-clicking)
  • The scaling and centering equation was put in each of the data columns with the _scaled_centered header (was copied and pasted down the entire columns)
  • The equation for each column was checked to ensure that it was correct (ex. for A2_scaled_centered, the equation should be =(C4-C$2)/C$3)
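
The scaling and centering steps above can be sketched in Python. This is only an illustration of the =(B4-B$2)/B$3 calculation for one chip, not the workflow that was actually used (Excel); the function name and the sample values are made up for the example.

```python
from statistics import mean, stdev

def scale_and_center(log_ratios):
    """Apply (x - average) / standard deviation to one chip's log ratios."""
    avg = mean(log_ratios)   # plays the role of =AVERAGE(B4:B5224)
    sd = stdev(log_ratios)   # sample standard deviation, like Excel's =STDEV(...)
    return [(x - avg) / sd for x in log_ratios]

# Hypothetical log ratios for a single chip (not taken from the Merrell data)
chip_a1 = [0.5, -1.0, 1.5, 0.0, -0.5]
scaled = scale_and_center(chip_a1)
```

After this transformation each chip has an average of 0 and a standard deviation of 1, which is what makes the chips comparable to one another.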

Performing statistical analysis on the ratios

Initial statistical analysis

  • A new worksheet was inserted with the name statistics
  • The first ID column in the scaled_centered worksheet was copied and pasted into the first column of statistics
  • The columns that are designated with _scaled_centered were copied and pasted into the new worksheet (starting from B1); for the pasting, "Paste Special" was required in order to just paste the numerical results into the new worksheet ("Values" was selected from the "Paste Special" window)
  • Rows 2 and 3 (corresponding to Average and StdDev) were deleted
  • The headers Avg_LogFC_A, Avg_LogFC_B, and Avg_LogFC_C were typed into the next three empty columns to the right (immediately adjacent to the last _scaled_centered column)
  • The average log fold change across the replicates for patient A was computed by typing =AVERAGE(B2:E2) into cell N2. The equation was copied and pasted down the entire column.
  • The equation for the average log fold change was created for patients B and C; this equation was copied and pasted down their respective columns (for patient B the equation was =AVERAGE(F2:I2), for patient C the equation was =AVERAGE(J2:M2))
  • In the first cell that corresponds to the next empty column, the header Avg_LogFC_all was typed
  • An equation that can compute the average of the three previously calculated averages was created (=AVERAGE(N2:P2)); this equation was pasted into this entire column (Avg_LogFC_all)
  • A new column was inserted next to Avg_LogFC_all. This column was given the label/header of Tstat (the purpose of this column is to compute a T statistic that will inform whether the average log fold change is significantly different from 0)
  • The equation =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) was entered into Tstat and pasted into all of the rows within that column
  • The top cell in the next column was labeled with Pvalue; the equation =TDIST(ABS(R2),2,2) was entered in the cell below the label and copied and pasted into all of the rows in that column. The first "2" in that equation is the degrees of freedom (there are 2 degrees of freedom since the number of replicates, which is 3, minus 1 is 2); the second "2" specifies a two-tailed test
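
As a rough check on the Tstat and Pvalue formulas above, the same calculation can be written in Python. This is a sketch under assumptions: the function names and the three patient averages are hypothetical, and the p-value uses the closed form of the Student's t distribution that holds only for 2 degrees of freedom, which should correspond to =TDIST(ABS(t),2,2).

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values):
    """One-sample t statistic against 0: mean / (stdev / sqrt(n))."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))

def two_tailed_p_df2(t):
    """Two-tailed p-value for Student's t with exactly 2 degrees of freedom.

    For df = 2 the CDF has the closed form 1/2 + t / (2 * sqrt(2 + t^2)),
    so the two-tailed p-value simplifies to 1 - |t| / sqrt(2 + t^2).
    """
    return 1.0 - abs(t) / sqrt(2.0 + t * t)

# Hypothetical Avg_LogFC_A, Avg_LogFC_B, Avg_LogFC_C values for one gene
patient_averages = [1.2, 0.9, 1.5]
t = t_statistic(patient_averages)
p = two_tailed_p_df2(t)
```

With only three values per gene, the t statistic has to exceed about 4.30 in absolute value before the p-value drops below 0.05.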

Calculating the Bonferroni p-value Correction

  • Adjustments to the p-value were performed with the purpose of correcting for the multiple testing problem. The next two columns, to the right, in statistics were both labeled with Bonferroni_Pvalue
  • The equation =S2*5221 was typed into the first cell under the first Bonferroni_Pvalue header; the formula was copied down the entire column
  • Any corrected p-value that is greater than 1 was replaced with the number 1 by typing =IF(T2>1,1,T2) into the first cell below the second Bonferroni_Pvalue header; the formula was copied throughout the entire column
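
The two Bonferroni columns above (multiply by the number of tests, then cap at 1) can be combined into one step. A minimal Python sketch, with a hypothetical function name; the default of 5221 tests is the gene count used in the worksheet.

```python
def bonferroni(p_values, n_tests=5221):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    return [min(p * n_tests, 1.0) for p in p_values]

# Hypothetical unadjusted p-values
corrected = bonferroni([0.000001, 0.0001, 0.5])
```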

Calculating the Benjamini & Hochberg p-value Correction

  • A new worksheet named B-H_Pvalue was inserted
  • The ID column from the previous worksheet was copied and pasted into the first column of the new worksheet
  • A new column was inserted to the very left and labeled as MasterIndex (purpose is to create a numerical index of genes)
    • In the MasterIndex column, a "1" was typed into cell A2 and a "2" was typed into cell A3
    • Both cells were selected and the bottom-right corner (where the cursor becomes a thin black plus sign) was double-clicked. This filled the entire column with the numbers 1 to 5221 (# of genes)
  • Using Paste special > Paste values, the unadjusted p-values from the previous worksheet were copied and pasted into column C of this worksheet
  • Columns A, B, and C were all selected and sort by ascending values was performed on Column C (sort button on toolbar -> custom sort -> Sort by Pvalue, Sort on Values, Order Smallest to Largest)
  • The header Rank was typed into cell D1. "1" was typed into cell D2 and "2" was typed into cell D3; both cells were selected and the double-clicking of the lower right corner was employed in order to fill the column with a series of numbers from 1 to 5221.
  • B-H_Pvalue was typed into cell E1 and the formula =(C2*5221)/D2 was typed into cell E2; the equation was copied down the entire column
  • B-H_Pvalue was typed into cell F1 and the formula =IF(E2>1,1,E2) was typed into cell F2; the equation was copied down the entire column
  • Columns A through F were selected and the columns were sorted by the MasterIndex in column A in ascending order
  • Column F was copied and the values were pasted via Paste special into the next column on the right
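
The sort, rank, and re-sort steps above can be sketched in Python as follows. The function name and sample p-values are hypothetical. The sketch mirrors the worksheet formulas =(C2*5221)/D2 and =IF(E2>1,1,E2) exactly; note that the textbook Benjamini & Hochberg procedure usually adds a further step that forces the adjusted values to be non-decreasing with rank, which neither the worksheet nor this sketch does.

```python
def benjamini_hochberg_as_in_sheet(p_values):
    """Adjusted p = p * (number of tests) / rank, capped at 1.

    Results are returned in the original gene order, which is what
    re-sorting by MasterIndex achieves in the worksheet.
    """
    m = len(p_values)
    # Indices of the genes ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    return adjusted

# Hypothetical unadjusted p-values for four genes
bh = benjamini_hochberg_as_in_sheet([0.04, 0.01, 0.03, 0.5])
```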

Preparing file for GenMAPP

  • A new worksheet was inserted with the name forGenMAPP
  • Everything in the statistics worksheet was selected and copied over to forGenMAPP via Paste special > values
  • Columns B through Q were selected and the number of decimal places was set to 2 via Format > Cells > number tab, set to 2 decimal places
  • All of the columns containing p-values were selected and the number of decimal places was set to 4
  • The left-most Bonferroni p-value column was deleted (the one with an "if" statement was kept)
  • A column to the right of ID column was inserted and given the header SystemCode. The entire column was filled with the letter "N".
  • While on the forGenMAPP worksheet, the file was saved as "Text (Tab-delimited) (*.txt)"
  • The resulting file was opened in Notepad and checked
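
For illustration, the export described above can be sketched in Python. This is an assumption-laden sketch rather than the procedure that was used (Excel's "Save as" dialog): the function names are hypothetical, only two of the data columns are shown, and the output is written to an in-memory buffer instead of a file.

```python
import csv
import io

def to_genmapp_rows(records, system_code="N"):
    """Build rows with a SystemCode column placed right after the ID column."""
    rows = [["ID", "SystemCode", "Avg_LogFC_all", "Pvalue"]]
    for gene_id, avg_logfc, pvalue in records:
        # 2 decimal places for fold changes, 4 for p-values, as in the worksheet
        rows.append([gene_id, system_code, f"{avg_logfc:.2f}", f"{pvalue:.4f}"])
    return rows

def write_tab_delimited(rows, handle):
    """Write the rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Two entries taken from the sanity-check values reported on this page
records = [("VC0051", 1.92, 0.0139), ("VC2350", -2.40, 0.0130)]
buffer = io.StringIO()
write_tab_delimited(to_genmapp_rows(records), buffer)
text = buffer.getvalue()
```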

Sanity Check: Number of genes significantly changed

  • The spreadsheet was opened and the forGenMAPP worksheet was selected
  • Cell A1 was clicked and the autofilter was turned on via Data > Filter > Autofilter
  • The drop-down arrow on the Pvalue column was clicked. "Number filters" was selected, then "Less than...", and then "0.05" was typed into the window that appeared, in order to filter the "Pvalue" column so that only p-values that are less than 0.05 appear
    • 948 genes out of 5221 were found to have a p-value < 0.05, which is 18.16% of genes
  • The Pvalue column was then filtered so that only p-values that are less than 0.01 appear
    • 235 genes out of 5221 were found to have a p-value < 0.01, which is 4.50% of genes
  • The Pvalue column was then filtered so that only p-values that are less than 0.001 appear
    • 24 genes out of 5221 were found to have a p-value < 0.001, which is 0.46% of genes
  • The Pvalue column was then filtered so that only p-values that are less than 0.0001 appear
    • 2 genes out of 5221 were found to have a p-value < 0.0001, which is 0.038% of genes
  • Bonferroni_Pvalue was then filtered in order to determine the genes that are p < 0.05 for the Bonferroni-corrected p-value
    • 6 genes out of 5221 were found to have a p < 0.05 for the Bonferroni-corrected p-value, which is 0.115% of genes
  • B-H_Pvalue was then filtered in order to determine the genes that are p < 0.05 for the Benjamini and Hochberg-corrected p value
    • 0 genes out of 5221 were found to have a p < 0.05 for the Benjamini and Hochberg-corrected p value, which is 0% of genes
  • Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change greater than zero (while keeping the p-value filter at less than 0.05)
    • 352 genes were found to have an average log fold change greater than zero, which is 6.74% of genes
  • Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change less than zero (while keeping the p-value filter at less than 0.05)
    • 596 genes were found to have an average log fold change less than zero, which is 11.42% of genes
  • Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change > 0.25 (while keeping the p-value filter at less than 0.05)
    • 339 genes were found to have an average log fold change > 0.25, which is 6.49% of genes
  • Avg_LogFC_all was then filtered in order to only show the genes with an average log fold change < -0.25 (while keeping the p-value filter at less than 0.05)
    • 579 genes were found to have an average log fold change < -0.25, which is 11.09% of genes
  • What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
    • Merrell et al. employed the Statistical Analysis for Microarrays (SAM) program with the intensity ratios in order to "identify significant differences in gene expression"; they conducted a two-class SAM analysis, with the in vitro strain being class I and the individual stool samples being class II. Merrell et al. selected genes with statistically significant changes in expression (of at least two-fold) in each patient sample, and this individual stool sample data was used to identify genes whose expression was significantly changed in all three samples. The method used by Merrell et al. is somewhat similar to the method used in this investigation, but it involved the use of another computer program (SAM) and the selection of genes that had at least two-fold changes in expression. The method used in this investigation primarily relied on p-values (less than 0.05) to identify significant changes in gene expression; similar to what was done by Merrell et al., it also used the data from all three patients in order to find changes in gene expression that are significant across all three samples.
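
The autofilter counts and percentages above can be reproduced programmatically. The following Python sketch uses hypothetical function names and a made-up four-gene dataset; only the total of 5221 genes and the reported counts come from this page.

```python
def count_matching(genes, max_p, min_logfc=None, max_logfc=None):
    """Count genes with Pvalue < max_p and, optionally, a fold-change bound."""
    count = 0
    for gene in genes:
        if gene["Pvalue"] >= max_p:
            continue
        if min_logfc is not None and gene["Avg_LogFC_all"] <= min_logfc:
            continue
        if max_logfc is not None and gene["Avg_LogFC_all"] >= max_logfc:
            continue
        count += 1
    return count

def percent_of_total(count, total=5221):
    """Express a gene count as a percentage of all genes, to 2 decimals."""
    return round(100.0 * count / total, 2)

# Made-up rows standing in for the forGenMAPP worksheet
genes = [
    {"ID": "g1", "Avg_LogFC_all": 1.10, "Pvalue": 0.010},
    {"ID": "g2", "Avg_LogFC_all": 0.10, "Pvalue": 0.030},
    {"ID": "g3", "Avg_LogFC_all": -0.90, "Pvalue": 0.040},
    {"ID": "g4", "Avg_LogFC_all": 2.00, "Pvalue": 0.200},
]
significant = count_matching(genes, max_p=0.05)
increased = count_matching(genes, max_p=0.05, min_logfc=0.25)
decreased = count_matching(genes, max_p=0.05, max_logfc=-0.25)
```

For example, percent_of_total(948) gives 18.16, matching the percentage reported for p < 0.05.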

Sanity Check: Compare individual genes with known data

Merrell et al. (2002) report that genes with IDs VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. These genes were looked up in the spreadsheet and their fold changes, p-values, and significance were noted:

  • VC0028 (2 entries were found)
    • Fold Change: first entry = 1.65, second entry = 1.27
    • P-Value: first entry = 0.0474, second entry = 0.0692
    • Significance: first entry = statistically significant, second entry = not statistically significant
  • VC0941 (2 entries were found)
    • Fold Change: first entry = 0.09, second entry = -0.28
    • P-Value: first entry = 0.6759, second entry = 0.1636
    • Significance: first entry = not statistically significant, second entry = not statistically significant
  • VC0869 (5 entries were found)
    • Fold Change (nth entry): 1 = 1.59, 2 = 1.95, 3 = 2.20, 4 = 1.50, 5 = 2.12
    • P-Value (nth entry): 1 = 0.0463, 2 = 0.0227, 3 = 0.0020, 4 = 0.0174, 5 = 0.0200
    • Significance (nth entry): 1 = significant, 2 = significant, 3 = significant, 4 = significant, 5 = significant
  • VC0051 (2 entries were found)
    • Fold Change: first entry = 1.92, second entry = 1.89
    • P-Value: first entry = 0.0139, second entry = 0.0160
    • Significance: first entry = statistically significant, second entry = statistically significant
  • VC0468
    • Fold Change: -0.17
    • P-Value: 0.3350
    • Significance: not statistically significant
  • VC2350
    • Fold Change: -2.40
    • P-Value: 0.0130
    • Significance: statistically significant
  • VCA0583
    • Fold Change: 1.06
    • P-Value: 0.1011
    • Significance: not statistically significant

Statistical Analysis of Vibrio cholerae Microarray Data (Part 2)

  • GenMAPP was launched and the 2009 Vibrio cholerae gene database was downloaded, placed into C:\GenMAPP 2 Data\Gene Databases, and loaded into the program
  • The Data menu in the main Drafting Board window was selected and then Expression Dataset Manager was chosen. In the Expression Dataset Manager window, New Dataset was selected and then the tab-delimited text file that was formatted for GenMAPP (.txt) was chosen
  • Expression Dataset Manager was allowed to convert the data and create a new converted dataset

Error Analysis

  • After conversion, it was found that 772 errors were detected in the raw data by GenMAPP using the 2009 database
  • My partner, Anindita V., found that 121 errors were detected in the raw data by GenMAPP using the 2010 database
  • The .EX.txt file generated by GenMAPP (which contains error messages along with the raw data) was opened and analyzed
    • It was found that the error message for the unprocessed genes was Gene not found in OrderedLocusNames or any related system.
    • Compared to my partner's results (121 errors with the 2010 database), I had many more errors (772) using the 2009 database. Given that the error message for all 772 errors was Gene not found in OrderedLocusNames or any related system., it appears that the older database covers fewer genes than the newer one (GenMAPP seems to report an error whenever the loaded gene database does not contain a gene in the dataset)

Creating Color Sets

  • Increased and Decreased LogFoldChange color sets were created in GenMAPP by going to the Expression Dataset Manager and filling in these fields in the Color Set area: the name for the Color Set, the gene value, and the criteria that determine how a gene object is colored (on the MAPP)
    • The name of the color set was set as LogFoldChange and Avg_LogFC_all was used as the Gene Value
    • For the Increased criterion (increased LogFoldChange), the name was set as Increased, red was used as the color, and the criterion was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • For the decreased criterion (decreased LogFoldChange), the name was set as Decreased, green was used as the color, and the criterion was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
    • The whole Expression Dataset was saved and the Expression Dataset Manager was exited
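
The two color-set criteria above amount to a simple classification rule. A Python sketch of that rule follows; the function name is hypothetical and this is not GenMAPP's own implementation, only a restatement of the criteria. The example values are entries reported in the sanity check on this page.

```python
def criterion_for(avg_logfc_all, pvalue):
    """Return the criterion a gene meets, or None if it meets neither."""
    if avg_logfc_all > 0.25 and pvalue < 0.05:
        return "Increased"   # colored red on the MAPP
    if avg_logfc_all < -0.25 and pvalue < 0.05:
        return "Decreased"   # colored green on the MAPP
    return None

vc0051 = criterion_for(1.92, 0.0139)    # first VC0051 entry
vc2350 = criterion_for(-2.40, 0.0130)   # VC2350
vc0468 = criterion_for(-0.17, 0.3350)   # VC0468
```

Because both criteria require Pvalue < 0.05 and the fold-change ranges do not overlap, a gene can meet at most one of them.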

MAPPFinder Procedure

  • Assigned criterion: Increased, using 2009 Database
  • The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
  • "Calculate New Results" was clicked in the window that appeared upon launching MAPPFinder
  • For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
  • The LogFoldChange color set was selected and the Increased criterion was selected (to filter the data)
  • The boxes corresponding to "Gene Ontology" and "p value" were checked
  • The "Browse" button was clicked to name the results file that would be created
  • "Run MAPPFinder" was clicked and the program was allowed to complete its analysis
  • "Show Ranked List" was clicked to see a list of the most significant Gene Ontology terms
  • Top 10 Ranked GO Terms
  1. biopolymer biosynthetic process
  2. macromolecule biosynthetic process
  3. macromolecule metabolic process
  4. localization
  5. transporter activity
  6. cellular biopolymer biosynthetic process
  7. cellular macromolecule metabolic process
  8. transport
  9. establishment of localization
  10. biopolymer metabolic process
  • In the main MAPPFinder Browser window, "Collapse the Tree" was clicked and these genes were searched for (one by one): VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Each gene ID was entered in the gene ID search field and "OrderedLocusNames" was selected from the drop-down menu to the right of the search field. The GeneID search button was then clicked to start the search for the gene.
  • VC0028: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
  • VC0941: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
  • VC0869: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
  • VC0051: "No MAPPs or GO terms could be found for that OrderedLocusNames ID."
  • VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
  • VC0468:
  • VC2350:
  • VCA0583: transport, outer membrane-bounded periplasmic space, transporter activity

Working with: 2009 Vibrio cholerae database

  • 772 errors were detected in the raw data by GenMAPP using the 2009 database
  • 121 errors were detected in the raw data by GenMAPP using the 2010 database by my partner, Anindita V.

Results of 10/20 Work Session

.gex file for GenMAPP

Results of 10/22 Work Session

  • Top 10 Ranked GO Terms
  1. macromolecule metabolic process
  2. localization
  3. transporter activity
  4. cellular macromolecule metabolic process
  5. transport
  6. establishment of localization
  7. cell projection organization
  8. cellular biopolymer metabolic process
  9. macromolecule biosynthetic process
  10. biopolymer metabolic process


  • Analysis of the .EX.txt file produced by GenMAPP, via Excel, revealed that the 772 errors were: Gene not found in OrderedLocusNames or any related system; this suggests that the 2009 database did not


increased expression


Brandon Litvak
BIOL 367, Fall 2015
