Difference between revisions of "Anuvarsh Week 8"

From LMU BioDB 2015
Jump to: navigation, search
m (Sanity Check: Compare individual genes with known data: made answers to questions bold)
(Map Onto Biological Pathways (GenMAPP & MAPPFinder: filled in some answers)
Line 110: Line 110:
 
==Map Onto Biological Pathways (GenMAPP & MAPPFinder==
 
==Map Onto Biological Pathways (GenMAPP & MAPPFinder==
 
*I launched the GenMAPP software as it was installed on the lab computers in Seaver 120 as of October 20, 2015.
 
*I launched the GenMAPP software as it was installed on the lab computers in Seaver 120 as of October 20, 2015.
*I downloaded the 2010 ''Vibrio cholerae'' Gene Database using the link on the OPEN WET WARE PAGE (change that), which is also available in the .zip file uploaded at the top of this journal
+
*'''I downloaded the 2010 ''Vibrio cholerae'' Gene Database''' using [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download the link on the OpenWetWare page], which is also available in the .zip file uploaded at the top of this journal
**My partner, Brandon L, used the 2009 version of this species' gene database
+
**'''My partner, Brandon L, used the 2009 version of this species' gene database'''
 
*After uploading the 2010 VC Gene Database onto GenMAPP, I followed the following procedure to upload data from the previously made Tab delimited .txt file for GenMAPP
 
*After uploading the 2010 VC Gene Database onto GenMAPP, I followed the following procedure to upload data from the previously made Tab delimited .txt file for GenMAPP
  
Line 118: Line 118:
 
*I selected the tab-delimited text file that I formatted for GenMAPP (.txt). Since the only data included in this file are numerical values, I did not select any boxes in the Data Type Specification window that appeared.  
 
*I selected the tab-delimited text file that I formatted for GenMAPP (.txt). Since the only data included in this file are numerical values, I did not select any boxes in the Data Type Specification window that appeared.  
 
*The Expression Dataset Manger then converted my data for several seconds.
 
*The Expression Dataset Manger then converted my data for several seconds.
**When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
+
**'''When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.'''
**My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio Cholera database.  
+
**'''My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio Cholera database.'''
***This difference in errors is most likely due to the extensiveness of our databases. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version had fewer genes than mine.
+
***'''This difference in errors is most likely due to the extensiveness of our databases. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version had fewer genes than mine.'''
 
*I then customized the new Expression Dataset by creating Color Sets that contain instructions to GenMAPP for displaying data on MAPPs
 
*I then customized the new Expression Dataset by creating Color Sets that contain instructions to GenMAPP for displaying data on MAPPs
 
*To do this, I created a color set named "LogFoldChange" and selected "Avg_LogFC_all" as the data that should be used as the Gene Value. I activated the criteria builder by clicking the New button
 
*To do this, I created a color set named "LogFoldChange" and selected "Avg_LogFC_all" as the data that should be used as the Gene Value. I activated the criteria builder by clicking the New button
Line 135: Line 135:
 
*I then created a meaningful name for my results file, and clicked Run MAPPFinder. MAPPFinder took several seconds to run. After MAPPFinder finishes running, a Gene Ontology browser opened showing my results.
 
*I then created a meaningful name for my results file, and clicked Run MAPPFinder. MAPPFinder took several seconds to run. After MAPPFinder finishes running, a Gene Ontology browser opened showing my results.
 
**The first four times that I tried to run MAPPFinder, the Gene Ontology browser did not appear. I initially thought it was a problem with the program itself, so I backtracked and reuploaded my data to GenMAPP and tried again. After this did not work, I changed the file name for my results to "mappfinder results" and GenMAPP worked. Prior to this, I attempted to save my file as "Merrell_Compiled_Raw_Data_Vibrio_MAPPFinder_AV_20151015" which appeared to not follow the GenMAPP guidelines for file names.
 
**The first four times that I tried to run MAPPFinder, the Gene Ontology browser did not appear. I initially thought it was a problem with the program itself, so I backtracked and reuploaded my data to GenMAPP and tried again. After this did not work, I changed the file name for my results to "mappfinder results" and GenMAPP worked. Prior to this, I attempted to save my file as "Merrell_Compiled_Raw_Data_Vibrio_MAPPFinder_AV_20151015" which appeared to not follow the GenMAPP guidelines for file names.
*I then selected "Show Ranked List" to see the most significant Gene Ontology terms for my data. The top 10 Gene Ontology results were as follows:  
+
*I then selected "Show Ranked List" to see the most significant Gene Ontology terms for my data. '''The top 10 Gene Ontology results were as follows:'''
*#branched chain family amino acid metabolic process
+
*#'''branched chain family amino acid metabolic process'''
*#branched chain family amino acid biosynthetic process
+
*#'''branched chain family amino acid biosynthetic process'''
*#IMP metabolic process
+
*#'''IMP metabolic process'''
*#IMP biosyntehtic process
+
*#'''IMP biosyntehtic process'''
*#purine ribonucleoside monophosphate metabolic process
+
*#'''purine ribonucleoside monophosphate metabolic process'''
*#purine nucleoside monophosphate metabolic process
+
*#'''purine nucleoside monophosphate metabolic process'''
*#purine nucleoside monophosphate biosyntehtic process
+
*#'''purine nucleoside monophosphate biosyntehtic process'''
*#purine ribonucleoside monophosphate biosyntehtic process
+
*#'''purine ribonucleoside monophosphate biosyntehtic process'''
*#arginine metabolic process
+
*#'''arginine metabolic process'''
*#ccellular nitrogen compound biosynthetic process
+
*#'''cellular nitrogen compound biosynthetic process'''
**'''TODO: MY PARTNER'S GO LIST COMPARISON'''
+
**'''Both myself and Brandon had the same top GO terms in the same order. This is most likely because in both cases, these are the GO terms that best fit a larger number of our genes. Despite discrepancies regarding extensiveness between our databases, the listing of top GO terms represents the GO terms that most of the genes in our data fall under. Even with some fewer genes being associated with GO terms in Brandon's 2009 database, the majority of genes still match up to GO terms the same way.'''
*I then searched all of the genes mentioned by Merrell et al. by typing the identifier for these genes into the MAPPFinder browser gene ID search field and selected "OrderedLocusNames" in the drop-down menu to the right of the search field. The GO terms associated with each of those genes are listed below:
+
*I then searched all of the genes mentioned by Merrell et al. by typing the identifier for these genes into the MAPPFinder browser gene ID search field and selected "OrderedLocusNames" in the drop-down menu to the right of the search field. '''The GO terms associated with each of those genes are listed below:'''
**VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity
+
**'''VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity'''
**VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
+
**'''VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity'''
**VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity
+
**'''VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity'''
**VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity
+
**'''VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity'''
**VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
+
**'''VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity'''
**VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glytathione synthase activity
+
**'''VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glytathione synthase activity'''
**VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
+
**'''VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity'''
**VCA0583: outer membrane-bounded periplasmic space
+
**'''VCA0583: outer membrane-bounded periplasmic space'''
**My partner did not have the same results. In fact, my partner only found GO terms for one of the mentioned genes. This is mostly likely because he is using an older Gene Database that does not have information regarding the mentioned genes.
+
**'''My partner did not have the same results. In fact, my partner only found GO terms for one of the mentioned genes. This is mostly likely because he is using an older Gene Database that does not have information regarding the mentioned genes.'''
*I then clicked on the GO term "outer membrane-bounded periplasmic space" and a MAPP open opened that listed all of the genes associated with that GO term. I found VCA0583 in that list by using the [http://www.uniprot.org/ Uniprot site] to find the identifier for that gene, Q9KM06.
+
*'''I then clicked on the GO term "outer membrane-bounded periplasmic space" and a MAPP open opened that listed all of the genes associated with that GO term. I found VCA0583 in that list by using the [http://www.uniprot.org/ Uniprot site] to find the identifier for that gene, Q9KM06.'''
**This gene did not have a significant change in expression
+
**'''This gene did not have a significant change in expression'''
 
*I then double clicked on this gene in order to find out the functions of this gene
 
*I then double clicked on this gene in order to find out the functions of this gene
**This gene is responsible for transporter activity, transport, and outer membrane-bounded periplasmic space
+
**'''This gene is responsible for transporter activity, transport, and outer membrane-bounded periplasmic space'''
 
*I then created a copy of the results file (5-Criterion0-GO.txt) and renamed it to identify it as the MAPPFinder results sheet (This file can be found in the .zip file mentioned at the top of this journal)
 
*I then created a copy of the results file (5-Criterion0-GO.txt) and renamed it to identify it as the MAPPFinder results sheet (This file can be found in the .zip file mentioned at the top of this journal)
 
*I opened this file in Microsoft Excel.
 
*I opened this file in Microsoft Excel.
*'''TO DO:COMPARE THE DATA AT THE TOP OF MY EXCEL SHEET WITH THE DATA AT THE TOP OF BRANDON'S SHEET'''
+
*'''Below is an image of the top of my Criterion file:'''
 +
*[[Image:top-of-criterion-sheet-AV-10162015.png]]
 +
*'''Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.'''
 +
**'''The numbers that were different between me and my partner were the numbers associated with:'''
 +
***'''probes meeting the filter linked to a UniProt ID'''
 +
***'''genes meeting the criterion linked to a GO term'''
 +
***'''probes linked to a UniProt ID'''
 +
***'''genes linked to a GO term'''
 +
***'''N and R distinct genes in the GO'''
 +
**'''In all instances, the numbers that appeared on my criterion sheet were higher than those on Brandon's. This is most likely because the older version of the database that Brandon was using was not as extensive in its linking of probes to UniProt ID's and genes to GO terms. Therefore, the number of probes and genes linked overall and within the filter were fewer in his results than mine. This would also affect the N and R values that are used to determine the z score.'''
 
*I then filtered the data to show the top GO terms represented in my data for both the "increased" and "decreased" criteria. The following are the filters that I applied:
 
*I then filtered the data to show the top GO terms represented in my data for both the "increased" and "decreased" criteria. The following are the filters that I applied:
 
**<code>Z Score (in column N) greater than 2</code>
 
**<code>Z Score (in column N) greater than 2</code>
Line 171: Line 180:
 
**These filters returned the top 20 GO terms  
 
**These filters returned the top 20 GO terms  
 
*The GO terms that are closely related to each other were highlighted in similar colors in the Excel spreadsheet (This file can be found in the .zip file at the top of this journal)
 
*The GO terms that are closely related to each other were highlighted in similar colors in the Excel spreadsheet (This file can be found in the .zip file at the top of this journal)
*'''TODO:Look at the top of the spreadsheet. There are rows of information that give you the background information on how MAPPFinder made the calculations. Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.'''
 
 
*'''TODO:Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae. You need to give a biological interpretation of what do each of these GO terms in your filtered list have to to with the pathogenecity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.'''
 
*'''TODO:Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae. You need to give a biological interpretation of what do each of these GO terms in your filtered list have to to with the pathogenecity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.'''
  

Revision as of 01:03, 27 October 2015

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

  • First, I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls|Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved a copy with my initials and the current date. (This file can be accessed on the page containing the procedure for part 1 or in the .zip file uploaded at the top of this lab assignment.

Normalize the log ratios for the set of slides in the experiment

  • I created a new Worksheet in my Excel file and named it "scaled_centered"
  • I transferred all of the data in the "complied_raw_data" worksheet into the "scaled_centered" worksheet
  • I then created two new rows and titled them "Average" and "StdDev"
  • I computed the Average log ratio for each column of data (where each column represents one chip) using =AVERAGE(B4:B5224), and copied the formula into the rest of the row
  • Then, I computed the Standard Deviation of the log ratios using the equation =STDEV(B4:B5224), and then copied this formula into the rest of the row
  • I then went through and added names to the columns to the right of the raw data with following the format "A1_scaled_centered","A2_scaled_centered_, etc.
  • In cell N4, I typed the following equation: =(B4-B$2)/B$3 and copied this formula into the rest of the column
    • This step is important to do because it normalizes the averages of the data at 0.
  • I repeated this step for all of the "_scaled_centered" columns

Perform statistical analysis on the ratios

  • After creating a new worksheet called "statistics", I copied and pasted the "ID" column from the "scaled_centered" worksheet
  • I then copied over all of the "_scaled_centered" columns from the "scaled_centered" sheet into the "statistics" sheet using Past Special>Values, and deleted the rows titled "Average" and "StDev"
  • I then titled three new columns "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"
  • Under each of these columns, I calculated the average log fold change with the equation =AVERAGE(B2:E2) where the designated cells selected correspond to the scaled and centered data for that patient ("Avg_LogFC_A" has cells from all of the "A#_scaled_centered" columns)
  • The previous step was modified and repeated for each of the other "Avg_LogFC" columns
  • I then computed the averages of the averages in the column titled "Avg_LogFC_all"
  • I created a new header for the column next to "Avg_LogFC_all" and computed the Tstat with the formula =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
  • I then created another header for the column adjacent to the "Tstat" column and named it "Pval"
  • In this column, I calculated the P value using the equation: =TDIST(ABS(R2),degrees of freedom,2)

Calculate the Bonferroni p value correction

  • I created two columns to the right of the "Pval" column and named both of them "Bonferroni_Pvalue"
  • In the first column, I wrote the equation =S2*5221 in the first row and then copied the formula throughout the column
  • In the second column, I created a condition that replaced any p value greater than 1 with a 1 using the equation =IF(T2>1,1,T2)

Calculate the Benjamini & Hochberg p value Correction

  • I created a new worksheet called "B-H_Pvaue"
    • This is important because we will be reordering the data and we want to prevent the rest of the sheet from being affected by the calculations required for the B-H p value
  • I copied the ID column from the previous worksheets into this sheet, and created a "MasterIndex" sheet to the left of the ID column
  • In the MasterIndex column, I typed 1 and 2 into cells A2 and A3, selected these two cells, and then double clicked on the black + sign on the corner of the cell to fill every cell in this cell with corresponding data in the "ID" column
  • I then copied the p values column from the previous sheet into Column C using Paste Special>Values
  • I selected all three columns with data in them and sorted them by ascending values in column C.
  • I then created a new column titled "Rank" and numbered each of the rows in that column using the same strategy as the "MasterIndex" column
  • Two new columns were titled "B-H_Pvalue"
  • In the first "B-H_Pvalue" column, I entered the equation =(C2*5221)/D2 and copied that formula into the rest of the column
  • In the second "B-H_Pvalue" column I created a condition where any B-H p value that is greater than 1 is replaced with the number 1 using the formula =IF(E2>1,1,E2), and copied the formula to the rest of the column
  • I then selected columns A through F and reordered them in ascending order by the Master Index and copies the last "B-H_Pvalue" column into my "statistics" sheet

Prepare file for GenMAPP

  • A new sheet was created and named "forGenMAPP"
  • Everything was copied from the statistics sheet into this sheet using Paste Special>Values
  • All of the values in the fold change columns were modified to only show 2 decimal places
  • All of the columns containing p values were modified to show only 4 decimal places
  • The left-most Bonferroni p value was removed, leaving behind only the column containing the "if" statement
  • A new column was added to the left of the "ID" column was created with the title "SystemCode" and was populated by the value "N" in every row
  • This worksheet was saved as a tab delimited *.txt file

Sanity Check: Number of genes significantly change

  • To determine how many genes fall under each pvalue category, I added a filter to all rows of data and filtered the data according to the values indicated.
    • 948 (18.2%) genes have a p value < 0.05
    • 235 (4.5%) genes have a p value < 0.01
    • 21 (0.4%) genes have a p value < 0.001
    • 0 (0%) genes have a p value < 0.0001
  • Two methods were applies in order to correct the p value and adjust it for the number of Ttests I ran. To determine how many genes fell under p value < 0.05, I filtered the data according to the values indicated in the columns indicated.
    • 0 (0%) genes have a Bonferroni-corrected p value < 0.05
    • 0 (0%) genes have a Benjamini and Hochberg-corrected p value < 0.05
  • To determine which genes had a positive or negative fold change, I filtered all of the genes from the "Avg_LogFC_all" column with the indicated conditions.
    • 352 (6.7%) genes have a positive average log fold change in the "Avg_LogFC_all" column and a p value < 0.05
    • 596 (11.4%) genes have a negative average log fold change in the "Avg_LogFC_all" column and a p value < 0.05
    • 339 (6.4%) genes have an significant positive average log fold change (> 0.25) in the "Avg_LogFC_all" column and a p value < 0.05
    • 578 (11.1%) genes have an significant negative average log fold change (> 0.25) in the "Avg_LogFC_all" column and a p value < 0.05
  • Question: What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
    • Merrell et al. determined statistically significant differences in gene expression using the Statistical Analysis for Microarrays (SAM) program. According to the article, genes that were determined to have statistically significant change were genes that had at least a two fold change in expression. Our method used p values to test for significant differences between experimental and control data.

Sanity Check: Compare individual genes with known data

Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. However, in our data...

  • (All genes were checked by searching the gene name in the "ID" column of the Excel spreadsheet.)
  • VC0028 (two entries exist)
    • Fold Change: 1.67, 1.27
    • P value: 0.0474, 0.0692
    • Significantly changed? The first entry of this gene was significantly changed, but not the second.
  • VC0941 (two entries exist)
    • Fold Change: 0.09, -0.28
    • P value: 0.6759, 0.1636
    • Significantly changed? Neither entry was significantly changed.
  • VC0869 (five entries exist)
    • Fold Change: 1.59, 1.95, 2.2, 1.5, 2.12
    • P value: 0.0463, 0.0227, 0.002, 0.0174, 0.02
    • Significantly changed? All entries were significantly changed.
  • VC0051 (two entries exist)
    • Fold Change: 1.92, 1.89
    • P value: 0.0139, 0.016
    • Significantly changed? Both entries were significantly changed.
  • VC0647 (three entries exist)
    • Fold Change: -1.11, -0.94, -1.05
    • P value: 0.0003, 0.0125, 0.0051
    • Significantly changed? All entries were significantly changed.
  • VC0468 (one entry exists)
    • Fold Change: -0.17
    • P value: 0.335
    • Significantly changed? Not significantly changed.
  • VC2350 (one entry exists)
    • Fold Change: -2.4
    • P value: 0.013
    • Significantly changed? Significantly changed.
  • VCA0583 (one entry exists)
    • Fold Change: 1.06
    • P value: 0.1011
    • Significantly changed? Not significantly changed.

MAPPFinder Analysis of Vibrio cholerae Microarray Data (Part 2)

Map Onto Biological Pathways (GenMAPP & MAPPFinder

  • I launched the GenMAPP software as it was installed on the lab computers in Seaver 120 as of October 20, 2015.
  • I downloaded the 2010 Vibrio cholerae Gene Database using the link on the OpenWetWare page, which is also available in the .zip file uploaded at the top of this journal
    • My partner, Brandon L, used the 2009 version of this species' gene database
  • After uploading the 2010 VC Gene Database onto GenMAPP, I followed the following procedure to upload data from the previously made Tab delimited .txt file for GenMAPP

GenMAPP Expression Dataset Manager Procedure

  • I opened the Expression Dataset Manager from the Data menu in the main drafting board window and selected "New Dataset"
  • I selected the tab-delimited text file that I formatted for GenMAPP (.txt). Since the only data included in this file are numerical values, I did not select any boxes in the Data Type Specification window that appeared.
  • The Expression Dataset Manger then converted my data for several seconds.
    • When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
    • My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio Cholera database.
      • This difference in errors is most likely due to the extensiveness of our databases. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version had fewer genes than mine.
  • I then customized the new Expression Dataset by creating Color Sets that contain instructions to GenMAPP for displaying data on MAPPs
  • To do this, I created a color set named "LogFoldChange" and selected "Avg_LogFC_all" as the data that should be used as the Gene Value. I activated the criteria builder by clicking the New button
  • Two criterion were created: one for increased expression of data and another for decreased expression of data
    • "Increased" was defined as [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • "Decreased was defined as [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
  • I saved the entire Expression Dataset by selected Save from the Expression Dataset menu and then exited from the Expression Dataset Manager to view the Color Sets on a MAPP

MAPPFinder Procedure

  • I launched the MAPPFinder program from within GenMAPP by selecting MAPPFinder under the Tools menu on GenMAPP
  • After ensuring that the Gene Database for the correct species was loaded, I selected the button "Calculate New Results"
  • I clicked on "Find File" and selected the .gex file that contained my Expression Dataset (This file can be found in the .zip file at the beginning of this journal)
  • I chose the LogFoldChange color set and the Increased criteria in the right-hand box, and selected the boxes for "Gene Ontology" and "p value"
  • I then created a meaningful name for my results file, and clicked Run MAPPFinder. MAPPFinder took several seconds to run. After MAPPFinder finishes running, a Gene Ontology browser opened showing my results.
    • The first four times that I tried to run MAPPFinder, the Gene Ontology browser did not appear. I initially thought it was a problem with the program itself, so I backtracked and reuploaded my data to GenMAPP and tried again. After this did not work, I changed the file name for my results to "mappfinder results" and GenMAPP worked. Prior to this, I attempted to save my file as "Merrell_Compiled_Raw_Data_Vibrio_MAPPFinder_AV_20151015" which appeared to not follow the GenMAPP guidelines for file names.
  • I then selected "Show Ranked List" to see the most significant Gene Ontology terms for my data. The top 10 Gene Ontology results were as follows:
    1. branched chain family amino acid metabolic process
    2. branched chain family amino acid biosynthetic process
    3. IMP metabolic process
    4. IMP biosyntehtic process
    5. purine ribonucleoside monophosphate metabolic process
    6. purine nucleoside monophosphate metabolic process
    7. purine nucleoside monophosphate biosyntehtic process
    8. purine ribonucleoside monophosphate biosyntehtic process
    9. arginine metabolic process
    10. cellular nitrogen compound biosynthetic process
    • Both myself and Brandon had the same top GO terms in the same order. This is most likely because in both cases, these are the GO terms that best fit a larger number of our genes. Despite discrepancies regarding extensiveness between our databases, the listing of top GO terms represents the GO terms that most of the genes in our data fall under. Even with some fewer genes being associated with GO terms in Brandon's 2009 database, the majority of genes still match up to GO terms the same way.
  • I then searched all of the genes mentioned by Merrell et al. by typing the identifier for these genes into the MAPPFinder browser gene ID search field and selected "OrderedLocusNames" in the drop-down menu to the right of the search field. The GO terms associated with each of those genes are listed below:
    • VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity
    • VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
    • VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity
    • VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity
    • VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
    • VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glytathione synthase activity
    • VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
    • VCA0583: outer membrane-bounded periplasmic space
    • My partner did not have the same results. In fact, my partner only found GO terms for one of the mentioned genes. This is mostly likely because he is using an older Gene Database that does not have information regarding the mentioned genes.
  • I then clicked on the GO term "outer membrane-bounded periplasmic space" and a MAPP open opened that listed all of the genes associated with that GO term. I found VCA0583 in that list by using the Uniprot site to find the identifier for that gene, Q9KM06.
    • This gene did not have a significant change in expression
  • I then double clicked on this gene in order to find out the functions of this gene
    • This gene is responsible for transporter activity, transport, and outer membrane-bounded periplasmic space
  • I then created a copy of the results file (5-Criterion0-GO.txt) and renamed it to identify it as the MAPPFinder results sheet (This file can be found in the .zip file mentioned at the top of this journal)
  • I opened this file in Microsoft Excel.
  • Below is an image of the top of my Criterion file:
  • Top-of-criterion-sheet-AV-10162015.png
  • Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.
    • The numbers that were different between me and my partner were the numbers associated with:
      • probes meeting the filter linked to a UniProt ID
      • genes meeting the criterion linked to a GO term
      • probes linked to a UniProt ID
      • genes linked to a GO term
      • N and R distinct genes in the GO
    • In all instances, the numbers that appeared on my criterion sheet were higher than those on Brandon's. This is most likely because the older version of the database that Brandon was using was not as extensive in its linking of probes to UniProt ID's and genes to GO terms. Therefore, the number of probes and genes linked overall and within the filter were fewer in his results than mine. This would also affect the N and R values that are used to determine the z score.
  • I then filtered the data to show the top GO terms represented in my data for both the "increased" and "decreased" criteria. The following are the filters that I applied:
    • Z Score (in column N) greater than 2
    • PermuteP (in column O) less than 0.05
    • Number Changed (in column I) greater than or equal to 5 AND less than 100
    • Percent Changed (in column L) greater than or equal to 26%
    • These filters returned the top 20 GO terms
  • The GO terms that are closely related to each other were highlighted in similar colors in the Excel spreadsheet (This file can be found in the .zip file at the top of this journal)
  • TODO:Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae. You need to give a biological interpretation of what do each of these GO terms in your filtered list have to to with the pathogenecity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

Conclusion

  • Write a paragraph that briefly summarizes and gives a scientific conclusion for the work that you did this week.

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS

Assignment Pages

Week 1 Assignment
Week 2 Assignment
Week 3 Assignment
Week 4 Assignment
Week 5 Assignment
Week 6 Assignment
Week 7 Assignment
Week 8 Assignment
Week 9 Assignment
Week 10 Assignment
Week 11 Assignment
Week 12 Assignment
No Week 13 Assignment
Week 14 Assignment
Week 15 Assignment

Individual Journals

Individual Journal Week 2
Individual Journal Week 3
Individual Journal Week 4
Individual Journal Week 5
Individual Journal Week 6
Individual Journal Week 7
Individual Journal Week 8
Individual Journal Week 9
Individual Journal Week 10
Individual Journal Week 11
Individual Journal Week 12
Individual Journal Week 14
Individual Journal Week 15

Shared Journals

Class Journal Week 1
Class Journal Week 2
Class Journal Week 3
Class Journal Week 4
Class Journal Week 5
Class Journal Week 6
Class Journal Week 7
Class Journal Week 8
Class Journal Week 9
GÉNialOMICS Journal Week 10
GÉNialOMICS Journal Week 11
GÉNialOMICS Journal Week 12
GÉNialOMICS Journal Week 14
GÉNialOMICS Journal Week 15