Difference between revisions of "Jkuroda Week 8"

From LMU BioDB 2015
(genMAPP part)
(finishing up)
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-20.xls | updated spreadsheet]]
+
[[Media:Kuroda week8Files.zip | '''zip file with most of the required files''']]
  
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-20.txt | txt format]]
+
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-22-Criterion0-GO - Colored.xlsx | colored criterion xlsx file]]
  
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-20.EX.txt | exception file (2010 version)]]
+
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-20.xls | updated spreadsheet]]
 
+
[[Media:Merrell Compiled Raw Data Vibrio JK 2015-10-20.gex | gex file (2010 version)]]
+
  
 
== Normalized the log ratios for the set of slides in the experiment ==
 
== Normalized the log ratios for the set of slides in the experiment ==
Line 85: Line 83:
 
* Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.  My new *.txt file is now ready for import into GenMAPP.  
 
* Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.  My new *.txt file is now ready for import into GenMAPP.  
 
** Uploaded both the .xls and .txt files that you have just created to my journal page in the class wiki.
 
** Uploaded both the .xls and .txt files that you have just created to my journal page in the class wiki.
 +
 +
== Sanity Checked: Number of genes significantly changed ==
 +
 +
Before I moved on to the GenMAPP/MAPPFinder analysis, I performed a sanity check to make sure that I performed my data analysis correctly.  I found out the number of genes that were significantly changed at various p value cut-offs and also compared my data analysis with the published results of Merrell et al. (2002).
 +
 +
* Opened my spreadsheet and went to the "forGenMAPP" tab.
 +
* Clicked on cell A1 and selected the menu item Data > Filter > Autofilter.
 +
* Clicked on the drop-down arrow on my "Pvalue" column.  Selected "Custom".  In the window that appeared, I set a criterion that will filter my data so that the Pvalue has to be less than 0.05.
 +
** '''''How many genes have p value < 0.05? and what is the percentage (out of 5221)?'''''
 +
948 genes for a percentage of ~18%
 +
** '''''What about p < 0.01? and what is the percentage (out of 5221)?'''''
 +
235 genes for a percentage of ~5%
 +
** '''''What about p < 0.001? and what is the percentage (out of 5221)?'''''
 +
24 genes for a percentage of ~0.5%
 +
** '''''What about p < 0.0001? and what is the percentage (out of 5221)?'''''
 +
2 genes for a percentage of ~0.04%
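The percentages above can be re-derived with a few lines of Python. This is a minimal sketch that sits outside the Excel workflow I actually used; the helper name `percent_of_total` is my own, and the counts are simply the ones reported above.

```python
# Gene counts at each unadjusted p value cut-off, copied from the answers above.
TOTAL_GENES = 5221
counts = {0.05: 948, 0.01: 235, 0.001: 24, 0.0001: 2}


def percent_of_total(count, total=TOTAL_GENES):
    """Return count expressed as a percentage of total."""
    return 100.0 * count / total


for cutoff, count in counts.items():
    print(f"p < {cutoff}: {count} genes = {percent_of_total(count):.2f}%")
```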
 +
* When I use a p value cut-off of p < 0.05, what I am saying is that, if there were no real change in expression, I would have seen a gene expression change that deviates this far from zero less than 5% of the time by chance.
 +
* I have just performed 5221 T tests for significance.  Another way to state what I am seeing with p < 0.05 is that I would expect to see this magnitude of a gene expression change by chance alone in about 5% of my T tests, or about 261 times. Since I have more than 261 genes that pass this cut-off, I know that some genes are significantly changed.  However, I don't know ''which'' ones.  To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values.  The Bonferroni correction is very stringent.  The Benjamini-Hochberg correction is less stringent.  To see this relationship, I filtered my data to determine the following:
 +
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?'''''
 +
0 genes for a percentage of ~0%
 +
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?'''''
 +
0 genes for a percentage of ~0%
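To make the two corrections concrete, here is a minimal Python sketch. It is an assumption on my part that the spreadsheet formulas behave like the textbook versions shown here (Bonferroni multiplies each p value by the number of tests; Benjamini-Hochberg scales each p value by the number of tests divided by its rank and then enforces monotonicity). The function names and the toy p values are hypothetical, not taken from my data.

```python
TOTAL_TESTS = 5221  # number of T tests performed (one per gene)


def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Textbook step-up Benjamini-Hochberg adjusted p values."""
    m = len(pvalues)
    # Indices of the p values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values never increase
    # as the raw p value gets smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Number of genes expected to pass p < 0.05 by chance alone.
expected_by_chance = 0.05 * TOTAL_TESTS

# Toy p values for illustration only.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.20]
print(round(expected_by_chance))
print(bonferroni(toy_pvalues))
print(benjamini_hochberg(toy_pvalues))
```

In the toy example the Bonferroni-adjusted values grow much faster than the Benjamini-Hochberg ones, which is the sense in which Bonferroni is the more stringent of the two.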
 +
* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant".  Instead, it is a movable confidence level.  If I want to be very confident of our data, I would use a small p value cut-off.  If I am OK with being less confident about a gene expression change and want to include more genes in my analysis, I can use a larger p value cut-off. 
 +
* The "Avg_LogFC_all" column tells me the size of the gene expression change and its direction.  Positive values are increases relative to the control; negative values are decreases relative to the control.
 +
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero.  '''''How many are there? (and %)'''''
 +
352 genes for a percentage of ~7%
 +
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero.  '''''How many are there? (and %)'''''
 +
596 genes for a percentage of ~11%
 +
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
 +
339 genes for a percentage of ~6%
 +
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)'''''  (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
 +
579 genes for a percentage of ~11%
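The combined filter on "Pvalue" and "Avg_LogFC_all" can be sketched in Python as follows. The rows are hypothetical placeholders rather than values from my spreadsheet, and the conversion at the end assumes the log fold changes are base 2, which is my assumption about how the log ratios were computed.

```python
# Hypothetical rows: (gene ID, average log fold change, unadjusted p value).
rows = [
    ("gene_a", 0.40, 0.01),
    ("gene_b", -0.30, 0.04),
    ("gene_c", 0.10, 0.02),   # significant p value, but change too small
    ("gene_d", -0.90, 0.20),  # large change, but p value too high
]

FC_CUTOFF = 0.25
P_CUTOFF = 0.05

# Same logic as the two Excel autofilters applied together.
increased = [g for g, fc, p in rows if p < P_CUTOFF and fc > FC_CUTOFF]
decreased = [g for g, fc, p in rows if p < P_CUTOFF and fc < -FC_CUTOFF]

# Assuming base-2 logs, a log fold change of 0.25 is roughly a 1.19-fold change.
linear_fold_change = 2 ** FC_CUTOFF

# Total genes passing either criterion in my data (counts from the answers above).
total_passing = 339 + 579

print(increased, decreased, round(linear_fold_change, 2), total_passing)
```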
 +
* For the GenMAPP analysis below, I will use the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut-off of p < 0.05 because I want to include several hundred genes in my analysis.
 +
* '''''What criteria did Merrell et al. (2002) use to determine a significant gene expression change?  How does it compare to our method?'''''
 +
According to Merrell et al. (2002): "A two-class SAM analysis was conducted using the strain grown in vitro as class I, and each individual stool sample as class II. Genes with statistically significant changes in the level of expression—at least a twofold change—in each patient sample were chosen, and the derived data from individual stool samples were collapsed to identify genes that were differentially regulated in all three samples."
 +
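This differs from our method: they used a SAM analysis combined with an at-least-twofold change threshold and did not seem to report per-gene p values, whereas we relied on unadjusted T test p values (with a log fold change cut-off of ±0.25 for the GenMAPP analysis).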
 +
 +
== Sanity Checked:  Compared individual genes with known data ==
 +
 +
* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data.  Look these genes up in your spreadsheet.  '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''
 +
VC0028: FC: 0.90, P VAL: 0.0708
 +
 +
VC0941: FC: -0.28, P VAL: 0.1636
 +
 +
VC0051: FC: 1.92, P VAL: 0.0139
 +
 +
VC0647: FC: -1.11, P VAL: 0.0003
 +
 +
VC0468: FC: -0.17, P VAL: 0.3350
 +
 +
VC2350: FC: -2.40, P VAL: 0.0130
 +
 +
VCA0583: FC: 1.06, P VAL: 0.1011
 +
 +
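In our analysis, VC0051, VC0647, and VC2350 pass the unadjusted p < 0.05 cut-off, while VC0028, VC0941, VC0468, and VCA0583 do not. None of these genes is significantly changed after the Bonferroni or Benjamini and Hochberg corrections, since no gene in the data set passed p < 0.05 for either corrected p value.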
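As a cross-check on the values listed above, the following minimal Python sketch applies the unadjusted criteria chosen for the GenMAPP analysis (p < 0.05 and a log fold change beyond ±0.25), plus a Bonferroni-style check against all 5221 tests. The function name `meets_criteria` is my own; the numbers are the ones recorded above.

```python
# (fold change, unadjusted p value) for each gene, as listed above.
reported = {
    "VC0028": (0.90, 0.0708),
    "VC0941": (-0.28, 0.1636),
    "VC0051": (1.92, 0.0139),
    "VC0647": (-1.11, 0.0003),
    "VC0468": (-0.17, 0.3350),
    "VC2350": (-2.40, 0.0130),
    "VCA0583": (1.06, 0.1011),
}

TOTAL_TESTS = 5221


def meets_criteria(fold_change, pvalue, fc_cutoff=0.25, p_cutoff=0.05):
    """True if the gene passes both the p value and fold change cut-offs."""
    return pvalue < p_cutoff and abs(fold_change) > fc_cutoff


passing = sorted(g for g, (fc, p) in reported.items() if meets_criteria(fc, p))

# Bonferroni-style check: p value multiplied by the number of tests, capped at 1.
passing_bonferroni = [
    g for g, (fc, p) in reported.items() if min(1.0, p * TOTAL_TESTS) < 0.05
]

print(passing)
print(passing_bonferroni)
```

Under these assumptions three of the seven listed genes pass the unadjusted criteria, and none survives the Bonferroni-style check, which is consistent with zero genes passing the corrected cut-offs in the whole data set.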
 +
  
 
== GenMAPP Expression Dataset Manager Procedure ==
 
== GenMAPP Expression Dataset Manager Procedure ==
Line 148: Line 202:
 
* To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".   
 
* To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".   
 
** '''List the top 10 Gene Ontology terms in your individual journal entry.'''
 
** '''List the top 10 Gene Ontology terms in your individual journal entry.'''
* branched chain family amino acid metabolic process
+
# branched chain family amino acid metabolic process
* branched chain family amino acid biosynthetic process
+
# branched chain family amino acid biosynthetic process
* IMP metabolic process
+
# IMP metabolic process
* IMP biosynthetic process
+
# IMP biosynthetic process
* purine ribonucleoside monophosphate metabolic process
+
# purine ribonucleoside monophosphate metabolic process
* purine ribonucleoside monophosphate biosynthetic process
+
# purine ribonucleoside monophosphate biosynthetic process
* purine nucleoside monophosphate biosynthetic process
+
# purine nucleoside monophosphate biosynthetic process
* purine nucleoside monophosphate metabolic process
+
# purine nucleoside monophosphate metabolic process
* 'de novo' IMP biosynthetic process
+
# 'de novo' IMP biosynthetic process
* arginine metabolic process
+
# arginine metabolic process
  
 
** '''Compare your list with your partner who used a different version of the Gene Database.  Are your terms the same or different?  Why do you think that is?  Record your answer in your individual journal entry.'''
 
** '''Compare your list with your partner who used a different version of the Gene Database.  Are your terms the same or different?  Why do you think that is?  Record your answer in your individual journal entry.'''
* Our terms ended up looking differently from one another, and I believe this is because of the growth of the data over the year between 2009 and 2010 and possibly due to the growth of gene ontology itself.
+
Our lists show different top terms, and I think this is because we are using different versions of the same database. Over the course of a year, genes may have been added or annotated with different Gene Ontology terms, which would change which terms rank highest.
  
 
* One of the things you can do in MAPPFinder is to find the Gene Ontology term(s) with which a particular gene is associated. First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. Chose "OrderedLocusNames" from the drop-down menu to the right of the search field.  Clicked on the GeneID Search button.  The GO term(s) that are associated with that gene will be highlighted in blue. '''List the GO terms associated with each of those genes in your individual journal.  (Note: they might not all be found.)  Are they the same as your partner who is using a different Gene Database?  Why or why not?'''
 
* One of the things you can do in MAPPFinder is to find the Gene Ontology term(s) with which a particular gene is associated. First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. Chose "OrderedLocusNames" from the drop-down menu to the right of the search field.  Clicked on the GeneID Search button.  The GO term(s) that are associated with that gene will be highlighted in blue. '''List the GO terms associated with each of those genes in your individual journal.  (Note: they might not all be found.)  Are they the same as your partner who is using a different Gene Database?  Why or why not?'''
Line 174: Line 228:
 
*** They are not the same as my partner's, probably because he is using the older version of the gene database and therefore has fewer genes with which a GO term can be associated.
 
*** They are not the same as my partner's, probably because he is using the older version of the gene database and therefore has fewer genes with which a GO term can be associated.
  
* Click on one of the GO terms that are associated with one of the genes you looked up in the previous step. A MAPP will open listing all of the genes (as boxes) associated with that GO term. The genes named within the map are based on the UniProt identification system. To match the gene of interest to its identification go to the [http://www.uniprot.org/ UniProt site] and type in your gene ID into the search bar. Moreover, the genes on the MAPP will be color-coded with the gene expression data from the microarray experiment.  '''List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.'''
+
* Clicked on one of the GO terms that were associated with one of the genes I looked up in the previous step. A MAPP will open listing all of the genes (as boxes) associated with that GO term. The genes named within the map are based on the UniProt identification system. To match the gene of interest to its identification, I went to the [http://www.uniprot.org/ UniProt site] and typed in my gene ID into the search bar. Moreover, the genes on the MAPP will be color-coded with the gene expression data from the microarray experiment.  '''List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.'''
 
I clicked on the transporter activity GO term. I looked for Q9KM06 and its color was gray, which meant no criteria were met.  
 
I clicked on the transporter activity GO term. I looked for Q9KM06 and its color was gray, which meant no criteria were met.  
** Double-click on the gene box.  This will open a Internet Explorer window called the "Backpage" for this gene.  This page has links to pages for this gene in the public databases.  '''Click on the links to find out the function of this gene and record your answer in your individual journal page.'''
+
** Double-clicked on the gene box.  This will open a window called the "Backpage" for this gene.  This page has links to pages for this gene in the public databases.  '''Click on the links to find out the function of this gene and record your answer in your individual journal page.'''
 
According to NCBI, the gene is protein-coding and encodes a hypothetical protein.
 
According to NCBI, the gene is protein-coding and encodes a hypothetical protein.
  
** The MAPP that has just been created is stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO.  '''Upload this file and link to it in your journal.'''
+
** "XXX" refers to the name I gave to my results file.
* In Windows, make a copy of your results (XXX-CriterionX-GO.txt) file. 
+
** "CriterionX" refers to either "Criterion0" or "Criterion1".  Since computers start counting at zero, "Criterion0" is the first criterion in the list I clicked on and "Criterion1" is the second criterion in the list I clicked on.
** "XXX" refers to the name you gave to your results file.
+
** "CriterionX" refers to either "Criterion0" or "Criterion1".  Since computers start counting at zero, "Criterion0" is the first criterion in the list you clicked on ("Increased" if you followed the directions) and "Criterion1" is the second criterion in the list you clicked on ("Decreased" if you followed the directions).
+
 
** '''Upload your results file to your journal page.'''
 
** '''Upload your results file to your journal page.'''
* Launch Microsoft Excel.  Open the copies of the .txt files in Excel (you will need to "Show all files" and click "Finish" to the wizard that will open your file).  This will show you the same data that you saw in the MAPPFinder Browser, but in tabular form.
+
* Launched Microsoft Excel.  Opened the copies of the .txt files in Excel.  This showed me the same data that I saw in the MAPPFinder Browser, but in tabular form.
* Look at the top of the spreadsheet.  There are rows of information that give you the background information on how MAPPFinder made the calculations.  '''Compare this information with your partner who used a different version of the Vibrio Gene Database.  Which numbers are different?  Why are they different?  Record this information in your individual journal entry.'''
+
* Looked at the top of the spreadsheet.  There were rows of information that gave me the background information on how MAPPFinder made the calculations.  '''Compare this information with your partner who used a different version of the Vibrio Gene Database.  Which numbers are different?  Why are they different?  Record this information in your individual journal entry.'''
* You will filter this list to show the top GO terms represented in your data for both the "Increased" and "Decreased" criteria.  You will need to filter your list down to about 20 terms.  Click on a cell in the row of headers for the data.  Then go to the Data menu and click "Filter > Autofilter".  Drop-down arrows will appear in the row of headers. You can now choose to filter the data.  Click on the drop-down arrow for the column you wish to filter and choose "(Custom…)".  A window will open giving you choices on how you want to filter.  You must set these two filters:
+
The number of probes that met the [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05 criteria, the number of probes that met the filter and linked to a UniProt ID, the number of genes that met the criterion and linked to a GO term, the number of probes that linked to a UniProt ID, the number of genes linked to a GO term, and the N and R values used to calculate the z score were all different. These numbers differ because MAPPFinder ran its calculations against two different versions of the Gene Database, so the counts of linked probes and genes, and the z scores derived from them, differ as well.
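To see why different N and R values lead to different z scores, here is a minimal Python sketch of the z score as I understand it from the MAPPFinder documentation. Treat the exact formula as my assumption rather than a verified copy of the program's code: N is the number of genes measured, R the number meeting the criterion, n the number of genes in a GO term, and r the number in that term meeting the criterion. The example numbers are hypothetical.

```python
import math


def go_term_z_score(r, n, R, N):
    """Z score comparing r observed changed genes in a term with the number expected."""
    expected = n * R / N
    # Variance term with a finite-population correction, as in a hypergeometric model.
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


# Hypothetical example: the same GO term scored against two database versions.
z_first_version = go_term_z_score(r=8, n=20, R=100, N=1000)
z_second_version = go_term_z_score(r=8, n=25, R=120, N=1100)
print(round(z_first_version, 2), round(z_second_version, 2))
```

Because N, R, and n all depend on how many genes the database links to UniProt IDs and GO terms, the same experimental data can give different z scores, and therefore different rankings, with different database versions.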
 +
 
 +
* I filtered this list to show the top GO terms represented in my data for the "Increased" criteria.  I filtered my list down to about 20 terms.  Clicked on a cell in the row of headers for the data.  Then I went to the Data menu and clicked "Filter > Autofilter".  Drop-down arrows appeared in the row of headers. I set these two filters:
 
  Z Score (in column N) greater than 2
 
  Z Score (in column N) greater than 2
 
  PermuteP (in column O) less than 0.05
 
  PermuteP (in column O) less than 0.05
  
:You will use these two filters depending on the number of terms you have:
+
:I used these two filters:
  
 
  Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
 
  Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
 
  Percent Changed (in column L) greater than or equal to 25-50%
 
  Percent Changed (in column L) greater than or equal to 25-50%
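The four filters above can be sketched in Python as a single predicate. The rows, field names, and the specific thresholds chosen within the stated ranges (at least 4 changed genes, at least 25% changed) are hypothetical choices for illustration, not the exact settings or data from my results file.

```python
# Hypothetical GO term rows; field names mirror the spreadsheet column headers.
terms = [
    {"name": "term A", "z_score": 3.1, "permute_p": 0.01, "number_changed": 6, "percent_changed": 40.0},
    {"name": "term B", "z_score": 1.5, "permute_p": 0.01, "number_changed": 8, "percent_changed": 60.0},
    {"name": "term C", "z_score": 4.0, "permute_p": 0.20, "number_changed": 5, "percent_changed": 30.0},
    {"name": "term D", "z_score": 2.5, "permute_p": 0.03, "number_changed": 2, "percent_changed": 50.0},
    {"name": "term E", "z_score": 2.2, "permute_p": 0.04, "number_changed": 150, "percent_changed": 26.0},
]


def keep_term(term, min_changed=4, max_changed=100, min_percent=25.0):
    """Apply the Z Score, PermuteP, Number Changed, and Percent Changed filters."""
    return (
        term["z_score"] > 2
        and term["permute_p"] < 0.05
        and min_changed <= term["number_changed"] < max_changed
        and term["percent_changed"] >= min_percent
    )


filtered = [t["name"] for t in terms if keep_term(t)]
print(filtered)
```

Loosening or tightening `min_changed` and `min_percent` within the ranges given above is how the list can be brought down to about 20 terms.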
  
* Save your changes to an Excel spreadsheet.  Select File > Save As and select Excel workbook (.xls) from the drop-down menu.  Your filter settings won’t be saved in a .txt file.
+
* Saved my changes to an Excel spreadsheet.  Selected File > Save As and selected Excel workbook (.xls) from the drop-down menu.  My filter settings would not have been saved in a .txt file.
 
* '''Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list?  You can judge this by comparing your spreadsheet with the MAPPFinder browser.  Highlight the terms that fit this relationship with the same color in your Excel spreadsheet.  Upload your .xls file to your journal page.'''
 
* '''Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list?  You can judge this by comparing your spreadsheet with the MAPPFinder browser.  Highlight the terms that fit this relationship with the same color in your Excel spreadsheet.  Upload your .xls file to your journal page.'''
* '''Interpret your results.  Look up the definitions for any GO terms that are unfamiliar to you.  The "official" definitions for GO terms can be found at [http://www.geneontology.org http://www.geneontology.org].  You can use one of the online biological dictionaries as a supplement, if needed.  Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived ''Vibrio cholerae''.  You need to give a biological interpretation of what do each of these GO terms in your filtered list have to to with the pathogenecity of the bacterium?  You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words.  This is where the real "brain power" comes in with interpreting DNA microarray data.  Even experienced scientists struggle with this part.  Use your creativity as a scientist to stretch your brain in this question.'''
+
Yes, most of the GO terms were related to each other, usually in pairs.
* '''There is one other file you need to save to your journal page. It has a .gmf extension and should be in the same fold as the .gex file that you created with the GenMAPP Expression Dataset Manager. You will need this file to re-open your results in MAPPFinder.'''
+
 
 +
* '''Interpret your results.  Look up the definitions for any GO terms that are unfamiliar to you.  The "official" definitions for GO terms can be found at [http://www.geneontology.org http://www.geneontology.org].  You can use one of the online biological dictionaries as a supplement, if needed.  Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived ''Vibrio cholerae'').  You need to give a biological interpretation of what each of these GO terms in your filtered list has to do with the pathogenicity of the bacterium.  You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words.  This is where the real "brain power" comes in with interpreting DNA microarray data.  Even experienced scientists struggle with this part.  Use your creativity as a scientist to stretch your brain in this question.'''
 +
 
 +
According to the Gene Ontology Database, the branched chain family amino acid metabolic process is "the chemical reactions and pathways involving amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine." This may be significant for the bacterium's pathogenicity because a branched carbon skeleton is involved, which may be important in entering tissue.
 +
The GO Database states that the branched chain family amino acid biosynthetic process is "the chemical reactions and pathways resulting in the formation of amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine." This process differs from the former in that it is a biosynthetic process instead of a metabolic process, which simply means this term is used to designate the end product of the metabolic process.
 +
The IMP metabolic process is "the chemical reactions and pathways involving IMP, inosine monophosphate." Because IMP is known to be associated with autoimmune diseases and immunotherapy, it makes sense that we see IMP metabolic processes here, since one of the ways a bacterium's pathogenicity is determined is by its ability to immunosuppress the host.
 +
The IMP biosynthetic process is "the chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate." This biosynthetic process is included because the IMP metabolic process is involved.
 +
The purine ribonucleoside monophosphate metabolic process is "the chemical reactions and pathways involving purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar." This term's significance lies in its ties to purine, which is linked to complementary base pairing in both DNA and RNA. A bacterium's ability to pass on DNA and possibly hijack nutrients is vital to its pathogenicity.
 +
The purine ribonucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar." This biosynthetic process is involved because of the fact that its metabolic process is included above.
 +
The purine nucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar." Once again, this process is related to nucleic acids, and is thus important in the life of a bacterium, whose sole purpose is to survive within an organism.
 +
The purine nucleoside monophosphate metabolic process is "the chemical reactions and pathways involving purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar."
 +
This is the metabolic process that corresponds to the process described above.
 +
The ’de novo’ IMP biosynthetic process is "the chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate, by the stepwise assembly of a purine ring on ribose 5-phosphate."
 +
This process brings together two of the concepts brought up earlier: IMP, which is related to immunosuppression, and purine, the organic compound that is seen in base pairing. It seems to me that this is the process that enables the bacterium to change some parts of the host's DNA so that an immunosuppressant is created.
 +
The arginine metabolic process is "the chemical reactions and pathways involving arginine, 2-amino-5-(carbamimidamido)pentanoic acid."
 +
This process seems to be focused on arginine, a conditionally essential amino acid in humans. One possible reason why this process is relevant is that the bacterium may use the body's reliance on this amino acid against it.
 +
The ribonucleoside monophosphate metabolic process is "the chemical reactions and pathways involving a ribonucleoside monophosphate, a compound consisting of a nucleobase linked to a ribose sugar esterified with phosphate on the sugar."
 +
This is another process that is related to the DNA of the subject as well as base pairing. It seems to me that this bacterium is highly involved in the altering of DNA, although I do not know whether it alters its host's DNA or its own.
 +
The ribonucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of a ribonucleoside monophosphate, a compound consisting of a nucleobase linked to a ribose sugar esterified with phosphate on the sugar."
 +
This is the biosynthetic process that correlates to the metabolic process described above.
 +
The glutamine family amino acid metabolic process is "the chemical reactions and pathways involving amino acids of the glutamine family, comprising arginine, glutamate, glutamine and proline."
 +
It looks like this process is related to the arginine metabolic process, and it is also highly focused on glutamine, another conditionally essential amino acid. This amino acid is involved in many processes, including protein synthesis and cellular energy. This leads me to believe that this bacterium also focuses on the processes for which glutamine is vital.
 +
The purine nucleotide biosynthetic process is "the chemical reactions and pathways resulting in the formation of a purine nucleotide, a compound consisting of nucleoside (a purine base linked to a deoxyribose or ribose sugar) esterified with a phosphate group at either the 3' or 5'-hydroxyl group of the sugar." The purine nucleotide metabolic process is "the chemical reactions and pathways involving a purine nucleotide, a compound consisting of nucleoside (a purine base linked to a deoxyribose or ribose sugar) esterified with a phosphate group at either the 3' or 5'-hydroxyl group of the sugar." These two processes seem to be involved in events at the nucleotide level. Since we have already talked about purine, it makes sense that these two processes are just as significant for this bacterium.
 +
The arginine biosynthetic process is "the chemical reactions and pathways resulting in the formation of arginine, 2-amino-5-(carbamimidamido)pentanoic acid."
 +
This process is directly related to the arginine metabolic process, and simply describes the product.
 +
Cell projection organization is "a process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of a prolongation or process extending from a cell, e.g. a flagellum or axon."
 +
It wouldn't surprise me if many kinds of bacteria found cell projection organization significant, because it is so important in the continued survival of the cell.
 +
The histidine biosynthetic process is "the chemical reactions and pathways resulting in the formation of histidine, 2-amino-3-(1H-imidazol-4-yl)propanoic acid."
 +
Histidine is yet another essential amino acid for humans, which further leads me to believe that this bacterium relies on many amino acids that are vital for human life.
 +
The glutamine family amino acid biosynthetic process is "the chemical reactions and pathways resulting in the formation of amino acids of the glutamine family, comprising arginine, glutamate, glutamine and proline."
 +
This is the biosynthetic process that goes along with the metabolic process described earlier.
  
 
=== Conclusion ===
 
=== Conclusion ===
  
* Write a paragraph that briefly summarizes and gives a scientific conclusion for the work that you did for part 1 and 2 this week.
+
After going through some of the steps that biologists go through for a microarray data analysis, I now have a better understanding of how DNA microarray data is analyzed and what conclusions one can draw from this analysis. By analyzing data for which the "answers" were already known, we were able to check our own hypotheses and calculations against those of Merrell et al. (2002). We first performed some statistical analysis on the data in order to prepare our spreadsheet for use in MAPPFinder. By using this program we were able to find the relevant gene ontology terms and see what these terms meant for the bacterium we were studying. Overall, this experience has taught me how to efficiently use microarray data to draw conclusions about an organism that would have otherwise been overlooked.
  
 
{{Template:Journal Template}}
 
{{Template:Journal Template}}

Latest revision as of 04:13, 27 October 2015

zip file with most of the required files

colored criterion xlsx file

updated spreadsheet

Normalized the log ratios for the set of slides in the experiment

To scale and center the data (between chip normalization) I performed the following operations:

  • Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
  • Went back to the "compiled_raw_data" worksheet, Selected All and Copy. Went to my new "scaled_centered" worksheet, clicked on the upper, left-hand cell (cell A1) and Pasted.
  • Inserted two rows in between the top row of headers and the first data row.
  • In cell A2, typed "Average" and in cell A3, typed "StdDev".
  • I computed the Average log ratio for each chip (each column of data). In cell B2, I typed the following equation:
=AVERAGE(B4:B5224)
and pressed "Enter". Excel computed the average value of the cells specified in the range given inside the parentheses. Instead of typing the cell designations, I clicked on the beginning cell, scrolled down to the bottom of the worksheet, and shift-clicked on the ending cell.
  • I then computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, I typed the following equation:
=STDEV(B4:B5224)
and pressed "Enter".
  • Excel then did some of the work for me. I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns. Excel automatically changed the equation to match the cell designations for those columns.
  • I had now computed the average and standard deviation of the log ratios for each chip. Next, I did the scaling and centering based on these values.
  • I copied the column headings for all of my data columns and pasted them to the right of the last data column so that I had a second set of headers above the blank columns of cells. I edited the names of the columns so that they read as: A1_scaled_centered, A2_scaled_centered, etc.
  • In cell N4, I typed the following equation:
=(B4-B$2)/B$3
In this case, I wanted the data in cell B4 to have the average subtracted from it (cell B2) and be divided by the standard deviation (cell B3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference that row in the equation, even though I pasted it for the entire column of 5221 genes.
  • I copied and pasted this equation into the entire column.
  • I then copied and pasted the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header.
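The scaling and centering steps above can be sketched outside of Excel as well. The following is a minimal Python illustration, not the procedure actually used for this assignment; the function name and the sample values are hypothetical, and it assumes one list of log ratios per chip.

```python
from statistics import mean, stdev

def scale_and_center(log_ratios):
    """Subtract the chip average and divide by the chip standard deviation."""
    avg = mean(log_ratios)   # Excel: =AVERAGE(B4:B5224)
    sd = stdev(log_ratios)   # Excel: =STDEV(B4:B5224), the sample standard deviation
    return [(x - avg) / sd for x in log_ratios]  # Excel: =(B4-B$2)/B$3

# Hypothetical log ratios for one chip (not values from the Vibrio dataset).
chip = [0.5, -1.0, 1.5, 0.0, -0.5]
scaled = scale_and_center(chip)
print([round(v, 3) for v in scaled])
```

After this transformation each chip has an average of 0 and a standard deviation of 1, which is what makes the chips comparable to one another.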

Performed statistical analysis on the ratios

I performed this step on the scaled and centered data I produced in the previous step.

  • Inserted a new worksheet and named it "statistics".
  • Went back to the "scaled_centered" worksheet and copied the first column ("ID").
  • Pasted the data into the first column of my "statistics" worksheet.
  • Went back to the "scaled_centered" worksheet and copied the columns that were designated "_scaled_centered".
  • Went to my new worksheet and clicked on the B1 cell. Selected "Paste Special" from the Edit menu. A window opened: clicked on the button for "Values" and clicked OK. This pasted the numerical result into my new worksheet instead of the equation which must make calculations on the fly.
  • Deleted Rows 2 and 3 where it said "Average" and "StdDev" so that my data rows with gene IDs were immediately below the header row 1.
  • Went to a new column on the right of my worksheet. Typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three columns.
  • Computed the average log fold change for the replicates for each patient by typing the equation:
=AVERAGE(B2:E2) into cell N2.  Copied this equation and pasted it into the rest of the column.  
  • Created the equation for patients B and C and pasted it into their respective columns.
  • Then I computed the average of the averages. Typed the header "Avg_LogFC_all" into the first cell in the next empty column. Created the equation that will compute the average of the three previous averages I calculated and pasted it into this entire column.
  • Inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step. Labeled the column "Tstat". This computed a T statistic that told me whether the scaled and centered average log ratio was significantly different than 0 (no change). Entered the equation:
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3))

Copied the equation and pasted it into all rows in that column.

  • Labeled the top cell in the next column "Pvalue". In the cell below the label, entered the equation:
=TDIST(ABS(R2),2,2)

Copied the equation and pasted it into all rows in that column.
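As a cross-check on the two Excel formulas above, here is a small Python sketch of the same calculation for one gene. It is only an illustration: the function names and the three sample values are hypothetical. The p value function relies on the closed form of the t distribution that holds for exactly 2 degrees of freedom (three patient averages minus one), so it is not a general replacement for TDIST.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values):
    """Excel: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) for three patient averages."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))

def two_tailed_p_df2(t):
    """Excel: =TDIST(ABS(R2),2,2). Closed form valid only for 2 degrees of freedom."""
    return 1.0 - abs(t) / sqrt(t * t + 2.0)

# Hypothetical average log fold changes for patients A, B, and C.
patient_averages = [1.0, 2.0, 3.0]
t = t_statistic(patient_averages)
p = two_tailed_p_df2(t)
print(round(t, 4), round(p, 4))
```

With only three values per gene, even a fairly large T statistic gives a modest p value, which helps explain why so few genes survive the corrections applied later.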

Calculated the Bonferroni p value Correction

  • Next I performed adjustments to the p value to correct for the multiple testing problem. Labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
  • Typed the equation =S2*5221 into the first cell below the first Bonferroni_Pvalue header. Upon completion of this single computation, copied the formula throughout the column.
  • Replaced all corrected p values that were greater than 1 with the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: =IF(T2>1,1,T2). Copied the formula throughout the column.
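The two Bonferroni columns above amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch, with a hypothetical function name and made-up p values:

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1."""
    n = len(p_values)  # 5221 in the worksheet; here, the length of the example list
    return [min(1.0, p * n) for p in p_values]

# Hypothetical unadjusted p values.
unadjusted = [0.01, 0.2, 0.5, 0.04]
print(bonferroni(unadjusted))
```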

Calculated the Benjamini & Hochberg p value Correction

  • Inserted a new worksheet named "B-H_Pvalue".
  • Copied and pasted the "ID" column from my previous worksheet into the first column of the new worksheet.
  • Inserted a new column on the very left and named it "MasterIndex". I created a numerical index of genes so that I can always sort them back into the same order.
    • Typed a "1" in cell A2 and a "2" in cell A3.
    • Selected both cells and filled the entire column with a series of numbers from 1 to 5221.
  • Copied my unadjusted p values from my previous worksheet and pasted it into Column C.
  • Selected all of columns A, B, and C. Sorted by ascending values on Column C. Clicked the sort button from A to Z on the toolbar and sorted by column C, smallest to largest.
  • Typed the header "Rank" in cell D1. I created a series of numbers in ascending order from 1 to 5221 in this column. Typed "1" into cell D2 and "2" into cell D3. Selected both cells D2 and D3 and filled the column with a series of numbers from 1 to 5221.
  • Then I calculated the Benjamini and Hochberg p value correction by typing B-H_Pvalue in cell E1. Typed the following formula in cell E2: =(C2*5221)/D2 and pressed enter. Copied that equation to the entire column.
  • Typed "B-H_Pvalue" into cell F1.
  • Typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter. Copied that equation to the entire column.
  • Selected columns A through F. Then I sorted them by my MasterIndex in Column A in ascending order.
  • Copied column F and pasted it into the next column on the right of my "statistics" sheet.
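The worksheet steps above (index, sort by p value, rank, multiply by 5221 and divide by rank, cap at 1, and sort back) can be sketched in Python as follows. The function name and sample values are hypothetical. Note that this mirrors the spreadsheet procedure exactly as written; many statistics packages additionally force the adjusted values to be non-decreasing with rank, which this sketch does not do.

```python
def bh_as_in_worksheet(p_values):
    """Benjamini and Hochberg adjustment following the worksheet steps."""
    n = len(p_values)
    # Sorting positions by p value plays the role of the sort on column C;
    # keeping the original index plays the role of MasterIndex.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Excel: =(C2*5221)/D2 followed by =IF(E2>1,1,E2)
        adjusted[i] = min(1.0, p_values[i] * n / rank)
    return adjusted  # already back in the original (MasterIndex) order

# Hypothetical unadjusted p values.
unadjusted = [0.04, 0.01, 0.03, 0.5]
print([round(v, 4) for v in bh_as_in_worksheet(unadjusted)])
```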

Prepared file for GenMAPP

  • Inserted a new worksheet and named it "forGenMAPP".
  • Went back to the "statistics" worksheet and Selected All and Copied.
  • Went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values button, and clicked OK. I then formatted this worksheet for import into GenMAPP as follows:
  • Selected Columns B through Q (all the fold changes). Selected the menu item Format > Cells. Under the number tab, selected 2 decimal places. Clicked OK.
  • Selected all the columns containing p values. Selected the menu item Format > Cells. Under the number tab, selected 4 decimal places. Clicked OK.
  • Deleted the left-most Bonferroni p value column, thus preserving the one that shows the result of my "if" statement.
  • Inserted a column to the right of the "ID" column. Typed the header "SystemCode" into the top cell of this column. Filled the entire column with the letter "N".
  • Selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. My new *.txt file is now ready for import into GenMAPP.
    • Uploaded both the .xls and .txt files that I had just created to my journal page in the class wiki.
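For comparison, the formatting and export steps above could be sketched in Python with the standard csv module. This is only an illustration under stated assumptions: the function name, the gene IDs, and the numbers are hypothetical, and an in-memory buffer stands in for the tab-delimited .txt file.

```python
import csv
import io

def to_genmapp_table(header, rows, p_columns):
    """Add a SystemCode column after ID and round values as in the worksheet."""
    out_header = [header[0], "SystemCode"] + list(header[1:])
    out_rows = []
    for row in rows:
        cells = [row[0], "N"]  # "N" is the system code used in the procedure above
        for name, value in zip(header[1:], row[1:]):
            digits = 4 if name in p_columns else 2  # p values: 4 decimals, others: 2
            cells.append(f"{value:.{digits}f}")
        out_rows.append(cells)
    return out_header, out_rows

# Hypothetical two-gene example; the IDs and numbers are placeholders.
header = ["ID", "Avg_LogFC_all", "Pvalue"]
rows = [["GENE_A", 0.123456, 0.0456789], ["GENE_B", -1.5, 0.2]]
out_header, out_rows = to_genmapp_table(header, rows, p_columns={"Pvalue"})

buffer = io.StringIO()  # stands in for the tab-delimited .txt file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(out_header)
writer.writerows(out_rows)
print(buffer.getvalue())
```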

Sanity Checked: Number of genes significantly changed

Before I moved on to the GenMAPP/MAPPFinder analysis, I performed a sanity check to make sure that I performed my data analysis correctly. I found out the number of genes that were significantly changed at various p value cut-offs and also compared my data analysis with the published results of Merrell et al. (2002).

  • Opened my spreadsheet and went to the "forGenMAPP" tab.
  • Clicked on cell A1 and selected the menu item Data > Filter > Autofilter.
  • Clicked on the drop-down arrow on my "Pvalue" column. Selected "Custom". In the window that appeared, I set a criterion that will filter my data so that the Pvalue has to be less than 0.05.
    • How many genes have p value < 0.05? and what is the percentage (out of 5221)?

948 genes for a percentage of ~18%

    • What about p < 0.01? and what is the percentage (out of 5221)?

235 genes for a percentage of ~5%

    • What about p < 0.001? and what is the percentage (out of 5221)?

24 genes for a percentage of ~0.5%

    • What about p < 0.0001? and what is the percentage (out of 5221)?

2 genes for a percentage of ~0.04%

  • When I use a p value cut-off of p < 0.05, what I am saying is that, if there were no real change, I would have seen a gene expression change that deviates this far from zero less than 5% of the time by chance.
  • I have just performed 5221 T tests for significance. Another way to state what I am seeing with p < 0.05 is that I would expect to see this magnitude of a gene expression change by chance in about 5% of my T tests, or 261 times. Since I have more than 261 genes that pass this cut-off, I know that some genes are significantly changed. However, I don't know which ones. To apply a more stringent criterion to my p values, I performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, I filtered my data to determine the following:
    • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?

0 genes for a percentage of ~0%

    • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?

0 genes for a percentage of ~0%

  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a movable confidence level. If I want to be very confident of our data, I would use a small p value cut-off. If I am OK with being less confident about a gene expression change and want to include more genes in my analysis, I can use a larger p value cut-off.
  • The "Avg_LogFC_all" tells me the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
    • Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? (and %)

352 genes for a percentage of ~7%

    • Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? (and %)

596 genes for a percentage of ~11%

    • What about an average log fold change of > 0.25 and p < 0.05? (and %)

339 genes for a percentage of ~6%

    • Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)

579 genes for a percentage of ~11%

  • For the GenMAPP analysis below, I will use the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut-off of p < 0.05 because I want to include several hundred genes in my analysis.
  • What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?

According to Merrell et al. (2002): "A two-class SAM analysis was conducted using the strain grown in vitro as class I, and each individual stool sample as class II. Genes with statistically significant changes in the level of expression—at least a twofold change—in each patient sample were chosen, and the derived data from individual stool samples were collapsed to identify genes that were differentially regulated in all three samples." This is different from our method in that they did not seem to use p-values to determine significance, where we used solely p-values.
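The counts, percentages, and cut-offs in this section can be double-checked with a few lines of Python. This is a rough sketch: the function name and the example rows are hypothetical, and the conversion of the 0.25 cut-off to a fold change assumes the log ratios are base 2.

```python
TOTAL_GENES = 5221  # number of genes in the worksheet

def count_passing(rows, p_cutoff=0.05, fc_above=None, fc_below=None):
    """Count (Avg_LogFC_all, Pvalue) rows that pass the Autofilter-style criteria."""
    count = 0
    for avg_log_fc, p_value in rows:
        if p_value >= p_cutoff:
            continue
        if fc_above is not None and not avg_log_fc > fc_above:
            continue
        if fc_below is not None and not avg_log_fc < fc_below:
            continue
        count += 1
    return count

# Hypothetical (Avg_LogFC_all, Pvalue) rows, not taken from the dataset.
rows = [(0.90, 0.03), (-0.40, 0.01), (0.10, 0.04), (1.20, 0.20), (-0.30, 0.60)]
print(count_passing(rows))                  # p < 0.05 only
print(count_passing(rows, fc_above=0.25))   # increased and p < 0.05
print(count_passing(rows, fc_below=-0.25))  # decreased and p < 0.05

expected_by_chance = 0.05 * TOTAL_GENES      # about 261 tests
percent_significant = 100.0 * 948 / TOTAL_GENES
fold_change_at_cutoff = 2 ** 0.25            # assumes base-2 log ratios
print(round(expected_by_chance), round(percent_significant, 1), round(fold_change_at_cutoff, 3))
```

Under the base-2 assumption, a log fold change of 0.25 corresponds to a ratio of about 1.19, which matches the "about 20%" figure quoted above.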

Sanity Checked: Compared individual genes with known data

  • Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?

VC0028: FC: 0.90, P VAL: 0.0708

VC0941: FC: -0.28, P VAL: 0.1636

VC0051: FC: 1.92, P VAL: 0.0139

VC0647: FC: -1.11, P VAL: 0.0003

VC0468: FC: -0.17, P VAL: 0.3350

VC2350: FC: -2.40, P VAL: 0.0130

VCA0583: FC: 1.06, P VAL: 0.1011

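In our analysis, VC0051, VC0647, and VC2350 are significantly changed at the unadjusted cut-off of p < 0.05, while VC0028, VC0941, VC0468, and VCA0583 are not. None of these genes remain significant after the Bonferroni or Benjamini and Hochberg corrections.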


GenMAPP Expression Dataset Manager Procedure

  • Launched the GenMAPP Program. Checked to make sure the correct Gene Database is loaded.
    • Looked in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that was loaded.
    • I used the 2010 version.
  • Selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window opened.
  • Selected New Dataset from the Expression Datasets menu. Selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appears.
    • The Vibrio data I have been working with does not have any text (character) data in it.
  • Allowed the Expression Dataset Manager to convert my data.
    • When the process was complete, the converted dataset was active in the Expression Dataset Manager window and the file was saved in the same folder the raw data file was in, named the same except with a .gex extension.
    • Lines that generated an error during the conversion of a raw data file were not added to the Expression Dataset. Instead, an exception file was created. The exception file was given the same name as my raw data file with .EX before the extension. The exception file contains all of my raw data, with the addition of a column named ~Error~. This column contains either error messages or, if the program finds no errors, a single space character.
      • Recorded the number of errors, in my case, 121 errors. For my journal assignment, I opened the .EX.txt file and used the Data > Filter > Autofilter function to determine what the errors were for the rows that were not converted:

Errors go here!!!!

      • I got a different number of errors than my partner, who is using a different version of the Vibrio cholerae Gene Database. He got 772 errors while I got 121 errors.

Why do you think that is?

  • I would think that this is the case because I used the more recent 2010 version, while he used the 2009 version, which would have been less extensive and apparently had less data.
      • Uploaded my exceptions file: EX.txt to my wiki page.
  • Customized the new Expression Dataset by creating new Color Sets which contain the instructions to GenMAPP for displaying data on MAPPs.
    • Created a Color Set by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determine how a gene object is colored on the MAPP. Entered a name in the Color Set Name field, i.e. LogFoldChange.
    • The Gene Value was the data displayed next to the gene box on a MAPP. Selected the column of data to be used as the Gene Value from the drop down list or select [none]. I used "Avg_LogFC_all" for the Vibrio dataset I just created.
    • Activated the Criteria Builder by clicking the New button.
    • Entered a name for the criterion in the Label in Legend field.
    • Chose a color for the criterion by left-clicking on the Color box. Chose a color from the Color window that appears and clicked OK.
    • Stated the criterion for color-coding a gene in the Criterion field.
      • A criterion is stated with relationships such as "this column greater than this value" or "that column less than or equal to that value". Individual relationships can be combined using as many ANDs and ORs as needed. A typical relationship is
[ColumnName] RelationalOperator Value
with the column name always enclosed in brackets and character values enclosed in single quotes. For example:
[Fold Change] >= 2
[p value] < 0.05
[Quality] = 'high'
This is the equivalent to queries that I performed on the command line when working with the PostgreSQL movie database. GenMAPP is using a graphical user interface (GUI) to help the user format the queries correctly. The easiest and safest way to create criteria is by choosing items from the Columns and Ops (operators) lists shown in the Criteria Builder. The Columns list contains all of the column headings from your Expression Dataset. To choose a column from the list, I clicked on the column heading. It appeared at the location of the cursor in the Criterion box. The Criteria Builder surrounded the column names with brackets.
The Ops (operators) list contains the relational operators that may be used in the criteria: equals ( = ) greater than ( > ), less than ( < ), greater than or equal to ( >= ), less than or equal to ( <= ), is not equal to ( <> ). To choose an operator from the list, click on the symbol. It will appear at the location of the insertion bar (cursor) in the Criterion box. The Criteria Builder automatically surrounds the operators with spaces.
The Ops list also contains the conjunctions AND and OR, which may be used to make compound criteria. For example:
[Fold Change] > 1.2 AND [p value] <= 0.05
Parentheses control the order of evaluation. Anything in parentheses is evaluated first. Parentheses may be nested. For example:
[Control Average] = 100 AND ([Exp1 Average] > 100 OR [Exp2 Average] > 100)
Column names may be used anywhere a value can, for example:
[Control Average] < [Experiment Average]
  • After completing a new criterion, I added the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button.
    • For the Vibrio dataset, I created two criteria. "Increased" is [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased" is [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05.
      • The buttons to the right of the list represent actions that can be performed on individual criteria. To modify a criterion label, color, or the criterion itself, I first selected the criterion in the list by left-clicking on it, and clicked the Edit button. This puts the selected criterion into the Criteria Builder to be modified. Clicked the Save button to save changes to the modified criterion; clicked the Add button to add it to the list as a separate criterion. To remove a criterion from the list, I left-clicked on the criterion to select it, and clicked on the Delete button. The order of Criteria in the list has significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examines the expression data for a particular gene object and applies the color for the first criterion in the list that is true. Therefore, it is imperative that when criteria overlap the user put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, I left-clicked on the criterion to select it and clicked the Move Up or Move Down buttons. No criteria met and Not found are always the last two positions in the list.
  • Saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. Changes made to a Color Set are not saved until I do this.
  • Exited the Expression Dataset Manager to view the Color Sets on a MAPP. Chose Exit from the Expression Dataset menu.
  • Uploaded my .gex file to my journal entry page for later retrieval.
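The way the ordered criteria list behaves (the first criterion that is true determines the color) can be illustrated with a short Python sketch. This is an analogy only, not GenMAPP's actual implementation; the function name and the example genes are hypothetical, while the two criteria are the ones I entered above.

```python
def first_matching_label(gene, criteria):
    """Return the label of the first criterion in the list that the gene meets."""
    for label, test in criteria:
        if test(gene):
            return label
    return "No criteria met"

# The two criteria from the Color Set above, in list order.
criteria = [
    ("Increased", lambda g: g["Avg_LogFC_all"] > 0.25 and g["Pvalue"] < 0.05),
    ("Decreased", lambda g: g["Avg_LogFC_all"] < -0.25 and g["Pvalue"] < 0.05),
]

# Hypothetical genes for illustration.
up = {"Avg_LogFC_all": 0.90, "Pvalue": 0.01}
down = {"Avg_LogFC_all": -1.10, "Pvalue": 0.001}
unchanged = {"Avg_LogFC_all": 0.90, "Pvalue": 0.07}
for gene in (up, down, unchanged):
    print(first_matching_label(gene, criteria))
```

Because these two criteria cannot both be true for the same gene, their order does not matter here; it would matter if the criteria overlapped.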

MAPPFinder Procedure

  • Launched the MAPPFinder program.
  • Made sure that the Gene Database for the correct species was loaded. The name of the Gene Database appeared at the bottom of the window.
  • Clicked on the button "Calculate New Results".
  • Clicked on "Find File", chose my Expression Dataset file, and clicked OK.
  • Chose the Color Set and Criteria with which to filter the data. Clicked on the "Increased" criterion in the right-hand box.
  • Checked the boxes next to "Gene Ontology" and "p value".
  • Clicked the "Browse" button and created a meaningful filename for my results.
  • Clicked "Run MAPPFinder". The analysis took several minutes.
  • When the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. Browsed through the tree to see my results.
  • To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".
    • List the top 10 Gene Ontology terms in your individual journal entry.
  1. branched chain family amino acid metabolic process
  2. branched chain family amino acid biosynthetic process
  3. IMP metabolic process
  4. IMP biosynthetic process
  5. purine ribonucleoside monophosphate metabolic process
  6. purine ribonucleoside monophosphate biosynthetic process
  7. purine nucleoside monophosphate biosynthetic process
  8. purine nucleoside monophosphate metabolic process
  9. 'de novo' IMP biosynthetic process
  10. arginine metabolic process
    • Compare your list with your partner who used a different version of the Gene Database. Are your terms the same or different? Why do you think that is? Record your answer in your individual journal entry.

Our lists show different top terms, and I think this is because we are using different versions of the same database. Over the course of a year, it is entirely possible that the genes added are associated with different terms from the gene ontology, which would change which terms rank highest.

  • One of the things you can do in MAPPFinder is to find the Gene Ontology term(s) with which a particular gene is associated. First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then I searched for the genes that were mentioned by Merrell et al. (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. Typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. Chose "OrderedLocusNames" from the drop-down menu to the right of the search field. Clicked on the GeneID Search button. The GO term(s) that are associated with that gene will be highlighted in blue. List the GO terms associated with each of those genes in your individual journal. (Note: they might not all be found.) Are they the same as your partner who is using a different Gene Database? Why or why not?
    • VC0028: BRANCHED CHAIN FAMILY AMINO ACID BIOSYNTHETIC PROCESS, CELLULAR AMINO ACID BIOSYNTHETIC PROCESS, METABOLIC PROCESS, METAL ION BINDING, IRON-SULFUR CLUSTER BINDING, 4 IRON, 4 SULFUR CLUSTER BINDING, CATALYTIC ACTIVITY, LYASE ACTIVITY, DIHYDROXY-ACID DEHYDRATASE ACTIVITY
    • VC0941: NONE
    • VC0869: GLUTAMINE METABOLIC PROCESS, PURINE NUCLEOTIDE BIOSYNTHETIC PROCESS, 'DE NOVO' IMP BIOSYNTHETIC PROCESS, CYTOPLASM, NUCLEOTIDE BINDING, ATP BINDING, CATALYTIC ACTIVITY, LIGASE ACTIVITY, PHOSPHORIBOSYLFORMYLGLYCINAMIDINE SYNTHASE ACTIVITY
    • VC0051: PURINE NUCLEOTIDE BIOSYNTHETIC PROCESS, 'DE NOVO' IMP BIOSYNTHETIC PROCESS, NUCLEOTIDE BINDING, ATP BINDING, CATALYTIC ACTIVITY, LYASE ACTIVITY, CARBOXY-LYASE ACTIVITY, PHOSPHORIBOSYLAMINOIMIDAZOLE CARBOXYLASE ACTIVITY
    • VC0647: MRNA CATABOLIC PROCESS, RNA PROCESSING, CYTOPLASM, MITOCHONDRION, RNA BINDING, 3'-5' EXORIBONUCLEASE ACTIVITY, TRANSFERASE ACTIVITY, NUCLEOTIDYLTRANSFERASE ACTIVITY, POLYRIBONUCLEOTIDE NUCLEOTIDYLTRANSFERASE ACTIVITY
    • VC0468: GLUTATHIONE BIOSYNTHETIC PROCESS, METAL ION BINDING, NUCLEOTIDE BINDING, ATP BINDING, CATALYTIC ACTIVITY, LIGASE ACTIVITY, GLUTATHIONE SYNTHASE ACTIVITY
    • VC2350: DEOXYRIBONUCLEOTIDE CATABOLIC PROCESS, METABOLIC PROCESS, CYTOPLASM, CATALYTIC ACTIVITY, DEOXYRIBOSE-PHOSPHATE ALDOLASE ACTIVITY
    • VCA0583: TRANSPORT, OUTER MEMBRANE-BOUNDED PERIPLASMIC SPACE, TRANSPORTER ACTIVITY
      • They are not the same as my partner's, probably because he is using the older version of the gene database, and therefore has fewer genes with which a GO term can be associated.
  • Clicked on one of the GO terms that were associated with one of the genes I looked up in the previous step. A MAPP will open listing all of the genes (as boxes) associated with that GO term. The genes named within the map are based on the UniProt identification system. To match the gene of interest to its identification, I went to the UniProt site and typed in my gene ID into the search bar. Moreover, the genes on the MAPP will be color-coded with the gene expression data from the microarray experiment. List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.

I clicked on the transporter activity GO term. I looked for Q9KM06 and its color was gray, which meant no criteria were met.

    • Double-clicked on the gene box. This will open a window called the "Backpage" for this gene. This page has links to pages for this gene in the public databases. Click on the links to find out the function of this gene and record your answer in your individual journal page.

According to the NCBI, the gene codes for a hypothetical protein.

    • "XXX" refers to the name I gave to my results file.
    • "CriterionX" refers to either "Criterion0" or "Criterion1". Since computers start counting at zero, "Criterion0" is the first criterion in the list I clicked on and "Criterion1" is the second criterion in the list I clicked on.
    • Upload your results file to your journal page.
  • Launched Microsoft Excel. Opened the copies of the .txt files in Excel. This showed me the same data that I saw in the MAPPFinder Browser, but in tabular form.
  • Looked at the top of the spreadsheet. There were rows of information that gave me the background information on how MAPPFinder made the calculations. Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.

The number of probes that met the [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05 criteria, the number of probes that met the filter linked to a UniProt ID, the number of genes that met the criterion linked to a GO term, the number of probes that linked to a UniProt ID, the number of genes linked to a GO term, and N & R values used to make the z score were all different. These numbers are different because MAPPFinder is using two different sets of data for its calculations, so the data shown in these files will be different as well.

  • I filtered this list to show the top GO terms represented in my data for the "Increased" criteria. I filtered my list down to about 20 terms. Clicked on a cell in the row of headers for the data. Then I went to the Data menu and clicked "Filter > Autofilter". Drop-down arrows appeared in the row of headers. I set these two filters:
Z Score (in column N) greater than 2
PermuteP (in column O) less than 0.05
I also used these two filters:
Number Changed (in column I) greater than or equal to 4 or 5 AND less than 100
Percent Changed (in column L) greater than or equal to 25-50%
  • Saved my changes to an Excel spreadsheet. Selected File > Save As and select Excel workbook (.xls) from the drop-down menu. My filter settings won’t be saved in a .txt file.
  • Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list? You can judge this by comparing your spreadsheet with the MAPPFinder browser. Highlight the terms that fit this relationship with the same color in your Excel spreadsheet. Upload your .xls file to your journal page.

Yes, most of the GO terms were related to each other, usually in pairs.
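The Autofilter settings used on the MAPPFinder results can be written as a single test per row. The Python sketch below is only an illustration: the function name and the example rows are hypothetical, the column names are the ones mentioned above, and the specific thresholds of 4 genes and 25% are assumed picks from the ranges given.

```python
def keep_term(row, min_changed=4, max_changed=100, min_percent=25.0):
    """Apply the four filters to one row of the MAPPFinder results table."""
    return (row["Z Score"] > 2
            and row["PermuteP"] < 0.05
            and min_changed <= row["Number Changed"] < max_changed
            and row["Percent Changed"] >= min_percent)

# Hypothetical rows; the numbers are made up for illustration.
results = [
    {"GO Name": "term A", "Z Score": 4.1, "PermuteP": 0.001, "Number Changed": 6, "Percent Changed": 60.0},
    {"GO Name": "term B", "Z Score": 2.5, "PermuteP": 0.030, "Number Changed": 2, "Percent Changed": 66.7},
    {"GO Name": "term C", "Z Score": 1.2, "PermuteP": 0.200, "Number Changed": 12, "Percent Changed": 30.0},
]
filtered = [row["GO Name"] for row in results if keep_term(row)]
print(filtered)
```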

  • Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae). You need to give a biological interpretation: what does each of these GO terms in your filtered list have to do with the pathogenicity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

According to the Gene Ontology Database, the branched chain family amino acid metabolic process is "the chemical reactions and pathways involving amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine." This may be significant with regard to the bacterium's pathogenicity because a branched carbon skeleton is involved, which may be important in entering tissue. The GO Database states that the branched chain family amino acid biosynthetic process is "the chemical reactions and pathways resulting in the formation of amino acids containing a branched carbon skeleton, comprising isoleucine, leucine and valine." This process differs from the former in that it is a biosynthetic process instead of a metabolic process, which simply means this term is used to designate the end product of the metabolic process. The IMP metabolic process is "the chemical reactions and pathways involving IMP, inosine monophosphate." Because IMP is known to be associated with autoimmune diseases and immunotherapy, it makes sense that we see IMP metabolic processes here, since one of the ways a bacterium's pathogenicity is determined is by its ability to immunosuppress the host. The IMP biosynthetic process is "the chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate." This biosynthetic process is included because the IMP metabolic process is involved. The purine ribonucleoside monophosphate metabolic process is "the chemical reactions and pathways involving purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar." This term's significance lies in its ties to purine, which is linked to complementary base pairing in both DNA and RNA. A bacterium's ability to pass on DNA and possibly hijack nutrients is vital to its pathogenicity.
The purine ribonucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of purine ribonucleoside monophosphate, a compound consisting of a purine base linked to a ribose sugar esterified with phosphate on the sugar." This biosynthetic process is involved because its metabolic process is included above. The purine nucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar." Once again, this process is related to nucleic acids, and is thus important in the life of a bacterium, whose sole purpose is to survive within an organism. The purine nucleoside monophosphate metabolic process is "the chemical reactions and pathways involving purine nucleoside monophosphate, a compound consisting of a purine base linked to a ribose or deoxyribose sugar esterified with phosphate on the sugar." This is the metabolic process that corresponds to the process described above. The 'de novo' IMP biosynthetic process is "the chemical reactions and pathways resulting in the formation of IMP, inosine monophosphate, by the stepwise assembly of a purine ring on ribose 5-phosphate." This process brings together two of the concepts brought up earlier: IMP, which is related to immunosuppression, and purine, the organic compound that is seen in base pairing. It seems to me that this is the process that enables the bacterium to change some parts of the host's DNA so that an immunosuppressant is created. The arginine metabolic process is "the chemical reactions and pathways involving arginine, 2-amino-5-(carbamimidamido)pentanoic acid." This process seems to be focused on arginine, a conditionally essential amino acid in humans.
One possible reason why the arginine metabolic process is relevant is that the bacterium may use the body's reliance on this amino acid against it. The ribonucleoside monophosphate metabolic process is "the chemical reactions and pathways involving a ribonucleoside monophosphate, a compound consisting of a nucleobase linked to a ribose sugar esterified with phosphate on the sugar." This is another process that is related to nucleic acids as well as base pairing. It seems to me that this bacterium is highly involved in the altering of DNA, though whether it alters its host's DNA or its own is unknown to me. The ribonucleoside monophosphate biosynthetic process is "the chemical reactions and pathways resulting in the formation of a ribonucleoside monophosphate, a compound consisting of a nucleobase linked to a ribose sugar esterified with phosphate on the sugar." This is the biosynthetic process that corresponds to the metabolic process described above. The glutamine family amino acid metabolic process is "the chemical reactions and pathways involving amino acids of the glutamine family, comprising arginine, glutamate, glutamine and proline." This process is related to the arginine metabolic process, and it also covers glutamine, another amino acid that is conditionally essential in humans. Glutamine is involved in many processes, including protein synthesis and cellular energy. This leads me to believe that this bacterium also depends on the processes for which glutamine is vital. The purine nucleotide biosynthetic process is "the chemical reactions and pathways resulting in the formation of a purine nucleotide, a compound consisting of nucleoside (a purine base linked to a deoxyribose or ribose sugar) esterified with a phosphate group at either the 3' or 5'-hydroxyl group of the sugar."
The purine nucleotide metabolic process is "the chemical reactions and pathways involving a purine nucleotide, a compound consisting of nucleoside (a purine base linked to a deoxyribose or ribose sugar) esterified with a phosphate group at either the 3' or 5'-hydroxyl group of the sugar." The purine nucleotide biosynthetic and metabolic processes operate at the nucleotide level. Since we have already talked about purine, it makes sense that these two processes are just as significant for this bacterium. The arginine biosynthetic process is "the chemical reactions and pathways resulting in the formation of arginine, 2-amino-5-(carbamimidamido)pentanoic acid." This process is directly related to the arginine metabolic process, and covers only the formation of arginine. Cell projection organization is "a process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of a prolongation or process extending from a cell, e.g. a flagellum or axon." It wouldn't surprise me if many kinds of bacteria found cell projection organization significant, because projections such as flagella are so important in the continued survival of the cell. The histidine biosynthetic process is "the chemical reactions and pathways resulting in the formation of histidine, 2-amino-3-(1H-imidazol-4-yl)propanoic acid." Histidine is an essential amino acid for humans, which further leads me to believe that this bacterium relies on many amino acids that are vital for human life. The glutamine family amino acid biosynthetic process is "the chemical reactions and pathways resulting in the formation of amino acids of the glutamine family, comprising arginine, glutamate, glutamine and proline." This is the biosynthetic process that goes along with the glutamine family amino acid metabolic process described earlier.
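Many of the terms above come in metabolic/biosynthetic pairs, which is easy to lose track of in prose. The following is a minimal Python sketch, not part of the GenMAPP/MAPPFinder workflow, that groups the term names discussed above by their shared stem. The list `TERMS` is typed in by hand from this discussion, and the helper name `pair_terms` is hypothetical. No GO IDs, gene counts, or p values are assumed.

```python
# GO term names copied by hand from the discussion above.
# Assumption: names only; no GO IDs or MAPPFinder statistics are used.
TERMS = [
    "branched chain family amino acid metabolic process",
    "branched chain family amino acid biosynthetic process",
    "IMP metabolic process",
    "IMP biosynthetic process",
    "purine ribonucleoside monophosphate metabolic process",
    "purine ribonucleoside monophosphate biosynthetic process",
    "purine nucleoside monophosphate biosynthetic process",
    "purine nucleoside monophosphate metabolic process",
    "'de novo' IMP biosynthetic process",
    "arginine metabolic process",
    "ribonucleoside monophosphate metabolic process",
    "ribonucleoside monophosphate biosynthetic process",
    "glutamine family amino acid metabolic process",
    "purine nucleotide biosynthetic process",
    "purine nucleotide metabolic process",
    "arginine biosynthetic process",
    "cell projection organization",
    "histidine biosynthetic process",
    "glutamine family amino acid biosynthetic process",
]

# Suffixes that mark the two kinds of term being paired up.
SUFFIXES = {
    " metabolic process": "metabolic",
    " biosynthetic process": "biosynthetic",
}


def pair_terms(term_names):
    """Group term names by stem (hypothetical helper).

    Returns a dict mapping each stem to the set of kinds
    ("metabolic", "biosynthetic") that appeared for it. Terms with
    neither suffix are kept under their full name with an empty set.
    """
    groups = {}
    for name in term_names:
        for suffix, kind in SUFFIXES.items():
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                groups.setdefault(stem, set()).add(kind)
                break
        else:
            # Neither suffix matched, e.g. "cell projection organization".
            groups.setdefault(name, set())
    return groups


groups = pair_terms(TERMS)
for stem, kinds in groups.items():
    label = " + ".join(sorted(kinds)) if kinds else "other"
    print(f"{stem}: {label}")
```

Stems that show both kinds are the ones where the biosynthetic term is the more specific companion of a metabolic term already in the list. Stems that show only one kind, or "other", are the ones I interpreted on their own.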

Conclusion

After going through some of the steps that biologists follow in a microarray data analysis, I now have a better understanding of how DNA microarray data is analyzed and what conclusions one can draw from this analysis. By analyzing data for which the "answers" were already known, we were able to check our own hypotheses and calculations against those of Merrell et al. (2002). We first performed some statistical analysis on the data in order to prepare our spreadsheet for use in MAPPFinder. By using this program we were able to find the relevant gene ontology terms and see what these terms meant for the bacterium we were studying. Overall, this experience has taught me how to use microarray data efficiently to draw conclusions about an organism that would otherwise have been overlooked.

Josh Kuroda's page
