Difference between revisions of "Troque Week 8"
(Starting the Individual Journal) |
m (Changed "Media" into "Image") |
||
(41 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
{{Template:Troque}} | {{Template:Troque}} | ||
− | == | + | == Sources == |
− | + | * The methods described in Part 1 of this page are taken from this [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae openwetware page]. | |
+ | * The methods for Part 2 have been adapted from this [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols page]. | ||
− | * | + | == Files created == |
− | * | + | * '''The Excel file created on Thursday (October 15, 2015) can be downloaded [[Media:Merrell Compiled Raw Data Vibrio TR 20151015.xls | here]].''' |
− | * | + | * '''A more updated Excel file with the B-H p-value correction can be downloaded [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.xls | here]].''' |
− | * In cell A2, | + | * '''The tab-delimited txt file can be seen [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.txt | here]].'''
− | * | + | |
+ | == Things to note == | ||
+ | * Always save your work when you have a chance. | ||
+ | * For this assignment, my partner was Erich Yanoschik. | ||
+ | * For Part 2, I worked on the 2009 data while Erich worked on the 2010 data. | ||
+ | * On Thursday (October 22), we were assigned to analyze decreased expressions of our data using GenMAPP. | ||
+ | * I met with Erich on Monday (October 26) to work on this assignment. | ||
+ | |||
+ | == Part 1 == | ||
+ | |||
+ | === Normalize the log ratios for the set of slides in the experiment === | ||
+ | To scale and center the data (between chip normalization) I performed the following operations: | ||
+ | |||
+ | * Inserted a new Worksheet into my Excel file, and named it "scaled_centered". | ||
+ | * Selected all and copied and pasted everything from the "compiled_raw_data" worksheet into this new "scaled_centered". | ||
+ | * Inserted two rows in between the top row of headers and the first data row. | ||
+ | * In cell A2, typed "Average" and in cell A3, typed "StDev". | ||
+ | * I then computed the Average log ratio for each chip (each column of data). In cell B2, I typed the following equation: | ||
=AVERAGE(B4:B5224) | =AVERAGE(B4:B5224) | ||
− | Note: We tried to do a keyboard shortcut using CTRL + Shift + Down buttons, but row 363 has a missing data so we had to manually type in "B5224" for the end of the data. | + | (Note: We tried the CTRL + Shift + Down keyboard shortcut, but row 363 has a missing value and the selection stops at the first blank cell, so we had to type "B5224" manually for the end of the data.) |
+ | : and pressed "Enter". Excel then computed the average value of the cells specified in the range given inside the parentheses. Another approach for selecting all of the cells we needed was, instead of typing the cell designations, we could have clicked on the beginning cell, scrolled down to the bottom of the worksheet, and shift-clicked on the ending cell. | ||
+ | * I then computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, I typed the following equation: | ||
+ | =STDEV(B4:B5224) | ||
+ | : and pressed "Enter". | ||
+ | * I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns. Excel automatically changed the equation to match the cell designations for those columns. | ||
+ | * Now that the average and standard deviations of the log ratios have been computed for each chip, it's time for the scaling and centering based on these values. | ||
+ | * I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. I edited the names of the columns so that they now read: A1_scaled_centered, A2_scaled_centered, etc. | ||
+ | * In cell N4, I typed the following equation: | ||
+ | =(B4-B$2)/B$3 | ||
+ | : In this case, I wanted the value in cell B4 to have the average (cell B2) subtracted from it and the result divided by the standard deviation (cell B3). I put dollar signs in front of the "2" and "3" to tell Excel to always reference those rows, even when the equation is pasted down the entire column of 5221 genes. '''''This was important because only the first reference should change (B4 to B5, and so on) when the equation is dragged down the column; B$2 must not become B3, and B$3 must not become B4. Cells B2 and B3 hold the average and standard deviation for that replicate, and the cells below them hold the actual data, so fixing those two references makes every row use the same average and standard deviation.''''' | ||
+ | * I copied and pasted this equation into the entire column. One easy way to do this is to click on the original cell with the equation and position the cursor at the bottom right corner. The cursor then changes into a thin black plus sign (not a chubby white one). Double-clicking at that point copies the formula down the entire column of genes. | ||
+ | * I then copied and pasted the scaling and centering equation into each of the columns of data with the "_scaled_centered" column header. | ||
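The scaling and centering steps above can be sketched in Python for a single chip (one column). This is only an illustration of the same arithmetic, not part of the original protocol: the function name <code>scale_and_center</code> and the sample values are made up, and leaving a blank cell blank is a choice made here (Excel arithmetic would treat an empty cell as 0).

```python
from statistics import mean, stdev

def scale_and_center(log_ratios):
    """Return the scaled and centered values for one chip (one column)."""
    present = [x for x in log_ratios if x is not None]
    avg = mean(present)   # Excel: =AVERAGE(B4:B5224)
    sd = stdev(present)   # Excel: =STDEV(B4:B5224), the sample standard deviation
    # Excel: =(B4-B$2)/B$3, filled down the column
    return [None if x is None else (x - avg) / sd for x in log_ratios]

# Made-up log ratios; None stands for a blank cell such as the one in row 363.
chip_a1 = [0.5, -1.0, None, 1.5, 0.0]
scaled = scale_and_center(chip_a1)
```

After this step the non-missing values of each chip have an average of 0 and a standard deviation of 1, which is what makes the chips comparable.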
+ | |||
+ | === Perform statistical analysis on the ratios === | ||
+ | |||
+ | This step uses the scaled and centered data produced in the previous step. The following operations are what I executed: | ||
+ | |||
+ | * I inserted a new worksheet and named it "statistics" and copied the first column ("ID") from the "scaled_centered" worksheet into this new worksheet. | ||
+ | * I pasted the data into the first column of the new "statistics" worksheet. | ||
+ | * I went back to the "scaled_centered" worksheet and copied the columns that are designated "_scaled_centered". | ||
+ | * I then went to my new worksheet and clicked on the B1 cell and selected "Paste Special" from the Edit menu. A window opened; I clicked on the radio button for "Values" and clicked OK. This pasted the numerical result into my new worksheet instead of the equation which must make calculations on the fly. | ||
+ | * I then deleted Rows 2 and 3 where it says "Average" and "StDev" so that the data rows with gene IDs are immediately below the header row 1. | ||
+ | * Next, I went to a new column on the right of my worksheet and typed the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cells of the next three columns. | ||
+ | * To compute the average log fold change for the replicates of each patient, I typed the equation: | ||
+ | =AVERAGE(B2:E2) | ||
+ | : into cell N2. I copied this equation and pasted it into the rest of the column. | ||
+ | * I created the equation for patients B and C as well and pasted it into their respective columns. | ||
+ | * I then needed to compute the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages I calculated and pasted it into this entire column. | ||
+ | * I inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step and labeled the column "Tstat". This will compute a T statistic that tells whether the scaled and centered average log ratio is significantly different than 0 (no change). Then I entered the equation: | ||
+ | =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates)) | ||
+ | : (NOTE: in this case the number of replicates is 3.) Next, I copied the equation and pasted it into all rows in that column. | ||
+ | * I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation: | ||
+ | =TDIST(ABS(R2),degrees of freedom,2) | ||
+ | The number of degrees of freedom is the number of replicates minus one, so in our case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column. | ||
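The Tstat and Pvalue formulas above can be checked with a short Python sketch. It assumes exactly three replicates (2 degrees of freedom), for which the two-tailed p value of the Student t distribution has the closed form 1 - |t| / sqrt(t^2 + 2); for any other number of replicates a statistics library would be needed instead. The function names and sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def t_stat(values):
    """Excel: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))"""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))

def two_tailed_p_df2(t):
    """Excel: =TDIST(ABS(R2),2,2), valid only for 2 degrees of freedom."""
    return 1 - abs(t) / sqrt(t * t + 2)

# Made-up average log fold changes for patients A, B, and C.
patient_avgs = [-1.2, -0.9, -1.5]
t = t_stat(patient_avgs)
p = two_tailed_p_df2(t)
```

With only 2 degrees of freedom the T statistic has to be larger than about 4.3 in absolute value before the p value drops below 0.05.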
+ | |||
+ | ==== Calculate the Bonferroni p value Correction ==== | ||
+ | |||
+ | * Before doing the following, I selected all of the first row and clicked on "Sort & Filter" -> "Filter" for the sanity check portion of the assignment. On the dropdown button for the Pvalue header, I went to "Number Filters", then selected "Less Than" and entered "0.05" for the text box next to "is less than". In the bottom left corner of Excel, I got 948 results. | ||
+ | * Then, I performed adjustments to the p value to correct for the [https://xkcd.com/882/ multiple testing problem]. I went ahead and labeled the next two columns to the right with the same label, Bonferroni_Pvalue. | ||
+ | * The equation for this is <code>=(Pvalue)*5221</code> (in this case, the Pvalue is cell S2, and 5221 is the number of genes tested). After entering this single computation, I used the fill-down trick to copy the formula throughout the column. | ||
+ | * Then I replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header: <code>=IF(T2>1,1,T2)</code>. I also used the trick to copy the formula throughout this column. | ||
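The two Bonferroni columns above amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch follows; the function name and the four sample p values are made up, and in the worksheet the number of tests would be 5221.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1."""
    m = len(p_values)  # 5221 in the worksheet
    # Excel: =S2*5221 followed by =IF(T2>1,1,T2)
    return [min(p * m, 1.0) for p in p_values]

corrected = bonferroni([0.0001, 0.02, 0.5, 0.9])
```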
+ | |||
+ | '' '''Checkpoint: The Excel file created from doing the procedures above can be located [[Media:Merrell Compiled Raw Data Vibrio TR 20151015.xls | here]].''' '' | ||
+ | |||
+ | ==== Calculate the Benjamini & Hochberg p value Correction ==== | ||
+ | |||
+ | * For this part, I inserted yet another worksheet and named it "B-H_Pvalue". | ||
+ | * I copied and pasted the "ID" column from my previous worksheet into the first column of this new worksheet. | ||
+ | * I inserted a new column on the very left and named it "MasterIndex". I needed to create a numerical index of genes so that I can always sort them back into the same order. | ||
+ | ** I did this by typing a "1" in cell A2 and a "2" in cell A3 and then using the fill-down trick for all the remaining rows: | ||
+ | ** I selected both cells with "1" and "2" and hovered my mouse over the bottom-right corner of the selection until it makes a thin black + sign. Double-clicking on the + sign would then fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray). | ||
+ | * For the following, I used Paste special > Paste values so that the values (instead of references to the other columns) are pasted. I copied the unadjusted p values from my previous worksheet and pasted it into Column C. | ||
+ | * I then selected all of columns A, B, and C and clicked the A-to-Z sort button on the toolbar. In the window that appeared, I chose to sort by Column C, smallest to largest. | ||
+ | * Next, I typed the header "Rank" in cell D1. This column holds the p value rank, smallest to largest, as a series of numbers in ascending order from 1 to 5221. As with the "MasterIndex", I typed "1" into cell D2 and "2" into cell D3, selected both cells D2 and D3, and double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221. | ||
+ | * Now I could calculate the Benjamini and Hochberg p value correction. I typed B-H_Pvalue in cell E1. I also entered the following formula in cell E2: <code>=(C2*5221)/D2</code>, which I then copied to the entire column. | ||
+ | * I also typed "B-H_Pvalue" into cell F1. | ||
+ | * With this, I typed the following formula into cell F2: <code>=IF(E2>1,1,E2)</code> and pressed enter. I copied that equation to the entire column. | ||
+ | * I selected columns A through F. I then sorted them by my MasterIndex in Column A in ascending order. | ||
+ | * I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet. | ||
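The worksheet steps above (sort by p value, rank, multiply by the number of genes, divide by the rank, cap at 1, and sort back by MasterIndex) can be sketched in Python as follows. The function name and sample p values are made up. Note that the standard Benjamini and Hochberg adjustment also takes a running minimum from the largest rank downward so that adjusted values never decrease with rank; this sketch leaves that step out because the worksheet does not include it.

```python
def bh_as_in_sheet(p_values):
    """Benjamini & Hochberg correction as computed in the B-H_Pvalue worksheet."""
    m = len(p_values)  # 5221 in the worksheet
    # Sorting the indices plays the role of sorting on Column C while
    # keeping the MasterIndex so the original order can be restored.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        # Excel: =(C2*5221)/D2 followed by =IF(E2>1,1,E2)
        adjusted[i] = min(p_values[i] * m / rank, 1.0)
    return adjusted

adjusted = bh_as_in_sheet([0.04, 0.001, 0.03, 0.5])
```

In this made-up example the gene with p = 0.03 ends up with a larger corrected value than the gene with p = 0.04, which is the kind of reversal the running-minimum step in the standard procedure removes.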
+ | |||
+ | ==== Prepare file for GenMAPP ==== | ||
+ | |||
+ | * For the actual worksheet to feed into the GenMAPP program, I inserted a new worksheet and named it "forGenMAPP". | ||
+ | * I then went back to the "statistics" worksheet and chose Select All and Copy. | ||
+ | * In the new sheet, I clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK. The following steps are to now format this worksheet for import into GenMAPP. | ||
+ | * I selected Columns B through Q (all the fold changes), and selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places and clicked OK. | ||
+ | * Next, I selected all the columns containing the p values. I selected the menu item Format > Cells and, under the Number tab, selected 4 decimal places. | ||
+ | * Since it was no longer needed, I deleted the left-most Bonferroni p value column, preserving the one that shows the result of my "if" statement. | ||
+ | * I then inserted a column to the right of the "ID" column. I named the header at the top cell of this column "SystemCode" and filled the entire column (each cell) with the letter "N" using the trick to copy values to the rest of the column. | ||
+ | * Then, I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. (This will be the file type that I fed into GenMAPP). Excel made me click through a couple of warnings because it doesn't like the user going all independent and choosing a different file type than the native .xls so I just clicked OK in all of them. The new *.txt file is now ready for import into GenMAPP. | ||
+ | ** I uploaded both the .xls and .txt files (seen at the checkpoint below) that I have just created into my journal page in the class wiki and added my initials to differentiate it from the other students' files so that they don't overwrite it. | ||
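The formatting steps above can be sketched in Python as writing a tab-delimited text file with a "SystemCode" column of "N" after the ID, fold changes at 2 decimal places, and p values at 4 decimal places. This is only an illustration: the function name is hypothetical, only two of the data columns are shown, and the numbers in the sample row are made up.

```python
import csv
import io

def to_genmapp_text(rows):
    """Build tab-delimited text with a SystemCode column, as in "forGenMAPP"."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "SystemCode", "Avg_LogFC_all", "Pvalue"])
    for gene_id, log_fc, p_value in rows:
        # "N" in every row; 2 decimals for fold changes, 4 for p values
        writer.writerow([gene_id, "N", f"{log_fc:.2f}", f"{p_value:.4f}"])
    return out.getvalue()

# Made-up values for one gene ID.
text = to_genmapp_text([("VC0028", -1.234, 0.01234)])
```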
+ | |||
+ | '' '''Checkpoint: The files created can be found [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.xls | here]] (Excel) and [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.txt | here]] (txt).''' '' | ||
+ | |||
+ | Here is the Sanity check table I created: | ||
+ | * [[Image:Sanity checks TR.jpg]] | ||
+ | |||
+ | == Part 2 == | ||
+ | * ''' I will be working on the 2009 data. ''' | ||
+ | |||
+ | === Files created from Part 2 === | ||
+ | Files created (a zipped folder with these files can be found near the end of this wiki page at the top of the list of assignments): | ||
+ | * [[Media:Merrell Compiled Raw Data Vibrio TR 20151022.gex | Updated gex file]] | ||
+ | * [[Media:Merrell Compiled Raw Data Vibrio TR 20151022-Criterion0-GO.txt | GO txt file]] | ||
+ | * [[Media:Merrell Compiled Raw Data Vibrio TR 20151022-Criterion0-GO.xlsx | GO Excel file]] | ||
+ | * [[Media:Merrell Compiled Raw Data Vibrio TR 20151022.gmf | GMF file]] | ||
+ | * [[Media:3’-5’-exoribonuclease activity TR.mapp | VC0647 - exoribonuclease result]] | ||
+ | |||
+ | Each time I launched GenMAPP, I had to make sure that the correct Gene Database (.gdb) is loaded. The process for doing this is as follows: | ||
+ | * Look in the lower left-hand corner of the window to see which Gene Database has been selected. | ||
+ | * If I needed to change the Gene Database, I selected Data > Choose Gene Database. Then I navigated to the directory C:\GenMAPP 2 Data\Gene Databases and chose the correct one for my species. | ||
+ | * For this assignment, I had to download the appropriate ''Vibrio cholerae'' Gene Database. | ||
+ | ** Half of the class used the Vc-Std_External_20090622.gdb Gene Database that was initially created by the Fall 2008 Biological Databases class. | ||
+ | *** I downloaded this Gene Database from [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020090622/Vc-Std_External_20090622.zip/download '''''this link''''' on the XMLPipeDB SourceForge Download page]. | ||
+ | ** '''Mine happened to be the 2009 Gene Database, and my partner worked on the 2010 one, the database that Drs. Dahlquist and Dionisio''' | ||
+ | * I clicked on the link for the Gene Database to which I have been assigned, downloaded the file, and saved it into the folder C:\GenMAPP 2 Data\Gene Databases, and extracted it. | ||
+ | |||
+ | '' '''Checkpoint: There were 772 errors in the 2009 data file. The gex file is located [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.gex | here]]. The EX.txt file can be found [[Media:Merrell Compiled Raw Data Vibrio TR 20151019.EX.txt | here]].''' '' | ||
+ | |||
+ | [[Image:GenMAPP errors.jpg]] | ||
+ | |||
+ | === GenMAPP Expression Dataset Manager Procedure === | ||
+ | |||
+ | * I launched the GenMAPP Program and checked to make sure the correct Gene Database is loaded. | ||
+ | ** This can be done by looking in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that is loaded. | ||
+ | * Next, I selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list and waited for the Expression Dataset Manager window to open. | ||
+ | * Then, I selected New Dataset from the Expression Datasets menu, selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appears. | ||
+ | * The Data Type Specification window then appeared. GenMAPP expected that I provide numerical data. If any of the columns had text (character) data, I would check the box next to the field (column) name. | ||
+ | ** ''The Vibrio data we have been working with does not have any text (character) data in it.'' | ||
+ | * Then, I allowed the Expression Dataset Manager to convert my data. | ||
+ | ** This probably only took a couple of seconds to 1 minute. When the process completed, the converted dataset was active in the Expression Dataset Manager window and the file saved in the same folder the raw data file was in, named the same except with a .gex extension; for example, MyExperiment.gex. | ||
+ | ** Here is an image of what the loading screen should look like: | ||
+ | ** [[Image:GenMAPP loading.jpg]] | ||
+ | ** A message appeared saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generated an error during the conversion of a raw data file were not added to the Expression Dataset. Instead, an exception file was created. The exception file is given the same name as my raw data file with .EX before the extension (e.g., MyExperiment.EX.txt). The exception file contained all of my raw data, with the addition of a column named ~Error~. This column contained either error messages or, if the program found no errors, a single space character. | ||
+ | *** '''The number of errors came out to be 772. After opening the EX txt file, I discovered that all 772 errors had the same cause: "Gene not found in OrderedLocusNames or any related system."''' | ||
+ | *** '''Between my partner and me, I had the larger number of errors: his file had only 121 errors while mine had 772. I think this is because the 2010 database may have been updated to include genes that are missing from the 2009 database.''' | ||
+ | *** '''I then uploaded my exceptions file, <code>EX.txt</code>, to my wiki page.''' | ||
+ | * I customized the new Expression Dataset by creating new Color Sets which contained the instructions to GenMAPP for displaying data on MAPPs. | ||
+ | ** Color Sets contain the instructions to GenMAPP for displaying data from an Expression Dataset on MAPPs. I created a Color Set by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determined how a gene object is colored on the MAPP. I entered a name in the Color Set Name field that is 20 characters or fewer. | ||
+ | ** The Gene Value is the data displayed next to the gene box on a MAPP. I selected the column of data used as the Gene Value from the drop down list or selected [none]. I used "Avg_LogFC_all" for the Vibrio dataset I just created. | ||
+ | ** I activated the Criteria Builder by clicking the New button. | ||
+ | ** I then entered a name for the criterion in the Label in Legend field. | ||
+ | ** Next, I chose a color for the criterion by left-clicking on the Color box. I chose green for decreased and red for increased expression. | ||
+ | ** I also stated the criterion for color-coding a gene in the Criterion field. | ||
+ | * After completing a new criterion, I added the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button. | ||
+ | ** For the Vibrio dataset, I created two criteria. "Increased" was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased" was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05. | ||
+ | *** The buttons to the right of the list represented actions that can be performed on individual criteria. I modified a criterion label, color, or the criterion itself, by first selecting the criterion in the list by left-clicking on it, and then clicking the Edit button. This put the selected criterion into the Criteria Builder to be modified. I then clicked the Save button to save changes to the modified criterion; I clicked the Add button to add it to the list as a separate criterion. The order of Criteria in the list has significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examines the expression data for a particular gene object and applies the color for the first criterion in the list that is true. Therefore, it is imperative that when criteria overlap the user put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, left-click on the criterion to select it and then click the Move Up or Move Down buttons. No criteria met and Not found are always the last two positions in the list. | ||
+ | * I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. | ||
+ | * I then exited the Expression Dataset Manager to view the Color Sets on a MAPP by clicking the close box in the upper right hand corner of the window. | ||
+ | * This is the result of entering the criterion: | ||
+ | [[Image:GenMAPP creating criterion.jpg]] | ||
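The "Increased" and "Decreased" criteria described above, together with the rule that the first true criterion in the list is the one applied, can be sketched in Python. This is not GenMAPP code: the <code>CRITERIA</code> list, the <code>classify</code> function, and the dictionary layout are assumptions made for illustration.

```python
# Criteria in list order; the first one that is true is the one applied.
CRITERIA = [
    ("Increased", lambda g: g["Avg_LogFC_all"] > 0.25 and g["Pvalue"] < 0.05),
    ("Decreased", lambda g: g["Avg_LogFC_all"] < -0.25 and g["Pvalue"] < 0.05),
]

def classify(gene):
    """Return the label of the first criterion the gene satisfies."""
    for label, is_met in CRITERIA:
        if is_met(gene):
            return label
    return "No criteria met"

# Made-up gene values.
label = classify({"Avg_LogFC_all": -0.8, "Pvalue": 0.01})
```

These two criteria cannot both be true for the same gene, so their order does not matter here; order only matters when criteria overlap.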
+ | |||
+ | === MAPPFinder Procedure === | ||
+ | |||
+ | * I launched the MAPPFinder program. | ||
+ | * I made sure that the Gene Database for the correct species is loaded. The name of the Gene Database appears at the bottom of the window. If this is not the right one, go to File > Choose Gene Database and choose the correct one. (The Gene Databases are stored in the folder C:\GenMAPP 2 Data\Gene Databases\.) | ||
+ | * I clicked on the button "Calculate New Results". | ||
+ | * [[Image:MAPPFinder menu.jpg]] | ||
+ | * I clicked on "Find File". | ||
+ | ** MAPPFinder found it for me already. | ||
+ | * I chose the Color Set and Criteria with which to filter the data and clicked on the "Decreased" criterion in the right-hand box. | ||
+ | * I checked the boxes next to "Gene Ontology" and "p value". | ||
+ | ** [[Image:MAPPFinder calculate results window.jpg]] | ||
+ | * I then clicked the "Browse" button and created a meaningful filename for my results. | ||
+ | * I also clicked "Run MAPPFinder". The analysis took only a couple of minutes. | ||
+ | * When the results have been calculated, a Gene Ontology browser opened showing the results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I then browsed through the tree to see my results. | ||
+ | * To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List". | ||
+ | ** '''The top 10 terms that I got were:''' | ||
+ | **# protein folding | ||
+ | **# aromatic amino acid family biosynthetic process | ||
+ | **# chorismate metabolic process | ||
+ | **# unfolded protein binding | ||
+ | **# cytoplasm | ||
+ | **# membrane | ||
+ | **# protein-N(PI)-phosphohistidine-sugar phosphotransferase activity | ||
+ | **# phosphoenolpyruvate-dependent sugar phosphotransferase system | ||
+ | **# zinc ion binding | ||
+ | **# intracellular part | ||
+ | ** '''When comparing with my partner, our top 10 were almost completely different except for 2: cytoplasm, and protein folding. I think this is the case because of the updates that could have been done to the newer database.''' | ||
+ | * First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then I searched for the genes mentioned by Merrell et al. (2002): VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. For each gene, I typed its identifier into the MAPPFinder browser gene ID search field, chose "OrderedLocusNames" from the drop-down menu to the right of the search field, and clicked on the GeneID Search button. The GO term(s) associated with that gene were highlighted in blue. These are the GO terms associated with each of the genes: | ||
+ | ** VC0028: Not found | ||
+ | ** VC0941: Not found | ||
+ | ** VC0869: Not found | ||
+ | ** VC0051: Not found | ||
+ | ** VC0647: transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity, 3'-5'-exoribonuclease activity, RNA binding, cytoplasm, RNA processing, and mRNA catabolic process. | ||
+ | ** VC0468: Not found | ||
+ | ** VC2350: Not found | ||
+ | ** VCA0583: transport, outer membrane-bounded periplasmic space, and transporter activity. | ||
+ | * The results that I got were not the same as my partner's. I think this is because the 2010 database is more updated, and thus, has more genes that would match. | ||
+ | * I clicked on one of the GO terms associated with one of the genes I looked up in the previous step. A MAPP opened listing all of the genes (as boxes) associated with that GO term. The genes named within the MAPP are based on the UniProt identification system. To match a gene of interest to its identifier, go to the [http://www.uniprot.org/ UniProt site] and type the gene ID into the search bar. The genes on the MAPP are also color-coded with the gene expression data from the microarray experiment. | ||
+ | **'''I clicked on the GO term 3'-5'-exoribonuclease activity, which is associated with the gene VC0647. The expression of the gene was significantly decreased, since PNP_VIBCH was colored green in the MAPP window.''' | ||
+ | ** I then double-clicked on the gene box. This opened an Internet Explorer window called the "Backpage" for this gene. This page has links to pages for this gene in the public databases. '''This gene is involved in mRNA degradation. It catalyzes the phosphorolysis of single-stranded polyribonucleotides processively in the 3'- to 5'-direction.''' | ||
+ | ** The MAPP that was created was stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO. '''The mapp file can be located [[Media:3’-5’-exoribonuclease activity TR.mapp | here]].''' | ||
+ | * In Windows, I made a copy of my results (XXX-CriterionX-GO.txt) file. | ||
+ | ** "XXX" refers to the name I gave to my results file. | ||
+ | ** In my case, "Criterion0" is the criterion for decreased expression and "Criterion1" is for increased since I started out doing the decreased expression. | ||
+ | ** '''The criterion files can be accessed [[Media:Merrell Compiled Raw Data Vibrio TR 20151022-Criterion0-GO.txt | here (decreased)]] and [[Media:Merrell Compiled Raw Data Vibrio TR 20151019-Criterion1-GO.txt | here (increased)]].''' | ||
+ | * I launched Microsoft Excel and opened the copies of the .txt files in Excel. This showed the same data that was in the MAPPFinder Browser, but in tabular form. | ||
+ | * I then looked at the top of the spreadsheet. There are rows of information that gave the background information on how MAPPFinder made the calculations. '''I compared this information with my partner who used a different version of the Vibrio Gene Database.''' | ||
+ | ** Here are our results (The top is mine (decreased) and the bottom is Erich's): | ||
+ | *** [[Image:BioDB TR 20151027.jpg | Trixie's results]] | ||
+ | *** [[Image:BioDB EY20152710.png | Erich's results]] | ||
+ | *** All of the numbers were different except for the 5221 probes in the dataset (obviously, since we both had the same number of genes to begin with). They are different because we used databases from different years. | ||
+ | * I filtered this list to show the top GO terms represented in my data for both the "Increased" and "Decreased" criteria. I actually managed to filter it down to exactly 20 terms. I used the custom filter by clicking on the drop-down arrow for the column I wished to filter and chose "(Custom…)". A window opened giving me choices on how I wanted to filter. I set these two filters: | ||
+ | Z Score (in column N) greater than 2 | ||
+ | PermuteP (in column O) less than 0.05 | ||
+ | |||
+ | :I used these filters to narrow down the results to just 20 (decreased): | ||
+ | |||
+ | Number Changed (in column I) greater than or equal to 4 AND less than 100 | ||
+ | Percent Changed (in column L) greater than or equal to 37% | ||
+ | |||
+ | :I used these filters to narrow down the results to just 21 (increased): | ||
+ | Percent Changed (in column L) greater than or equal to 26% | ||
+ | |||
+ | * I saved my changes to an Excel spreadsheet and selected File > Save As and selected Excel workbook (.xls) from the drop-down menu. | ||
+ | * '''When I looked up the terms in the MAPP window, about 50%-75% of the terms were related to each other. The Excel sheet with increased expression can be found [[Media:Merrell Compiled Raw Data Vibrio TR 20151027-Criterion1-GO.xlsx| here]]. The sheet for decreased can be found [[Media:Merrell Compiled Raw Data Vibrio TR 20151027-Criterion0-GO.xlsx| here]]. One observation I could make is that more terms were related in the decreased than in the increased file.''' | ||
+ | * '''I started interpreting these results by looking up the definitions online and on the [http://www.geneontology.org geneontology website].''' | ||
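The custom filters used above for the "Decreased" results (Z Score greater than 2, PermuteP less than 0.05, Number Changed between 4 and 100, and Percent Changed of at least 37%) can be sketched in Python. The function name, the dictionary layout, and all of the numbers in the sample rows are made up for illustration; only the column names and thresholds come from the steps above.

```python
def keep_decreased(row):
    """Apply the four filters used for the "Decreased" GO results."""
    return (
        row["Z Score"] > 2
        and row["PermuteP"] < 0.05
        and 4 <= row["Number Changed"] < 100
        and row["Percent Changed"] >= 37
    )

# Made-up rows; the values are not taken from the real results file.
go_rows = [
    {"GO Name": "protein folding", "Z Score": 4.1, "PermuteP": 0.001,
     "Number Changed": 12, "Percent Changed": 52.0},
    {"GO Name": "membrane", "Z Score": 2.5, "PermuteP": 0.01,
     "Number Changed": 150, "Percent Changed": 40.0},
    {"GO Name": "zinc ion binding", "Z Score": 1.2, "PermuteP": 0.20,
     "Number Changed": 6, "Percent Changed": 38.0},
]
filtered = [row["GO Name"] for row in go_rows if keep_decreased(row)]
```

For the "Increased" results, the same sketch would apply with the Percent Changed threshold lowered to 26.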
+ | |||
+ | From what Merrell et al. discussed in their paper, it would seem that the V. cholerae bacteria thrive in the acidic environment that exists in the gastrointestinal tract of humans. However, it would also seem that the lower the pH, the more likely an infection will occur in a human. The results from Merrell et al. show that as the pH of the tract increases, the expression of certain genes in V. cholerae also increases in order to allow the organism to survive the harsher conditions. The statistical analysis that we performed for this assignment suggests that this would also be the case from the varying expressions for the human gastrointestinal tract such as the ones that center around the proteins. Changes in gene expression would enable V. cholerae to survive longer even in the pH conditions that exist within a person's stomach. The following GO terms: protein folding, chorismate metabolic process, aromatic amino acid family biosynthetic process, unfolded protein binding, dicarboxylic acid metabolic process, protein-N(PI)-phosphohistidine-sugar phosphotransferase activity, | ||
aromatic amino acid family metabolic process, phosphoenolpyruvate-dependent sugar phosphotransferase system, translational elongation, translation elongation factor activity, peptidyl-prolyl cis-trans isomerase activity, cis-trans isomerase activity, endonuclease activity, glucose catabolic process, hexose catabolic process, translation factor activity, nucleic acid binding, translation regulator activity, ribonuclease activity, monosaccharide catabolic process, GTPase activity, suggest that there are protein processes that help the bacteria survive in the enzyme-filled space of our digestive tracts. From here, I believe that these GO terms are related to the pathogenicity of the bacterium in that they describe how the V. cholerae bacterium is somehow able to keep living in the stomach of patients even when the pH fluctuates. | ||
+ | |||
+ | * '''The gmf file that was created, which is necessary to re-open the results in MAPPFinder, is located [[Media:Merrell Compiled Raw Data Vibrio TR 20151022.gmf | here]].''' | ||
+ | |||
+ | === Conclusion === | ||
+ | |||
+ | * Our class analyzed raw microarray data for ''Vibrio cholerae'' using the 2009 and 2010 versions of the Gene Database. We first had to conduct a statistical analysis of the raw data using Excel in order to scale, center, and normalize the log fold changes of the genes in the raw data and observe their significance. We then used Dr. Dahlquist's post-doc project called GenMAPP, which used our processed data to generate a gene expression profile of the DNA microarray. After using GenMAPP and MAPPFinder to observe how the GO terms are connected to the experiment with ''Vibrio cholerae'', we compared our results to those already reported by Merrell et al. for the same organism. Observing how our results matched up with theirs, we can argue that the expression changes in the organism are a response to certain environmental stresses in the places where these organisms are usually found. | ||
+ | |||
+ | ==== List of Files to Upload ==== | ||
+ | |||
+ | The following files have been zipped together [[Media:GenMAPP and MAPPFinder files.zip | here]] for easy download and access: | ||
+ | |||
+ | # Exceptions file when I imported data into GenMAPP: <code>.EX.txt</code> | ||
+ | # Expression Dataset file: <code>.gex</code> | ||
+ | # GO results file: <code>XXX-CriterionX-GO.txt</code> | ||
+ | # GO results saved as an Excel spreadsheet with filters applied: <code>.xls</code> | ||
+ | # The MAPP I looked at: <code>.mapp</code> | ||
+ | # The MAPPFinder GO mappings file: <code>.gmf</code> | ||
+ | |||
{{Template:Troque_Journal}} | {{Template:Troque_Journal}} |
Latest revision as of 10:16, 27 October 2015
Sources
- The methods described in Part 1 of this page are taken from this openwetware page.
- The methods for Part 2 have been adapted from this page.
Files created
- The Excel file created on Thursday (October 15, 2015) can be downloaded here.
- A more updated Excel file with the B-H p-value correction can be downloaded here.
- The tab delimited txt file can be seen here
Things to note
- Always save your work when you have a chance.
- For this assignment, my partner was Erich Yanoschik.
- For Part 2, I worked on the 2009 data while Erich worked on the 2010 data.
- On Thursday (October 22), we were assigned to analyze decreased expressions of our data using GenMAPP.
- I met with Erich on Monday (October 26) to work on this assignment.
Part 1
Normalize the log ratios for the set of slides in the experiment
To scale and center the data (between chip normalization) I performed the following operations:
- Inserted a new Worksheet into my Excel file, and named it "scaled_centered".
- Selected all and copied and pasted everything from the "compiled_raw_data" worksheet into this new "scaled_centered".
- Inserted two rows in between the top row of headers and the first data row.
- In cell A2, typed "Average" and in cell A3, typed "StDev".
- I then computed the Average log ratio for each chip (each column of data). In cell B2, I typed the following equation:
=AVERAGE(B4:B5224)
(Note: We tried to use the keyboard shortcut CTRL + Shift + Down, but row 363 has a missing value, so we had to manually type in "B5224" for the end of all the data.)
- and pressed "Enter". Excel then computed the average value of the cells specified in the range given inside the parentheses. Another approach for selecting all of the cells we needed was, instead of typing the cell designations, we could have clicked on the beginning cell, scrolled down to the bottom of the worksheet, and shift-clicked on the ending cell.
- I then computed the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, I typed the following equation:
=STDEV(B4:B5224)
- and pressed "Enter".
- I copied these two equations (cells B2 and B3) and pasted them into the empty cells in the rest of the columns. Excel automatically changed the equation to match the cell designations for those columns.
- Now that the average and standard deviations of the log ratios have been computed for each chip, it's time for the scaling and centering based on these values.
- I copied the column headings for all of my data columns and then pasted them to the right of the last data column so that I had a second set of headers above blank columns of cells. I edited the names of the columns so that they now read: A1_scaled_centered, A2_scaled_centered, etc.
- In cell N4, I typed the following equation:
=(B4-B$2)/B$3
- In this case, I wanted the data in cell B4 to have the average (cell B2) subtracted from it and then be divided by the standard deviation (cell B3). I used the dollar sign symbols in front of the "2" and "3" to tell Excel to always reference those rows in the equation, even though I would paste it down the entire column of 5221 genes. This was important since we only want the first reference to change (i.e., B4 to B5, and so on) when we drag the equation down to the other cells in the column, not the references to B2 and B3. Since B2 and B3 contain the average and standard deviation for the chip and the values below them are the actual data, fixing those rows ensures that every data cell is scaled by the same two values.
- I copied and pasted this equation into the entire column. One easy way to do this is to click on the original cell with the equation and position the cursor at the bottom right corner. The cursor then changes into a thin black plus sign (not a chubby white one). Double-clicking at that point copies the formula to the entire column of genes.
- I then copied and pasted the scaling and centering equation into each of the columns of data with the "_scaled_centered" column header.
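The scaling and centering steps above can also be sketched outside of Excel. The following is only an illustration in Python, not part of the assigned protocol; the function name `scale_and_center` and the sample values are made up, and missing cells are represented as `None` on the assumption that they should be skipped the way Excel's AVERAGE and STDEV skip blank cells.

```python
from statistics import mean, stdev


def scale_and_center(column):
    """Return (x - average) / stdev for each value; missing cells stay None."""
    present = [x for x in column if x is not None]
    avg = mean(present)
    sd = stdev(present)  # sample standard deviation (n - 1), like Excel's STDEV
    return [None if x is None else (x - avg) / sd for x in column]


# Made-up log ratios for one chip, with one missing value.
chip = [0.5, -1.0, None, 2.0, 0.5]
scaled = scale_and_center(chip)
```

After this step, the non-missing values of each chip have an average of 0 and a standard deviation of 1, which is what the "_scaled_centered" columns hold.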
Perform statistical analysis on the ratios
This step uses the scaled and centered data produced in the previous step. The following operations are what I executed:
- I inserted a new worksheet and named it "statistics" and copied the first column ("ID") from the "scaled_centered" worksheet into this new worksheet.
- I pasted the data into the first column of the new "statistics" worksheet.
- I went back to the "scaled_centered" worksheet and copied the columns that are designated "_scaled_centered".
- I then went to my new worksheet and clicked on the B1 cell and selected "Paste Special" from the Edit menu. A window opened; I clicked on the radio button for "Values" and clicked OK. This pasted the numerical result into my new worksheet instead of the equation which must make calculations on the fly.
- I then deleted Rows 2 and 3 where it says "Average" and "StDev" so that the data rows with gene IDs are immediately below the header row 1.
- Next, I went to a new column on the right of my worksheet and typed the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
- Excel computed the average log fold change for the replicates for each patient when I typed the equation:
=AVERAGE(B2:E2)
- into cell N2. I copied this equation and pasted it into the rest of the column.
- I created the equation for patients B and C as well and pasted it into their respective columns.
- I then needed to compute the average of the averages. I typed the header "Avg_LogFC_all" into the first cell in the next empty column and created the equation that will compute the average of the three previous averages I calculated and pasted it into this entire column.
- I inserted a new column next to the "Avg_LogFC_all" column that I computed in the previous step and labeled the column "Tstat". This will compute a T statistic that tells whether the scaled and centered average log ratio is significantly different than 0 (no change). Then I entered the equation:
=AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
- (NOTE: in this case the number of replicates is 3.) Next, I copied the equation and pasted it into all rows in that column.
- I labeled the top cell in the next column "Pvalue". In the cell below the label, I entered the equation:
=TDIST(ABS(R2),degrees of freedom,2)
The number of degrees of freedom is the number of replicates minus one, so in our case there are 2 degrees of freedom. I copied the equation and pasted it into all rows in that column.
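As a cross-check on the Tstat and Pvalue columns, here is a rough Python sketch of the same two calculations. The function names and the sample numbers are hypothetical. The p value function relies on the fact that, for exactly 2 degrees of freedom, the t distribution has a simple closed-form CDF; it is an assumption of this sketch that there are always 3 replicates, and it would not be valid for any other number of degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(values):
    """One-sample t statistic against 0: average / (stdev / sqrt(n))."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))


def two_tailed_p_df2(t):
    """Two-tailed p value for a t statistic with exactly 2 degrees of freedom.

    For df = 2 the CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    so 2 * (1 - F(|t|)) simplifies to the expression below.
    """
    return 1 - abs(t) / sqrt(t * t + 2)


# Made-up average log fold changes for the three patients.
avg_logfc = [1.0, 2.0, 3.0]
t = t_statistic(avg_logfc)
p = two_tailed_p_df2(t)
```

For other numbers of replicates, a statistics library with a general t distribution would be needed instead of the closed form.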
Calculate the Bonferroni p value Correction
- Before doing the following, I selected all of the first row and clicked on "Sort & Filter" -> "Filter" for the sanity check portion of the assignment. On the dropdown button for the Pvalue header, I went to "Number Filters", then selected "Less Than" and entered "0.05" for the text box next to "is less than". In the bottom left corner of Excel, I got 948 results.
- Then, I performed adjustments to the p value to correct for the multiple testing problem. I went ahead and labeled the next two columns to the right with the same label, Bonferroni_Pvalue.
- The equation for this is
=(Pvalue)*5221
(In this case, the Pvalue is in cell S2.) Upon completing this single computation, I used the trick to copy the formula throughout the column.
- Then I replaced any corrected p value greater than 1 with the number 1 by typing the following formula into the first cell below the second Bonferroni_Pvalue header:
=IF(T2>1,1,T2)
I also used the trick to copy the formula throughout this column.
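The two Bonferroni columns amount to one small calculation, sketched here in Python for illustration. The function name and the p values are made up, and the number of tests is taken to be the length of the list (5221 in the spreadsheet).

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]


# Made-up unadjusted p values for four genes.
corrected = bonferroni([0.001, 0.02, 0.5, 0.04])
```

Here `min(..., 1.0)` plays the same role as the IF formula in the second Bonferroni_Pvalue column.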
Checkpoint: The Excel file created from doing the procedures above can be located here.
Calculate the Benjamini & Hochberg p value Correction
- For this part, I inserted yet another worksheet and named it "B-H_Pvalue".
- I copied and pasted the "ID" column from my previous worksheet into the first column of this new worksheet.
- I inserted a new column on the very left and named it "MasterIndex". I needed to create a numerical index of genes so that I can always sort them back into the same order.
- This is done by typing a "1" in cell A2 and a "2" in cell A3 and using the trick to fill in all the remaining rows:
- I selected both cells with "1" and "2" and hovered my mouse over the bottom-right corner of the selection until it makes a thin black + sign. Double-clicking on the + sign would then fill the entire column with a series of numbers from 1 to 5221 (the number of genes on the microarray).
- For the following, I used Paste special > Paste values so that the values (instead of references to the other columns) are pasted. I copied the unadjusted p values from my previous worksheet and pasted it into Column C.
- I then selected all of columns A, B, and C and clicked the sort button (A to Z) on the toolbar. In the window that appeared, I sorted by Column C, smallest to largest.
- Next, I typed the header "Rank" in cell D1. This column holds a series of numbers in ascending order from 1 to 5221: the p value rank, smallest to largest. As with the "MasterIndex", I typed "1" into cell D2 and "2" into cell D3, selected both cells D2 and D3, and double-clicked on the plus sign on the lower right-hand corner of my selection to fill the column with a series of numbers from 1 to 5221.
- Now I could calculate the Benjamini and Hochberg p value correction. I typed B-H_Pvalue in cell E1. I also entered the following formula in cell E2:
=(C2*5221)/D2
I then copied this formula to the entire column.
- I also typed "B-H_Pvalue" into cell F1.
- With this, I typed the following formula into cell F2:
=IF(E2>1,1,E2)
I pressed Enter and copied that equation to the entire column.
- I selected columns A through F and sorted them by my MasterIndex in Column A in ascending order.
- I copied column F and used Paste special > Paste values to paste it into the next column on the right of my "statistics" sheet.
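The sort, rank, correct, and re-sort steps above can be sketched in Python as follows. This mirrors the spreadsheet procedure exactly as described (p value times the number of genes, divided by the rank, capped at 1); the function name and sample p values are hypothetical. Note that many statistics packages add a further step that forces the adjusted values to be non-decreasing with rank, which the spreadsheet procedure does not do.

```python
def benjamini_hochberg_spreadsheet(p_values):
    """Apply p * n / rank (capped at 1) and return values in the original order."""
    n = len(p_values)
    # Sort gene positions by p value, smallest first. The original position
    # plays the role of the MasterIndex column.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * n / rank, 1.0)
    return adjusted


# Made-up unadjusted p values for four genes.
bh = benjamini_hochberg_spreadsheet([0.01, 0.04, 0.03, 0.5])
```

Writing the result back by original position corresponds to sorting on the MasterIndex before pasting the column into the "statistics" sheet.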
Prepare file for GenMAPP
- For the actual worksheet to feed into the GenMAPP program, I inserted a new worksheet and named it "forGenMAPP".
- I then went back to the "statistics" worksheet and chose Select All and Copy.
- In the new sheet, I clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK. The following steps are to now format this worksheet for import into GenMAPP.
- I selected Columns B through Q (all the fold changes), and selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places and clicked OK.
- Next, I selected all the columns containing the p values. I selected the menu item Format > Cells, and under the number tab, selected 4 decimal places.
- Since it is no longer needed, I deleted the left-most Bonferroni p value column, preserving the one that shows the result of my "if" statement.
- I then inserted a column to the right of the "ID" column. I named the header at the top cell of this column "SystemCode" and filled the entire column (each cell) with the letter "N" using the trick to copy values to the rest of the column.
- Then, I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. (This will be the file type that I fed into GenMAPP). Excel made me click through a couple of warnings because it doesn't like the user going all independent and choosing a different file type than the native .xls so I just clicked OK in all of them. The new *.txt file is now ready for import into GenMAPP.
- I uploaded both the .xls and .txt files (seen at the checkpoint below) that I have just created into my journal page in the class wiki and added my initials to differentiate it from the other students' files so that they don't overwrite it.
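For comparison, the final file layout described above (an "ID" column, then a "SystemCode" column filled with "N", then the data columns, saved as tab-delimited text) could be produced with a short Python sketch like the one below. The function name is hypothetical and the numbers in the example rows are made up; only the column layout follows the steps above.

```python
import csv
import io


def write_genmapp_table(header, rows, out):
    """Write rows as tab-delimited text with a SystemCode column after ID."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0], "SystemCode"] + list(header[1:]))
    for row in rows:
        # Every gene gets the letter "N" as its system code.
        writer.writerow([row[0], "N"] + list(row[1:]))


# Made-up values for two of the gene IDs mentioned later on this page.
buffer = io.StringIO()
write_genmapp_table(
    ["ID", "Avg_LogFC_all", "Pvalue"],
    [("VC0028", "1.65", "0.0474"), ("VC0941", "-0.28", "0.0513")],
    buffer,
)
text = buffer.getvalue()
```

Passing an open file instead of the `io.StringIO` buffer would write the same text to disk.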
Checkpoint: The files created can be found here (Excel) and here (txt).
Here is the Sanity check table I created:
Part 2
- I will be working on the 2009 data.
Files created from Part 2
Files created (a zipped folder with these files can be found near the end of this wiki page at the top of the list of assignments):
Each time I launched GenMAPP, I had to make sure that the correct Gene Database (.gdb) is loaded. The process for doing this is as follows:
- Look in the lower left-hand corner of the window to see which Gene Database has been selected.
- If I needed to change the Gene Database, I selected Data > Choose Gene Database. Then I navigated to the directory C:\GenMAPP 2 Data\Gene Databases and chose the correct one for my species.
- For this assignment, I had to download the appropriate Vibrio cholerae Gene Database.
- Half of the class used the Vc-Std_External_20090622.gdb Gene Database that was initially created by the Fall 2008 Biological Databases class.
- I downloaded this Gene Database from this link on the XMLPipeDB SourceForge Download page.
- Mine happened to be the 2009 Gene Database, and my partner worked on the 2010 one, the database that Drs. Dahlquist and Dionisio
- I clicked on the link for the Gene Database to which I have been assigned, downloaded the file, and saved it into the folder C:\GenMAPP 2 Data\Gene Databases, and extracted it.
Checkpoint: There were 772 errors in the 2009 data file. The gex file is located here. The EX.txt file can be found here.
GenMAPP Expression Dataset Manager Procedure
- I launched the GenMAPP Program and checked to make sure the correct Gene Database is loaded.
- This can be done by looking in the lower, left-hand corner of the main GenMAPP Drafting Board window to see the name of the Gene Database that is loaded.
- Next, I selected the Data menu from the main Drafting Board window and chose Expression Dataset Manager from the drop-down list and waited for the Expression Dataset Manager window to open.
- Then, I selected New Dataset from the Expression Datasets menu, selected the tab-delimited text file that I formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appears.
- The Data Type Specification window then appeared. GenMAPP expected that I provide numerical data. If any of the columns had text (character) data, I would check the box next to the field (column) name.
- The Vibrio data we have been working with does not have any text (character) data in it.
- Then, I allowed the Expression Dataset Manager to convert my data.
- This probably only took a couple of seconds to 1 minute. When the process completed, the converted dataset was active in the Expression Dataset Manager window and the file saved in the same folder the raw data file was in, named the same except with a .gex extension; for example, MyExperiment.gex.
- Here is an image of what the loading screen should look like:
- A message appeared saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generated an error during the conversion of a raw data file were not added to the Expression Dataset. Instead, an exception file was created. The exception file is given the same name as my raw data file with .EX before the extension (e.g., MyExperiment.EX.txt). The exception file contained all of my raw data, with the addition of a column named ~Error~. This column contained either error messages or, if the program found no errors, a single space character.
- The number of errors came out to be 772. After opening the EX.txt file, I discovered that all 772 errors that came up when the file was processed were caused by "Gene not found in OrderedLocusNames or any related system."
- Between my partner and me, I had the larger number of errors: his file had only 121 errors while mine had 772. I think this is because genes that were missing from the 2009 database may have been added to the 2010 database.
- I then uploaded my exceptions file, EX.txt, to my wiki page.
- I customized the new Expression Dataset by creating new Color Sets which contained the instructions to GenMAPP for displaying data on MAPPs.
- Color Sets contain the instructions to GenMAPP for displaying data from an Expression Dataset on MAPPs. I created a Color Set by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determined how a gene object is colored on the MAPP. I entered a name in the Color Set Name field that is 20 characters or fewer.
- The Gene Value is the data displayed next to the gene box on a MAPP. I selected the column of data used as the Gene Value from the drop down list or selected [none]. I used "Avg_LogFC_all" for the Vibrio dataset I just created.
- I activated the Criteria Builder by clicking the New button.
- I then entered a name for the criterion in the Label in Legend field.
- Next, I chose a color for the criterion by left-clicking on the Color box. I chose green for decreased and red for increased expression.
- I also stated the criterion for color-coding a gene in the Criterion field.
- After completing a new criterion, I added the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button.
- For the Vibrio dataset, I created two criteria. "Increased" was [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05 and "Decreased" was [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05.
- The buttons to the right of the list represented actions that can be performed on individual criteria. I modified a criterion label, color, or the criterion itself, by first selecting the criterion in the list by left-clicking on it, and then clicking the Edit button. This put the selected criterion into the Criteria Builder to be modified. I then clicked the Save button to save changes to the modified criterion; I clicked the Add button to add it to the list as a separate criterion. The order of Criteria in the list has significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examines the expression data for a particular gene object and applies the color for the first criterion in the list that is true. Therefore, it is imperative that when criteria overlap the user put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, left-click on the criterion to select it and then click the Move Up or Move Down buttons. No criteria met and Not found are always the last two positions in the list.
- I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
- I then exited the Expression Dataset Manager to view the Color Sets on a MAPP by clicking the close box in the upper right hand corner of the window.
- This is the result of entering the criterion:
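The way the Color Set criteria are applied (the first criterion in the list that is true decides the color, and anything else falls through to "No criteria met") can be illustrated with a small Python sketch. This is not GenMAPP's own code; the `classify` function and the sample genes are hypothetical, while the thresholds are the ones I entered for the Vibrio dataset.

```python
# Ordered list of (label, test) pairs; order matters because the first
# criterion that evaluates to true is the one that gets applied.
CRITERIA = [
    ("Increased", lambda g: g["Avg_LogFC_all"] > 0.25 and g["Pvalue"] < 0.05),
    ("Decreased", lambda g: g["Avg_LogFC_all"] < -0.25 and g["Pvalue"] < 0.05),
]


def classify(gene):
    """Return the label of the first criterion the gene satisfies."""
    for label, test in CRITERIA:
        if test(gene):
            return label
    return "No criteria met"


# Made-up expression values for three genes.
down = classify({"Avg_LogFC_all": -0.8, "Pvalue": 0.01})
up = classify({"Avg_LogFC_all": 0.8, "Pvalue": 0.01})
neither = classify({"Avg_LogFC_all": 0.8, "Pvalue": 0.2})
```

Because these two criteria cannot both be true for the same gene, their order does not matter here, but it would for overlapping criteria.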
MAPPFinder Procedure
- I launched the MAPPFinder program.
- I made sure that the Gene Database for the correct species is loaded. The name of the Gene Database appears at the bottom of the window. If this is not the right one, go to File > Choose Gene Database and choose the correct one. (The Gene Databases are stored in the folder C:\GenMAPP 2 Data\Gene Databases\.)
- I clicked on the button "Calculate New Results".
- I clicked on "Find File".
- MAPPFinder found it for me already.
- I chose the Color Set and Criteria with which to filter the data and clicked on the "Decreased" criterion in the right-hand box.
- I checked the boxes next to "Gene Ontology" and "p value".
- I then clicked the "Browse" button and created a meaningful filename for my results.
- I also clicked "Run MAPPFinder". The analysis took only a couple of minutes.
- When the results have been calculated, a Gene Ontology browser opened showing the results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I then browsed through the tree to see my results.
- To see a list of the most significant Gene Ontology terms, I clicked on the menu item "Show Ranked List".
- The top 10 terms that I got were:
- protein folding
- aromatic amino acid family biosynthetic process
- chorismate metabolic process
- unfolded protein binding
- cytoplasm
- membrane
- protein-N(PI)-phosphohistidine-sugar phosphotransferase activity
- phosphoenolpyruvate-dependent sugar phosphotransferase system
- zinc ion binding
- intracellular part
- When comparing with my partner, our top 10 were almost completely different except for 2: cytoplasm, and protein folding. I think this is the case because of the updates that could have been done to the newer database.
- First, in the main MAPPFinder Browser window, I clicked on the button "Collapse the Tree". Then, I searched for the genes that were mentioned by Merrell et al. (2002): VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. I typed the identifier for one of these genes into the MAPPFinder browser gene ID search field. I chose "OrderedLocusNames" from the drop-down menu to the right of the search field and clicked on the GeneID Search button. The GO term(s) that are associated with that gene were highlighted in blue. These are the GO terms associated with each of the genes:
- VC0028: Not found
- VC0941: Not found
- VC0869: Not found
- VC0051: Not found
- VC0647: transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity, 3'-5'-exoribonuclease activity, RNA binding, cytoplasm, RNA processing, and mRNA catabolic process.
- VC0468: Not found
- VC2350: Not found
- VCA0583: transport, outer membrane-bounded periplasmic space, and transporter activity.
- The results that I got were not the same as my partner's. I think this is because the 2010 database is more updated, and thus, has more genes that would match.
- I clicked on one of the GO terms associated with one of the genes I looked up in the previous step. A MAPP opened listing all of the genes (as boxes) associated with that GO term. The genes named within the MAPP are based on the UniProt identification system. To match a gene of interest to its identifier, I went to the UniProt site and typed the gene ID into the search bar. The genes on the MAPP are also color-coded with the gene expression data from the microarray experiment.
- I clicked on the GO term 3'-5'-exoribonuclease activity, which is associated with the gene VC0647. The expression of the gene decreased significantly, since the color of PNP_VIBCH was green on the MAPP window.
- I then double-clicked on the gene box. This opened an Internet Explorer window called the "Backpage" for this gene. This page has links to pages for this gene in the public databases. This gene is involved in mRNA degradation. It catalyzes the phosphorolysis of single-stranded polyribonucleotides processively in the 3'- to 5'-direction.
- The MAPP that was created was stored in the directory, C:\GenMAPP 2 Data\MAPPs\VC GO. The mapp file can be located here.
- In Windows, I made a copy of my results (XXX-CriterionX-GO.txt) file.
- "XXX" refers to the name I gave to my results file.
- In my case, "Criterion0" is the criterion for decreased expression and "Criterion1" is for increased since I started out doing the decreased expression.
- The criterion files can be accessed here (decreased) and here (increased).
- I launched Microsoft Excel and opened the copies of the .txt files in Excel. This showed the same data that was in the MAPPFinder Browser, but in tabular form.
- I then looked at the top of the spreadsheet. There are rows of information that gave the background information on how MAPPFinder made the calculations. I compared this information with my partner who used a different version of the Vibrio Gene Database.
- I filtered this list to show the top GO terms represented in my data for both the "Increased" and "Decreased" criteria. I actually managed to filter it down to exactly 20 terms. I used the custom filter by clicking on the drop-down arrow for the column I wished to filter and chose "(Custom…)". A window opened giving me choices on how I wanted to filter. I set these two filters:
Z Score (in column N) greater than 2; PermuteP (in column O) less than 0.05
- I used these filters to narrow down the results to just 20 (decreased):
Number Changed (in column I) greater than or equal to 4 AND less than 100; Percent Changed (in column L) greater than or equal to 37%
- I used these filters to narrow down the results to just 21 (increased):
Percent Changed (in column L) greater than or equal to 26%
- I saved my changes to an Excel spreadsheet and selected File > Save As and selected Excel workbook (.xls) from the drop-down menu.
- When I looked up the terms in the MAPP window, about 50%-75% of the terms were related to each other. The Excel sheet with increased expression can be found here. The sheet for decreased can be found here. One observation I could make is that more terms were related in the decreased than in the increased file.
- I started interpreting these results by looking up the definitions online and on the geneontology website.
From what Merrell et al. discussed in their paper, it would seem that V. cholerae thrives in the acidic environment that exists in the gastrointestinal tract of humans. However, it would also seem that the lower the pH, the more likely an infection will occur in a human. The results from Merrell et al. show that as the pH of the tract increases, the expression of certain genes in V. cholerae also increases, allowing the organism to survive the harsher conditions. The statistical analysis that we performed for this assignment suggests the same thing: expression varied for processes relevant to the human gastrointestinal tract, such as the ones that center on proteins. Changes in gene expression would enable V. cholerae to survive longer even in the pH conditions that exist within a person's stomach. The following GO terms: protein folding, chorismate metabolic process, aromatic amino acid family biosynthetic process, unfolded protein binding, dicarboxylic acid metabolic process, protein-N(PI)-phosphohistidine-sugar phosphotransferase activity, aromatic amino acid family metabolic process, phosphoenolpyruvate-dependent sugar phosphotransferase system, translational elongation, translation elongation factor activity, peptidyl-prolyl cis-trans isomerase activity, cis-trans isomerase activity, endonuclease activity, glucose catabolic process, hexose catabolic process, translation factor activity, nucleic acid binding, translation regulator activity, ribonuclease activity, monosaccharide catabolic process, GTPase activity, suggest that there are protein processes that help the bacteria survive in the enzyme-filled space of our digestive tracts. From here, I believe that these GO terms are related to the pathogenicity of the bacterium because they describe how V. cholerae is able to keep living in the stomach of patients even when the pH fluctuates.
- The gmf file that was created, which is necessary to re-open the results in MAPPFinder, is located here.
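The custom filters that I applied to the "Decreased" results in Excel can be written out as a single test, sketched here in Python for illustration. The column names "Z Score", "PermuteP", "Number Changed", and "Percent Changed" follow the spreadsheet; the "GO Name" key, the function name, and all of the numbers in the sample rows are made up.

```python
def passes_filters(row, min_changed=4, max_changed=100, min_percent=37.0):
    """Return True if a GO term row passes the filters used for 'Decreased'."""
    return (
        row["Z Score"] > 2
        and row["PermuteP"] < 0.05
        and min_changed <= row["Number Changed"] < max_changed
        and row["Percent Changed"] >= min_percent
    )


# Made-up result rows; only the first one passes every filter.
rows = [
    {"GO Name": "protein folding", "Z Score": 3.1, "PermuteP": 0.001,
     "Number Changed": 6, "Percent Changed": 46.2},
    {"GO Name": "cytoplasm", "Z Score": 2.4, "PermuteP": 0.02,
     "Number Changed": 120, "Percent Changed": 40.0},
    {"GO Name": "membrane", "Z Score": 1.5, "PermuteP": 0.2,
     "Number Changed": 10, "Percent Changed": 38.0},
]
kept = [row["GO Name"] for row in rows if passes_filters(row)]
```

The thresholds are keyword arguments so that a different cutoff, such as the 26% used for the "Increased" list, can be tried without rewriting the test.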
Conclusion
- Our class analyzed raw microarray data for Vibrio cholerae using the 2009 and 2010 versions of the Gene Database. We first had to conduct a statistical analysis of the raw data using Excel in order to scale, center, and normalize the log fold changes of the genes in the raw data and observe their significance. We then used Dr. Dahlquist's post-doc project called GenMAPP, which used our processed data to generate a gene expression profile of the DNA microarray. After using GenMAPP and MAPPFinder to observe how the GO terms are connected to the experiment with Vibrio cholerae, we compared our results to those already reported by Merrell et al. for the same organism. Observing how our results matched up with theirs, we can argue that the expression changes in the organism are a response to certain environmental stresses in the places where these organisms are usually found.
List of Files to Upload
The following files have been zipped together here for easy download and access:
- Exceptions file when I imported data into GenMAPP: .EX.txt
- Expression Dataset file: .gex
- GO results file: XXX-CriterionX-GO.txt
- GO results saved as an Excel spreadsheet with filters applied: .xls
- The MAPP I looked at: .mapp
- The MAPPFinder GO mappings file: .gmf