Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
Jump to: navigation, search
(Added Gene Database Testing Report Sample outline)
(added apostrophe)
 
(36 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Screenshots==
+
==Files Asked for in the Gene Database Testing Report==
[[File:TallyresultsBK1029.png]]
+
For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file: [[File:Bklein7 Week9 Files.zip]].
[[File:PostgresIDsBK1029.png]]
+
[[File:PostgresIDsBK1029 UPDATED.png]]
+
[[File:XmlpipedbmatchoutputBK1029.png]]
+
[[File:XmlpipedbmatchoutputBK1029 Updated.png]]
+
==Gene Database Testing Report==
+
===Export Informatio=n==
+
  
Version of GenMAPP Builder:
+
==Pre-requisites==
 +
The following set of software was used in the creation and testing of the ''Vibrio cholerae'' gene database:
  
Computer on which export was run:  
+
# [http://www.7-zip.org/ 7-zip]tool that for unpacking .gz and .zip files
 +
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
 +
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
 +
# Java JDK 1.8 64-bit
 +
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
 +
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
 +
# Microsoft Access for reading .mdb files
  
Postgres Database name:
+
==Gene Database Creation==
 +
===Downloading Data Source Files and GenMAPP Builder===
  
UniProt XML filename (give filename and upload and link to compressed file):
+
*I download the UniProt XML, GOA, and GO OBO-XML files for ''Vibrio cholerae'' along with the GenMAPP Builder program.
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]):
+
**All files were saved to the folder ''Bklein7_Week9'' on my computer's ThawSpace.
* UniProt XML download link:
+
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
* Time taken to import:
+
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.
** Note:
+
  
GO OBO-XML filename (give filename and upload and link to compressed file):
+
====UniProt XML====
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
+
* GO OBO-XML download link:
+
* Time taken to import:
+
* Time taken to process:
+
** Note:
+
  
GOA filename (give filename and upload and link to compressed file):
+
* I went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]):
+
**From there, I navigated to the complete proteome download page for [http://www.uniprot.org/uniprot/?query=organism:243277 Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961)].
* GOA download link:
+
** I clicked on the "Download" button at the top of the page above and selected the following options:
* Time taken to import:
+
***"Download all"
** Note:
+
***"XML" from the "Format" drop-down menu
 +
***"Compressed" format
 +
**I extracted the file using [http://www.7-zip.org/ 7-zip].
  
Name of .gdb file (give filename and upload and link to compressed file):
+
====GOA====
* Time taken to export:
+
** Start time:
+
** End time:
+
  
Note:
+
* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
 +
*Within the above site, I navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa GO annotations for ''Vibrio cholerae''].
 +
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.
  
===TallyEngine===
+
====GO OBO-XML====
  
* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
+
* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
+
*I extracted the file using [http://www.7-zip.org/ 7-zip].
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
+
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]
+
  
===Using XMLPipeDB match to Validate the XML Results from the TallyEngine===
+
====Downloaded GenMAPP Builder====
  
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
+
# I downloaded the GenMAPP Builder zip folder: [https://github.com/lmu-bioinformatics/xmlpipedb/releases/download/untagged-bd04fffc4da853fedf30/gmbuilder-3.0.0-build-5.zip Download gmbuilder-3.0.0-build-5.zip].
 +
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].
  
Are your results the same as you got for the TallyEngine?  Why or why not?
+
===Creating the New Database in PostgreSQL===
  
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
+
* I launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
 +
** On this server, I created a new database: ''V.cholerae_20151027_gmb3build5''.
 +
** I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
 +
*** I clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
 +
*** I clicked on the Execute Query icon to run this command.
 +
***In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.
  
For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]
+
===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===
  
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
+
* To begin, I launched gmbuilder.bat.
 +
* I selected the "Configure Database" option and entered the following information into the fields below:
 +
** Host or address: localhost
 +
** Port number: 5432
 +
** Database name: V.cholerae_20151027_gmb3build5
 +
** Username: postgres
 +
** Password: Welcome1
  
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
+
===Importing Data into the PostgreSQL Database===
  
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
+
*The downloaded data files for ''Vibrio cholerea'' were specified and imported into the database by clicking on the following buttons:
 +
** Selected File > Import UniProt XML...
 +
** Selected File > Import GO OBO-XML...
 +
** Clicked OK to the message asking to process the GO data.
 +
** Selected File > Import GOA...
  
[[Image:Pgadminiii-query.png]]
+
===Exporting a GenMAPP Gene Database (.gdb)===
  
Are your results the same as reported by the TallyEngine?  Why or why not?
+
* I selected File > Export to GenMAPP Gene Database... to begin the export process.
 +
* I typed my name in the owner field (Brandon Klein).
 +
* I selected''Vibrio cholerae'' as the gene database species and then clicked ''Next''.
 +
* The database was saved as ''Vc-Std_BK_20151027''.
 +
* I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
 +
* Finally, I clicked the "Next" button to begin the export process.
  
===OriginalRowCounts Comparison===
+
==Gene Database Testing Report==
  
Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
+
===Export Information===
  
Benchmark .gdb file:
+
Version of GenMAPP Builder: Version 3.0.0 Build 5
  
Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:
+
Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room
  
Note:
+
Postgres Database name: ''V.cholerae_20151027_gmb3build5''
  
===Visual Inspection===
+
UniProt XML filename: [[Media:Uniprot-organism-243277.zip|uniprot-organism-243277_BK_20151027.xml]]
 +
* UniProt XML version: UniProt release 2015_10 (version information can be accessed on [http://uniprot.org/news the UniProt News Page])
 +
* UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
 +
* Time taken to import: 2.92 minutes
 +
** Note: The import time was similar to my homework partner's, and no interruptions occurred during this process.
  
Perform visual inspection of individual tables to see if there are any problems.
+
GO OBO-XML filename: [[Media:Go daily-termdb.zip|go_daily-termdb_BK_20151027.obo-xml]]
 +
* GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
 +
* GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
 +
* Time taken to import: 6.88 minutes
 +
* Time taken to process: 4.49 minutes
 +
** Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.
 +
GOA filename: [[Media:46.V cholerae ATCC 39315.zip|46.V_cholerae_ATCC_39315_BK_20151027.goa]]
 +
* GOA version: File last modified on Wednesday, October 14, 2015 (accessed on the [http://www.ebi.ac.uk/GOA/ UniProt-GOA news page])
 +
* GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
 +
* Time taken to import: 0.06 minutes
 +
** Note: This import process was equally quick for my homework partner and presented no complications.
 +
Name of .gdb file: [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]]
 +
* Time taken to export: 1 hour, 27 minutes
 +
** Start time: 4:44 PM (restarted export after class)
 +
** End time: 6:11 PM
 +
**Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.
  
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
+
===TallyEngine===
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
+
  
Note:
+
* I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
 +
** I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
 +
**[[File:TallyresultsBK1029.png]]
 +
***XML and database counts were consistent among all assessed data types.
 +
***Note: the "Ordered Locus" count was 3831.
  
===.gdb Use in GenMAPP===
+
===Using XMLPipeDB match to Validate the XML Results from the TallyEngine===
  
Note:
+
On the command line (''cmd'' on Windows), I entered the folder ''Bklein7_Week9'' to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:
  
===Putting a gene on the MAPP using the GeneFinder window===
+
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >     
 +
OrderedLocusNames.txt
  
* Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.
+
The results of this match are pictured below.
 +
*[[File:XmlpipedbmatchoutputBK1029.png]]
  
Note:
+
*The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
 +
*I investigated this discrepancy by opening the gene database file ''Vc-Std_BK_20151027.gdb'' in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
 +
**Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.
  
===Creating an Expression Dataset in the Expression Dataset Manager===
+
The second match I ran accounted for the issues above. The command sequence used is as follows:
  
* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
+
java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" 
 +
> OrderedLocusNames.txt
  
Note:
+
The results of this match are pictured below.
 +
*[[File:XmlpipedbmatchoutputBK1029 Updated.png]]
  
===Coloring a MAPP with expression data===
+
*With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
 +
**Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.
  
Note:
+
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
  
===Running MAPPFinder===
+
The ''gene/name'' tags in the XML file land in the ''genenametype'' table, which contains ''ordered locus'' values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:
  
Note:
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
  
 +
The results of this query are pictured below.
 +
[[File:PostgresIDsBK1029.png]]
  
===Compare Gene Database to Outside Resource===
+
*The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
 +
**As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.
  
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML.  It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.)  Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
+
I ran a second SQL query to include the pattern VC_A#### in the count output:
  
Note:
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
  
==Pre-requisites==
+
The results of this query are pictured below.
This tutorial assumes that you are working in a Windows environment. While it is possible to run GenMAPP Builder under the Mac or Linux OS, the end product, a GenMAPP-compatible Gene Database (.gdb), can only be used with the GenMAPP program, which can only be run on Windows.  [[Software_Configuration | This set of software has already been installed on the computers in the Seaver 120 computer lab.]] If you want to perform this procedure on your own machine, you must set up your working environment with:
+
[[File:PostgresIDsBK1029 UPDATED.png]]
  
# Any tool that can unpack .gz and .zip files
+
*With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
#* We use [http://www.7-zip.org/ 7-zip]
+
**These findings further validated the Tally Engine results.
#* Note that we have found that the native Windows utility cannot reliably unpack .gz files or .zip files containing .jar files.
+
# [http://www.postgresql.org PostgreSQL] on Windows (http://www.enterprisedb.com/products-services-training/pgdownload)
+
#* This tutorial was written using PostgreSQL 9.4.x.
+
# GenMAPP Builder (https://sourceforge.net/projects/xmlpipedb/files/)
+
# Java JDK 1.8 64-bit
+
#* [http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html Download page]
+
#* File to download is: jdk-8u65-windows-x64.exe
+
# GenMAPP 2 can be downloaded [https://github.com/GenMAPPCS/genmapp here].  The file to download is "GenMAPPv2Setup.exe".
+
# XMLPipeDB ''match'' utility (https://sourceforge.net/projects/xmlpipedb/files/) for counting IDs in XML files
+
# Microsoft Access or any other tool that can read .mdb files
+
  
==Download and Extract Data Source Files==
+
===OriginalRowCounts Comparison===
  
*Download UniProt XML, GOA, and GO OBO-XML files.
+
I opened the gene database file [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]] in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the ''OriginalRowCounts'' table.
 +
*A listing named "OrderedLocusNames" was present.
 +
*The table reported 7664 distinct Ordered Locus Names.
 +
**This value is equal to (3831*2)+1
  
=== UniProt XML===
+
Benchmark .gdb file: [[File:Vc-Std External 20101022.gdb]]
  
* Go to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
+
As a sanity check, I compared the information in this table to the ''OriginalRowCounts'' table present in the 2010 Vibrio cholerae gene database that we worked with last week.
** Browse to the complete proteome download page for your species of interest.  For example, to get to ''Vibrio cholerae'' page, first filter the list by clicking on the link for [http://www.uniprot.org/taxonomy/?query=*&fil=ancestor%3A%22Bacteria+%5B2%5D%22+AND+complete%3Ayes "Bacteria"] under the "Superkingdom" heading.
+
*The "OrderedLocusNames" listing was also present here.
** Further filter the results for those species with a [http://www.uniprot.org/taxonomy/?query=*&fil=reference%3Ayes+AND+complete%3Ayes+AND+ancestor%3A%22Bacteria+%5B2%5D%22 "Reference proteome"].
+
*The table reported the same number of distinct Ordered Locus Names: 7664.
** Scroll through the results (you might need to click through several pages) until you find your organism of interest, e.g. ''Vibrio cholerae'' serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
+
*Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
** Click on the link for "UniProtKB", e.g. [http://www.uniprot.org/uniprot/?query=organism:243277 Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961)].
+
*Overall, 10 more tables were listed in the 2015 database.
** Click the "Download" button at the top of the page.  It will open a small dialog box.  You need to select the following options:
+
*** Select the radio button to "Download all"
+
*** Choose "XML" from the "Format" drop-down menu.
+
*** Select the radio button for "Compressed" format.
+
*** Click the "Go" button.
+
  
===GOA===
+
OriginalRowCounts table from the benchmark and new gdb:
 +
*[[File:OriginalRowCounts ComparisonImage.PNG]]
  
* [http://www.ebi.ac.uk/GOA This is the UniProt-GOA home page].
+
Note: The ''OriginalRowCounts'' tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the ''OriginalRowCounts'' tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.
** The current and previous UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
+
** In the directory that appears, click the link to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ "proteomes" directory].
+
*** Note that it may take some time to load this page.
+
** Find your organism of interest and right-click on the link to download the GO annotations and select "Save target as" or "Save link as" and save the GOA file.  For example, [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa this is the link] for ''Vibrio cholerae''.
+
*** Note:  Since the GOA file is a text file, your browser will not automatically download it when you left-click on the link. Instead, it will try to open the file in your browser window. Since it is a large file, this could take a long time if your internet connection is slow.
+
*** The version information can be found on displayed in the ftp file directory under the "Last modified" column.  You will need this information for your Gene Database Testing Report.
+
  
===GO OBO-XML===
+
===Visual Inspection===
  
* Download the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology download page].  Click on the link for "obo-xml.gz" under the heading "Legacy Downloads."
+
I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.
** This file is updated daily.  You can get the day/time that the file was created from the file properties after you have unzipped the file.
+
  
===Extract the UniProt XML and GO OBO-XML files===
+
* First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
 +
**Of the 35 gene ID systems, only 11 have dates listed in the date column.
 +
*** Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
 +
*** The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
 +
* Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
 +
**UniProt
 +
***The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website [http://www.uniprot.org/help/accession_numbers here].
 +
***All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): [[File:UniProt Ascension Number info.PNG]]
 +
***Thus, there are no issues with these 3789 gene IDs.
 +
**RefSeq
 +
***The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
 +
***"NP_" and "YP_" Prefixes
 +
****Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
 +
***"WP_" Prefixes
 +
****Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
 +
***Overall, every entry in the ID column was an expected value.
 +
**OrderedLocusNames
 +
***7664 ID entries were present that fell into 4 different patterns:
 +
****VC_####
 +
****VC_A####
 +
****VC####
 +
****VCA####
 +
***This ID column was analyzed in Microsoft Excel to investigate why there were 7664 entries present (as opposed to the consistent count of 3831 in SQL and XMLPipeDB Match).
 +
****The analysis in Excel revealed that the variations of the the gene IDs with and without underscores were replicates of one another. Presumably, these gene IDs were duplicated during the process of exporting the gene database to account for a formatting issue pertaining to underscores. Regardless, this discovery means that only half of the 7664 entries were unique gene IDs. This leaves us with '''3832''' unique IDs, which is still one more than was expected.
 +
****Further analysis in Excel revealed that both the XML and SQL gene ID's were off by one due to errors, resulting in the erroneous count of 3831. The XML entries did not include the gene ID '''VC_A0360.1''' (the period presented a formatting issue). The SQL entries did include VC_A0360.1, but an error occurred where two gene IDs were merged into one: '''VC_1738/VC_1739'''. The overall number of unique IDs is 3832 when these errors errors are taken into consideration.
  
* Extract the UniProt XML and GO OBO-XML .gz files using [http://www.7-zip.org/ 7-zip] or other utility.
+
Note: The Excel analysis performed above was done following the procedure present on this page- [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Using_Microsoft_Excel_to_Compare_ID_Lists Using Microsoft Excel to Compare ID Lists]. The Excel file I used for the analysis itself can be accessed here: [[File:OrderedLocusNames InvestigationBK1029.xlsx]].
  
==Download or Update GenMAPP Builder==
+
==Vc-Std_BK_20151027.gdb Use in GenMAPP==
  
# Visit the [https://github.com/lmu-bioinformatics/xmlpipedb/releases XMLPipeDB releases page on GitHub].
+
Note: The ability of ''Vc-Std_BK_20151027.gdb'' to be processed was tested in GenMAPP Version 2.1. Before conducting any of the testing below, ''Vc-Std_BK_20151027.gdb'' was selected as the "Gene Database" withing GenMAPP.  
#* [https://github.com/lmu-bioinformatics/xmlpipedb/releases/download/untagged-bd04fffc4da853fedf30/gmbuilder-3.0.0-build-5.zip Download gmbuilder-3.0.0-build-5.zip].
+
# Extract the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip] or other utility.
+
#* We suggest that you move the extracted folder to the "T:" drive on the computers in Seaver 120.  This folder is a "Thawspace", the contents of which will not be deleted by the program Deep Freeze when the computer is restarted.
+
  
==Create New Database in PostgreSQL==
+
===Putting a gene on the MAPP using the GeneFinder window===
 +
To figure out how to perform this function, I referenced the "Gene Finder" tab within the "GenMAPP Help" window.
  
NOTE: if you have already performed this step and want to use GenMAPP Builder functions with a database you previously created in PostgreSQL, you can skip this step.
+
*The following sample IDs were taken from different gene ID systems and tested:
 +
**UniProt- Q9KQM0
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:UniProt Gene Backpage.png]]
 +
**RefSeq- NP_230721
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:Refseq Gene Backpage.png]]
 +
**OrderedLocusNames- VC_0274
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:Orderedlocusnames Gene Backpage.png]]
 +
**GeneID- 2612551
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:GeneID Gene Backpage.png]]
 +
**EnsemblBacteria- AAF96503
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:EnsemblBacteria Gene Backpage.png]]
 +
*Screenshot of all of the sample ID's on a MAPP:
 +
**[[File:All genes on MAPP.png]]
  
* Launch pgAdmin III.
+
Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.
* Double-click on PostgreSQL 9.4 (localhost:5432) on the upper left hand side of the window.
+
** This is the equivalent of connecting you to the server and you may be asked for a password at this point.
+
* Right click on "Databases" and Select "New Database..."
+
* Give the database a name in the "Name" field and click OK.
+
** Take some care in selecting a meaningful name.  It is good practice to at least include the species and today's date in the name.
+
* Double-left-click on your new database name in the treeview on the left.
+
* Click on the SQL icon in the toolbar at the top of the window.
+
** The SQL Editor tab will be open and there may be leftover query text in the upper pane.  Delete this text.  You are now going to use an XMLPipeDB query to create the tables in the database.
+
* Click on the Open File icon in the toolbar (the yellow folder with an arrow).
+
* Navigate to the folder in which you unzipped GenMAPP Builder.
+
* Open the ''sql'' folder and open the file ''gmbuilder.sql''.  You should see SQL code appear in the SQL Editor tab.
+
* Click the Execute Query icon which looks like a green "Play" triangle button.
+
* You should get a series of NOTICE messages in the Messages tab at the bottom of the window, concluding with a message like "Query returned successfully with no result in 15583 ms" in the end.  This query now created all the tables in the database (although there is still no data in them).
+
* Close the query window (you don't need to save the query because you have already run it).
+
* To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public.  Under the Tables section, you should see a count of 167 in parentheses.
+
  
==Configuring GenMAPP Builder to Connect to your PostgreSQL Database==
+
===Creating an Expression Dataset in the Expression Dataset Manager===
 +
The file [[File:Merrell Compiled Raw Data Vibrio BK 20151015- Tab Delimited.txt]] from my Week 8 assignment page was used to create a new Expression Dataset.
  
* Launch gmbuilder.bat.
+
* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression DatasetDo these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
** If the program does not detect a database configuration, you will see a message window to this effect and the configuation dialog will open automatically once you close the message window.  Otherwise:
+
* Select the menu item File > Configure Database...
+
* Under the Database Connections tab the Database Driver defaults to PostgreSQLEnter information in the following fields:
+
** Host or address: localhost
+
** Port number: 5432
+
** Database name: <enter the name of the PostgreSQL database you created above>
+
** Username: <enter the username of the PostgreSQL database you created above>; in S120, this username is "postgres"
+
** Password: <enter the password of the PostgreSQL database you created above>; in S120, ask the instructors for the password.
+
* Click the OK button.
+
  
==Importing Data into the PostgreSQL Database==
+
*5100 of the 5221 gene IDs from the original text file were imported into the Expression Dataset.
 +
*There were 121 exceptions.
 +
*Within the EX.txt file, the error message "Gene not found in OrderedLocusNames or any related system." was listed 121 times.
 +
*Two examples of a genes that encountered the above error message were the genes with the following OrderedLocusNames (in the EX.txt file): VC2209 and VCA1031.
 +
**When both of these IDs were searched in the Microsoft Access "OrderedLocusNames" table, no results were retrieved.
 +
**When both of these IDs were searched in the XMLPipeDB Match results file created earlier in this protocol, no results were retrieved.
 +
**Consequentially, it does not appear that the genes that retrieved error messages were present in the original UniProt XML file.
 +
***This finding would be explain the observation that individuals using the 2010 V. cholerae gene database last week, such as my Week 8 partner Veronica, also retrieved 121 errors during this step.
  
* Select File > Import UniProt XML...
+
Note: 121 errors were encountered during the creation of this new Expression Dataset. A screenshot of the error message is shown here:
** Navigate to the UniProt XML file that you extracted previously and click the Open button.
+
*[[File:GenMAPP Errors 20151102.png]]
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Import Time: 2.92 minutes
+
* Select File > Import GO OBO-XML...
+
** Navigate to the GO OBO-XML file that you extracted previously.  Click the Open button.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process has completed, record the elapsed time from the message window that appears.
+
***Import Time: 6.88 minutes
+
* Click OK to the message asking you to process the GO data.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Processing Time: 4.49 minutes
+
* Select File > Import GOA...
+
** Navigate to the GOA file that you downloaded previously and click the Import button.  This process should only take a minute or so.
+
***Import Time: 0.05 minutes
+
==Exporting a GenMAPP Gene Database (.gdb)==
+
  
* Select File > Export to GenMAPP Gene Database...
+
===Coloring a MAPP with expression data===
* Type a name in the Owner field (or else it won't let you export).
+
** When doing the individual exercise for the [[Week 9]] assignment, use your own name.  When doing this for your team project, use your team's name.
+
* GenMAPP Builder scans your PostgreSQL database to see what species are available.  Click on the species that you would like to export, then click ''Next'' to continue.
+
* Create GenMAPP Database: click on the "Save GenMAPP Database File As..." button.
+
** In the Save dialog box that appears, navigate to the "T:" drive, and then modify the default file name by appending your initials.  Click the "Save" button
+
* Leave the boxes checked for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
+
* Click the "Next" button to begin teh export process.
+
** Record the starting and ending times from the black console window.  This will take 1-2 hours for a typical bacterial genome, depending on the size of the database, the processor speed, and available memory.  Large eukaryotic genomes (like ''Arabidopsis thaliana'') or genomes with many GO annotations (like ''Saccharomyces cerevisiae'') can take much longer, in the range of 12-24 hours.
+
**NOTE:  the progress bar is not accurate.
+
***Start Time: 4:44 PM (restarted export after class)
+
***End Time: 6:11 PM
+
***Elapsed Time: 1 hour, 27 minutes
+
  
==Checking the Quality of your Exported Gene Database==
+
* I customized the new Expression Dataset by creating a Color Set. The new Color Set was entitled "LogFoldChange".
 +
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
 +
***I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
 +
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
 +
***I selected the color for this criterion as red using the color box.
 +
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>
 +
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
 +
***I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
 +
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
 +
***I selected the color for this criterion as green using the color box.
 +
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
 +
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.
  
* It is a good idea to check the quality of your exported Gene Database to make sure that all of the data from the XML files made it into the PostgreSQL database and was then exported to the GenMAPP Gene Database.  We have created a [[Gene Database Testing Report Sample]] to help guide you through this process.
+
Screenshot of Color Set criteria:
== Tally Engine ==
+
*[[File:Colorset.png]]
  
The first tool, called the Tally Engine, can be used for verifying that certain data from the XML file transferred consistently into the PostgreSQL database upon import.  The Tally Engine can be found in GenMAPP Builder itself.
+
Note: No errors were encountered in the creation of the Color Set.
  
# Run PostgreSQL (via pgAdmin III on Windows) and make sure that your database is up and running.
+
===Running MAPPFinder===
# Run GenMAPP Builder and make sure that it is connected to the database (via ''Configure Database...'').
+
# After performing an import, choose ''Run XML and Database Tallies for UniProt and GO...''.
+
# Choose the UniProt and GO files that you imported.
+
# You should see a table for selected data items, and how many of each were found.
+
  
[[Image:Tally-results.png]]
+
* I launched the MAPPFinder program from within GenMAPP and ensured that the ''Vc-Std_BK_20151027.gdb'' Gene Database was still loaded into GenMAPP.
 +
* I clicked on the button "Calculate New Results" followed by "Find File", at which point I the .gex file updated during the creation of the "LogFoldChange" color set.
 +
* I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange Color Set to the data.
 +
* I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
 +
**This analysis took several minutes to complete.
 +
*MAPPFinder analysis results
 +
**Initial result in MAPPFinder Browser:
 +
***[[File:MAPPFinder Result BK20151103.png]]
 +
**I clicked on the menu item "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
 +
**[[File:MAPPFinder RankedList BK20151103.png]]
 +
***As an aside, it is interesting to note that few of the top 10 gene ontology results from this MAPPFinder analysis were conserved from my analysis using the 2009 ''Vibrio cholerae'' gene database. This suggests the high degree of change and progress made in genomics in the last six years.
  
Under the hood, the Tally Engine bases its XML counts on certain XML tags, and bases its database counts on SQL queries using ''count''. This tool is thus primarily useful for making sure that the &ldquo;raw&rdquo; import worked without any errors or glitches.
+
Note: The MAPPFinder analysis took approximately 16 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the gene database created this week.
  
My Tally Results:
+
===Compare Gene Database to Outside Resource===
 
+
I will get back to this question during the Group Final Project, as was allowed:
[[File:TallyresultsBK1029.png]]
+
  This section applies to the Group Final Project and does not need to be completed for the  
 
+
  Week 9 assignment. Kdahlquist (talk) 15:46, 2 November 2015 (PST)
== XMLPipeDB Match ==
+
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XMLIt is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
 
+
XMLPipeDB Match is useful for counting data in files.  Thus, in our context, you would use XMLPipeDB Match to tally stuff in XML files, with greater flexibility than with Tally Engine.
+
 
+
You will have to use XMLPipeDB Match from the command line.  In addition, you can use this in any platform (as you have seen).  Download the application from the [http://sourceforge.net/projects/xmlpipedb XMLPipeDB SourceForge site] and take note of the location of the ''xmlpipedb-match-1.1.1.jar''. Then, on the command line (''Terminal'' for Linux and Mac OS X, ''cmd'' on Windows), ''cd'' to the folder containing the XML file that you would like to check.  Use XMLPipeDB Match as follows, with the parts in parentheses varying depending on your specific setup, desired pattern, and file being scanned:
+
 
+
java -jar (location-of-jar) "(pattern)" < (XML file)
+
 
+
On a Windows machine, with XMLPipeDB Match and a ''Vibrio cholerae'' XML file located on the Desktop, scanning for IDs of the form '''VC_####''', where '''#''' represents a digit from 0 to 9, one would type, after ''cd''-ing to the Desktop:
+
 
+
  java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-taxonomy%3A243277.xml
+
 
+
As you have seen before, this will give you a list of unique matches, with a total number at the bottom.
+
 
+
The trick with XMLPipeDB Match is to use the patterns well: with the database project, you will mainly be matching IDsA desired count is an XMLPipeDB Match result whose matched ID pattern corresponds to the number of IDs found by the Tally Engine.
+
 
+
== SQL ==
+
 
+
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query.  This requires some knowledge of which table received what data.  Here&rsquo;s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table.  A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
+
 
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
+
 
+
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
+
 
+
[[Image:pgadminiii-query.png]]
+
 
+
== Microsoft Access ==
+
 
+
For the GenMAPP Gene Database, you can open the ''.gdb'' in Microsoft Access and navigate its tables to find counts for various IDsOpening the table, noting its size, and doing some sorting may help.  You can also look at the ''OriginalRowCounts'' table for a summary of totals.
+
 
+
Again, the ideal situation is a correspondence in these numbers with what you found in XML and the relational database.
+
 
+
== Back to the Command Line ==
+
 
+
Amidst all this, you can still use ''grep'' and ''wc'' on the command line for some basic counting.  Just remember that these tools work on a line-by-line basis; useful in some cases, but not useful in others.
+
 
+
You can use ''grep'' and ''wc'' with the various files on the ''my.cs.lmu.edu'' server by using the ''curl -O'' command shown in the [[Week 6]] assignment. Upload your data files to the wiki, place media links to them on your wiki page, then mouse over those live links to capture their URL ([[Media:Movie.sql.txt|like this—look at the source to see the wiki markup]]), then use ''curl -O (whatever-the-url-is)'' while ''ssh''-ed to ''my.cs.lmu.edu'' to bring that file into the server.
+
 
+
* If the file is a ''.zip'' file, you can use ''unzip'' at the command line to unzip it.
+
* If the file is a ''.gz'' file, you can use ''gunzip'' at the command line to uncompress that one.
+
 
+
<!-- Future content: advanced techniques for diff'ing grep/match results against psql results, duly processed by sed. -->
+
  
 
==Links==
 
==Links==

Latest revision as of 22:31, 23 November 2015

Files Asked for in the Gene Database Testing Report

For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file: File:Bklein7 Week9 Files.zip.

Pre-requisites

The following set of software was used in the creation and testing of the Vibrio cholerae gene database:

  1. 7-ziptool that for unpacking .gz and .zip files
  2. PostgreSQL on Windows (version 9.4.x)
  3. GenMAPP Builder
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2
  6. XMLPipeDB match utility for counting IDs in XML files
  7. Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

  • I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio cholerae along with the GenMAPP Builder program.
    • All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
    • Files that required extraction were unzipped using 7-zip.
    • Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

GOA

GO OBO-XML

Downloaded GenMAPP Builder

  1. I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
  2. I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

  • I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
    • On this server, I created a new database: V.cholerae_20151027_gmb3build5.
    • I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
      • I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
      • I clicked on the Execute Query icon to run this command.
      • In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

  • To begin, I launched gmbuilder.bat.
  • I selected the "Configure Database" option and entered the following information into the fields below:
    • Host or address: localhost
    • Port number: 5432
    • Database name: V.cholerae_20151027_gmb3build5
    • Username: postgres
    • Password: Welcome1

Importing Data into the PostgreSQL Database

  • The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
    • Selected File > Import UniProt XML...
    • Selected File > Import GO OBO-XML...
    • Clicked OK to the message asking to process the GO data.
    • Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

  • I selected File > Export to GenMAPP Gene Database... to begin the export process.
  • I typed my name in the owner field (Brandon Klein).
  • I selectedVibrio cholerae as the gene database species and then clicked Next.
  • The database was saved as Vc-Std_BK_20151027.
  • I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

  • GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
  • Time taken to import: 6.88 minutes
  • Time taken to process: 4.49 minutes
    • Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

Name of .gdb file: Vc-Std_BK_20151027.gdb

  • Time taken to export: 1 hour, 27 minutes
    • Start time: 4:44 PM (restarted export after class)
    • End time: 6:11 PM
    • Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

  • I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
    • I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
    • TallyresultsBK1029.png
      • XML and database counts were consistent among all assessed data types.
      • Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line (cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below.

  • XmlpipedbmatchoutputBK1029.png
  • The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
  • I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
    • Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below.

  • XmlpipedbmatchoutputBK1029 Updated.png
  • With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
    • Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029.png

  • The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
    • As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029 UPDATED.png

  • With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
    • These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.

  • A listing named "OrderedLocusNames" was present.
  • The table reported 7664 distinct Ordered Locus Names.
    • This value is equal to (3831*2)+1

Benchmark .gdb file: File:Vc-Std External 20101022.gdb

As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.

  • The "OrderedLocusNames" listing was also present here.
  • The table reported the same number of distinct Ordered Locus Names: 7664.
  • Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
  • Overall, 10 more tables were listed in the 2015 database.

OriginalRowCounts table from the benchmark and new gdb:

  • OriginalRowCounts ComparisonImage.PNG

Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Visual Inspection

I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.

  • First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
    • Of the 35 gene ID systems, only 11 have dates listed in the date column.
      • Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
      • The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
  • Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
    • UniProt
      • The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website here.
      • All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): UniProt Ascension Number info.PNG
      • Thus, there are no issues with these 3789 gene IDs.
    • RefSeq
      • The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found here.
      • "NP_" and "YP_" Prefixes
        • Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
      • "WP_" Prefixes
        • Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
      • Overall, every entry in the ID column was an expected value.
    • OrderedLocusNames
      • 7664 ID entries were present that fell into 4 different patterns:
        • VC_####
        • VC_A####
        • VC####
        • VCA####
      • This ID column was analyzed in Microsoft Excel to investigate why there were 7664 entries present (as opposed to the consistent count of 3831 in SQL and XMLPipeDB Match).
        • The analysis in Excel revealed that the variations of the the gene IDs with and without underscores were replicates of one another. Presumably, these gene IDs were duplicated during the process of exporting the gene database to account for a formatting issue pertaining to underscores. Regardless, this discovery means that only half of the 7664 entries were unique gene IDs. This leaves us with 3832 unique IDs, which is still one more than was expected.
        • Further analysis in Excel revealed that both the XML and SQL gene ID's were off by one due to errors, resulting in the erroneous count of 3831. The XML entries did not include the gene ID VC_A0360.1 (the period presented a formatting issue). The SQL entries did include VC_A0360.1, but an error occurred where two gene IDs were merged into one: VC_1738/VC_1739. The overall number of unique IDs is 3832 when these errors errors are taken into consideration.

Note: The Excel analysis performed above was done following the procedure present on this page- Using Microsoft Excel to Compare ID Lists. The Excel file I used for the analysis itself can be accessed here: File:OrderedLocusNames InvestigationBK1029.xlsx.

Vc-Std_BK_20151027.gdb Use in GenMAPP

Note: The ability of Vc-Std_BK_20151027.gdb to be processed was tested in GenMAPP Version 2.1. Before conducting any of the testing below, Vc-Std_BK_20151027.gdb was selected as the "Gene Database" withing GenMAPP.

Putting a gene on the MAPP using the GeneFinder window

To figure out how to perform this function, I referenced the "Gene Finder" tab within the "GenMAPP Help" window.

  • The following sample IDs were taken from different gene ID systems and tested:
    • UniProt- Q9KQM0
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • UniProt Gene Backpage.png
    • RefSeq- NP_230721
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • Refseq Gene Backpage.png
    • OrderedLocusNames- VC_0274
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • Orderedlocusnames Gene Backpage.png
    • GeneID- 2612551
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • GeneID Gene Backpage.png
    • EnsemblBacteria- AAF96503
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • EnsemblBacteria Gene Backpage.png
  • Screenshot of all of the sample ID's on a MAPP:
    • All genes on MAPP.png

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

Creating an Expression Dataset in the Expression Dataset Manager

The file File:Merrell Compiled Raw Data Vibrio BK 20151015- Tab Delimited.txt from my Week 8 assignment page was used to create a new Expression Dataset.

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
  • 5100 of the 5221 gene IDs from the original text file were imported into the Expression Dataset.
  • There were 121 exceptions.
  • Within the EX.txt file, the error message "Gene not found in OrderedLocusNames or any related system." was listed 121 times.
  • Two examples of a genes that encountered the above error message were the genes with the following OrderedLocusNames (in the EX.txt file): VC2209 and VCA1031.
    • When both of these IDs were searched in the Microsoft Access "OrderedLocusNames" table, no results were retrieved.
    • When both of these IDs were searched in the XMLPipeDB Match results file created earlier in this protocol, no results were retrieved.
    • Consequentially, it does not appear that the genes that retrieved error messages were present in the original UniProt XML file.
      • This finding would be explain the observation that individuals using the 2010 V. cholerae gene database last week, such as my Week 8 partner Veronica, also retrieved 121 errors during this step.

Note: 121 errors were encountered during the creation of this new Expression Dataset. A screenshot of the error message is shown here:

  • GenMAPP Errors 20151102.png

Coloring a MAPP with expression data

  • I customized the new Expression Dataset by creating a Color Set. The new Color Set was entitled "LogFoldChange".
    • First, I created a criterion for this color set to label genes that demonstrated a significant increase in their expression.
      • I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
      • I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
      • I selected the color for this criterion as red using the color box.
      • I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • Second, I created a criterion for this color set to label genes that demonstrated a significant decrease in their expression.
      • I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
      • I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
      • I selected the color for this criterion as green using the color box.
      • I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
  • Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:

  • Colorset.png

Note: No errors were encountered in the creation of the Color Set.

Running MAPPFinder

  • I launched the MAPPFinder program from within GenMAPP and ensured that the Vc-Std_BK_20151027.gdb Gene Database was still loaded into GenMAPP.
  • I clicked on the button "Calculate New Results" followed by "Find File", at which point I the .gex file updated during the creation of the "LogFoldChange" color set.
  • I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange Color Set to the data.
  • I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
    • This analysis took several minutes to complete.
  • MAPPFinder analysis results
    • Initial result in MAPPFinder Browser:
      • MAPPFinder Result BK20151103.png
    • I clicked on the menu item "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
    • MAPPFinder RankedList BK20151103.png
      • As an aside, it is interesting to note that few of the top 10 gene ontology results from this MAPPFinder analysis were conserved from my analysis using the 2009 Vibrio cholerae gene database. This suggests the high degree of change and progress made in genomics in the last six years.

Note: The MAPPFinder analysis took approximately 16 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the gene database created this week.

Compare Gene Database to Outside Resource

I will get back to this question during the Group Final Project, as was allowed:

This section applies to the Group Final Project and does not need to be completed for the 
Week 9 assignment. —  Kdahlquist (talk) 15:46, 2 November 2015 (PST)

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries