Anuvarsh Week 9
***All files can be accessed here***
All procedures below were modified from the following pages:
- Running GenMAPP Builder
- How Do I Count Thee? Let Me Count The Ways
- Using Microsoft Excel to Compare ID Lists
- Gene Database Testing Report Sample
Pre-requisites
This procedure was done in a Windows environment. While it is possible to run GenMAPP Builder under the Mac or Linux OS, the end product, a GenMAPP-compatible Gene Database (.gdb), can only be used with the GenMAPP program, which can only be run on Windows. This set of software has already been installed on the computers in the Seaver 120 computer lab. Prior to proceeding through this procedure, my machine had the following tools and programs:
- 7-zip to extract any zipped files.
- PostgreSQL on Windows (http://www.enterprisedb.com/products-services-training/pgdownload)
- This procedure was written using PostgreSQL 9.4.x.
- GenMAPP Builder (https://sourceforge.net/projects/xmlpipedb/files/)
- Java JDK 1.8 64-bit
- Download page
- File to download is: jdk-8u65-windows-x64.exe
- GenMAPP 2 can be downloaded here. The file to download is "GenMAPPv2Setup.exe".
- XMLPipeDB match utility (https://sourceforge.net/projects/xmlpipedb/files/) for counting IDs in XML files
- Microsoft Access or any other tool that can read .mdb files
Export Information
Version of GenMAPP Builder: gmbuilder-3.0.0-build-5
Computer on which export was run: back row, second from door.
Postgres Database name: vcholera-20151027-gmb3build5-AV
UniProt XML filename (give filename and upload and link to compressed file):
- UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_10
- UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
- Clicked on the Download link and, without changing any default settings, clicked Download.
- Time taken to import: 3.06 minutes
- Note:
GO OBO-XML filename (give filename and upload and link to compressed file):
- GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [GO Download page] has been unzipped): 10/27/2015, 2:24am
- GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
- Clicked on obo-xml.gz link in far right column of second row of XML format table.
- Time taken to import: 7.58 minutes
- Time taken to process: 4.37 minutes
- Note:
GOA filename (give filename and upload and link to compressed file):
- GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): 10/13/15
- GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
- Opened page and saved as .goa file.
- Time taken to import: 0.06 minutes
- Note: Import was almost immediate.
Name of .gdb file (give filename and upload and link to compressed file):
- Time taken to export: 1 hour, 17 minutes, 47 seconds (from the start and end times below)
- Start time: 10/27/2015, 3:52:08PM PDT
- End time: 10/27/2015, 5:09:55PM PDT
Note:
TallyEngine
- Ran the TallyEngine in GenMAPP Builder and recorded the number of records for UniProt and GO in the XML data and in the Postgres databases.
- After starting PostgreSQL and making sure my database was running, I ran GenMAPP Builder and connected it to the database.
- After performing an import, I chose Run XML and Database Tallies for UniProt and GO and selected the UniProt and GO files that I imported.
- My Tally results are in the screenshot below:
- The Tally results indicated counts for unique genes (labelled as Ordered Locus)
- XML: 3831 unique genes
- Database: 3831 unique genes
Using XMLPipeDB match to Validate the XML Results from the TallyEngine
- To check the number of genes via XMLPipeDB match, I ran match against the file containing all of the UniProt genes, using a pattern for the ordered locus names.
- Command Used:
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml
- XMLPipeDB Match returned 2738 unique genes.
- This number is different from the numbers the TallyEngine returned.
- Upon later analysis (documented later in this lab notebook), I realized that the discrepancy between these counts is mostly due to the several different formats in which the gene IDs appear in the UniProt file. Our first command only counted IDs in the format "VC_####", which disregards the other legitimate format, "VC_A####".
- We modified our command to take into consideration both 'VC_####' and 'VC_A####' as legitimate gene names. This resulted in xmlpipedb-match indicating 3831 unique genes.
- Command Used:
java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml > ordered-locus_match-results_AV.txt
- These new results match the results of the Tally Engine.
- This result was sent to the file ordered-locus_match-results_AV.txt
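The effect of widening the pattern can be sketched in Python. This is only an illustration, assuming that match reports the number of distinct strings matched by the pattern; the function name count_unique_matches and the sample text are made up and are not the real UniProt XML.

```python
import re

def count_unique_matches(pattern: str, text: str) -> int:
    """Count the distinct substrings of text that match pattern."""
    return len(set(re.findall(pattern, text)))

# Made-up snippet standing in for the UniProt XML file (not real data).
sample = """
<name type="ordered locus">VC_0001</name>
<name type="ordered locus">VC_0002</name>
<name type="ordered locus">VC_A0001</name>
<name type="ordered locus">VC_0001</name>
"""

# The first pattern skips IDs with an "A" after the underscore.
narrow = count_unique_matches(r"VC_[0-9][0-9][0-9][0-9]", sample)
# The optional "A?" picks up both ID formats.
broad = count_unique_matches(r"VC_A?[0-9][0-9][0-9][0-9]", sample)

print(narrow, broad)
```

In this toy input the repeated VC_0001 is counted once, and only the second pattern counts VC_A0001.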
Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
- On PostgreSQL, I searched through the database using a variation of the select count(*) query.
- Command used:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
- PostgreSQL returned 2737 unique genes.
- These results are different for the same reason as the xmlpipedb-match results. This search only counted one format of gene names.
- When we reformatted the query to take into account both 'VC_####' and 'VC_A####', PostgreSQL indicated 3831 unique genes.
- Command used:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
- This new count is the same as the xml and database results from the TallyEngine.
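The same comparison can be tried without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in, with a user-defined REGEXP function playing the role of the PostgreSQL ~ operator (an unanchored regular expression match). The table and column names come from the queries above; the rows, the count_ordered_locus helper, and the use of SQLite are assumptions for illustration only.

```python
import re
import sqlite3

def regexp(pattern, value):
    # Unanchored search, similar in spirit to the PostgreSQL ~ operator.
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")

# Made-up rows, not taken from the real database.
rows = [
    ("ordered locus", "VC_0001"),
    ("ordered locus", "VC_0002"),
    ("ordered locus", "VC_A0001"),
    ("primary", "dnaA"),
]
conn.executemany("INSERT INTO genenametype VALUES (?, ?)", rows)

def count_ordered_locus(pattern):
    """Count ordered locus rows whose value matches the pattern."""
    query = (
        "SELECT count(*) FROM genenametype "
        "WHERE type = 'ordered locus' AND value REGEXP ?"
    )
    return conn.execute(query, (pattern,)).fetchone()[0]

narrow_count = count_ordered_locus("VC_[0-9][0-9][0-9][0-9]")
broad_count = count_ordered_locus("VC_A?[0-9][0-9][0-9][0-9]")
print(narrow_count, broad_count)
```

Note that count(*) counts rows rather than distinct values, so it only equals the number of unique genes when each ID appears in one row.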
OriginalRowCounts Comparison
Within the .gdb file, I looked at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. I compared the tables and records with a benchmark .gdb file. The benchmark .gdb file that I used was the 2010 V. cholerae database from the Week 8 DNA Microarray Analysis Journal.
Benchmark .gdb file: Vc-Std_External_20101022
Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:
- New gdb:
- 2010 gdb:
- Note:
- Both files indicated the same number of UniProt-OrderedLocusNames.
- The new file contains 10 more tables than the previous version.
- Most counts in each Table in the new gdb are much higher than their corresponding counts in the 2010 gdb.
Visual Inspection
Perform visual inspection of individual tables to see if there are any problems.
- Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- The Systems table does not have gene ID information. Instead, it has a column titled "System Name" that seems to refer to all of the systems this database could be associated with. It does not have dates next to all of the databases mentioned.
- Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
- The UniProt table seems to have the correct form for that type of ID. All of the IDs are in the format (where L = letter and # = any single digit): L#LL[L or #]#
- The RefSeq table seems to have the correct form for that type of ID. All of the IDs are in the format (where # = any single digit): [N or W]P_[###### or #########]
- The OrderedLocusNames table has IDs in one of the following formats (where # = any single digit number):
- VC_####
- VC_A####
- VC####
- VCA####
- This variation between underscore and no underscore is intentional: the programs store each ID both with and without the underscore, which doubles the number of entries in the table. We only want to consider the entries with the underscore, as the others are duplicates. The variation between A and no A reflects two different ID formats, and both were considered when counting the number of unique gene IDs.
- The OrderedLocusNames table contains 7664 IDs. Dividing that number by two to account for the duplicates described above gives 3832 unique gene IDs, which is one more than the 3831 found by all of the other counting methods. This means that there are 2 entries in the OrderedLocusNames table (one gene, with and without the underscore) that were not accounted for by any of the other methods.
- This discrepancy is due to one gene ID, "VC_A0360.1", which is in a different format from all the others. Both forms of this gene ID (with and without the underscore) exist in the table, which accounts for the 2 extra entries.
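The duplicate arithmetic and the search for oddly formatted IDs can be sketched as follows. The counts 7664 and 3831 come from the notes above; the sample ID list, the STANDARD pattern, and the summarize helper are hypothetical and only illustrate the idea of collapsing the underscore and no-underscore forms of each ID.

```python
import re
from collections import defaultdict

# The four formats listed above: VC_####, VC_A####, VC####, VCA####.
STANDARD = re.compile(r"VC_?A?[0-9]{4}")

def summarize(ids):
    """Return (number of genes after merging underscore variants,
    sorted list of IDs that do not fit the standard formats)."""
    groups = defaultdict(set)
    for locus_id in ids:
        groups[locus_id.replace("_", "")].add(locus_id)
    nonstandard = sorted(i for i in ids if not STANDARD.fullmatch(i))
    return len(groups), nonstandard

# Made-up sample of table entries; only VC_A0360.1 is taken from the notes.
sample_ids = [
    "VC_0001", "VC0001",
    "VC_A0001", "VCA0001",
    "VC_A0360.1", "VCA0360.1",
]
unique_genes, odd_ids = summarize(sample_ids)
print(unique_genes, odd_ids)

# Arithmetic from the notes: 7664 table entries, 3831 genes from the tallies.
table_entries = 7664
tally_genes = 3831
genes_in_table = table_entries // 2          # each gene is stored twice
extra_genes = genes_in_table - tally_genes   # genes the tallies did not count
print(genes_in_table, extra_genes)
```

Run on the full OrderedLocusNames table, a check like this would be expected to flag only the two forms of the one unusual ID, if the explanation above is correct.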
.gdb Use in GenMAPP
Putting a gene on the MAPP using the GeneFinder window
- Tried an ID from each of the gene ID systems. Opened the Backpage and saw that all of the cross-referenced IDs that are supposed to be there are there.
Creating an Expression Dataset in the Expression Dataset Manager
- How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
- Uploading the Merrell et al. .txt file modified for GenMAPP from last week using the new database resulted in 121 errors. Surprisingly, this is the same number of errors that appeared when uploading this same data with the 2010 version of the V. cholerae database.
- 5,221 IDs were imported with the new database. This is the same as the number of IDs imported with the 2010 database.
- All of the Errors in the .EX.txt file said: "Gene not found in OrderedLocusNames or any related system."
- I opened the .EX.txt file and searched for the genes that were documented with an error in the "Uniprot" table in the database file via Microsoft Access. I spot checked 6 genes ("VC2209", "VCA1031", "VCA0745", "VC1476", "VCA0534", "VCA0276") and noticed that none of these genes appeared in the "Uniprot" table in the database. Because of this, I can conclude that these genes were not present in the UniProt XML.
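The spot check above was done by hand in Microsoft Access. A scripted version of the same idea might look like the sketch below, which compares the IDs from the exceptions file against a list of IDs known to the database. The six spot-checked IDs are from the notes; the known_sample list, the function names, and the choice to ignore underscores and letter case when comparing are assumptions.

```python
def normalize(locus_id: str) -> str:
    """Strip whitespace and underscores so VC_0001 and VC0001 compare equal."""
    return locus_id.strip().replace("_", "").upper()

def missing_ids(error_ids, known_ids):
    """Return the error IDs that have no counterpart among the known IDs."""
    known = {normalize(k) for k in known_ids}
    return [e.strip() for e in error_ids if normalize(e) not in known]

# IDs spot checked in the notes above.
spot_check = ["VC2209", "VCA1031", "VCA0745", "VC1476", "VCA0534", "VCA0276"]

# Hypothetical stand-in for the IDs exported from the database table.
known_sample = ["VC_0001", "VC_0002", "VC_A0001"]

not_found = missing_ids(spot_check, known_sample)
print(not_found)
```

With the real table contents in place of known_sample, an empty result would mean every error ID is present in the database, while a full list would match the conclusion reached by hand.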
Coloring a MAPP with expression data
Note: When I double click on one of the GO terms to open the data in GenMAPP, I get the following error:
I am not sure what is causing this error. My guess is that it is due to a problem with the format of the data in my database. Another explanation is that the new format of this data is not compatible with MAPPFinder or GenMAPP. I am unable to determine which of these explanations is more likely at this point.
Running MAPPFinder
The following are some screenshots from running MAPPFinder using my database.
Note: Once again, the top-ranked GO terms from the 2015 database differ in several ways from those found with the 2010 and 2009 versions of the V. cholerae database. These differences, as explained in the Week 8 journal, are most likely due to new information made available through UniProt, GO, and GOA.
Other Links
User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS