Difference between revisions of "Bklein7 Week 9"
m (edited image format) |
(added info regarding the comparison) |
||
Line 187: | Line 187: | ||
*The "OrderedLocusNames" listing was also present here. | *The "OrderedLocusNames" listing was also present here. | ||
*The table reported the same number of distinct Ordered Locus Names: 7664. | *The table reported the same number of distinct Ordered Locus Names: 7664. | ||
− | *Although the above value was the same, row counts for various other tables such as "GeneOntologyTree" dramatically | + | *Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases. |
+ | *Overall, 10 more tables were listed in the 2015 database. | ||
OriginalRowCounts table from the benchmark and new gdb: | OriginalRowCounts table from the benchmark and new gdb: |
Revision as of 06:52, 3 November 2015
Contents
- 1 Screenshots
- 2 Pre-requisites
- 3 Gene Database Creation
- 4 Gene Database Testing Report
- 4.1 Export Information
- 4.2 TallyEngine
- 4.3 Using XMLPipeDB match to Validate the XML Results from the TallyEngine
- 4.4 Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
- 4.5 OriginalRowCounts Comparison
- 4.6 Visual Inspection
- 4.7 .gdb Use in GenMAPP
- 4.8 Putting a gene on the MAPP using the GeneFinder window
- 4.9 Creating an Expression Dataset in the Expression Dataset Manager
- 4.10 Coloring a MAPP with expression data
- 4.11 Running MAPPFinder
- 4.12 Compare Gene Database to Outside Resource
- 5 Links
Screenshots
Pre-requisites
The following set of software was used in the creation and testing of the 'Vibrio cholerae gene database:
- 7-ziptool that for unpacking .gz and .zip files
- PostgreSQL on Windows (version 9.4.x)
- GenMAPP Builder
- Java JDK 1.8 64-bit
- GenMAPP 2
- XMLPipeDB match utility for counting IDs in XML files
- Microsoft Access for reading .mdb files
Gene Database Creation
Downloading Data Source Files and GenMAPP Builder
- I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio choleraealong with the GenMAPP Builder program.
- All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
- Files that required extraction were unzipped using 7-zip.
- Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.
UniProt XML
- I went to the UniProt Complete Proteomes page.
- From there, I navigated to the complete proteome download page for Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
- I clicked on the "Download" button at the top of the page above and selected the following options:
- "Download all"
- "XML" from the "Format" drop-down menu
- "Compressed" format
- I extracted the file using 7-zip.
GOA
- UniProt-GOA files can be downloaded from the UniProt-GOA ftp site.
- Within the above site, I navigated to the GO annotations for Vibrio cholerae.
- This text file was automatically opened by my browser. Therefore, I had to manually download the file.
GO OBO-XML
- I downloaded the GO OBO-XML formatted file from the Gene Ontology legacy download page.
- I extracted the file using 7-zip.
Downloaded GenMAPP Builder
- I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
- I extracted the GenMAPP Builder folder using 7-zip.
Creating the New Database in PostgreSQL
- I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
- On this server, I created a new database: V.cholerae_20151027_gmb3build5.
- I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
- I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
- I clicked on the Execute Query icon to run this command.
- In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.
Configuring GenMAPP Builder to Connect to the PostgreSQL Database
- To begin, I launched gmbuilder.bat.
- I selected the "Configure Database" option and entered the following information into the fields below:
- Host or address: localhost
- Port number: 5432
- Database name: V.cholerae_20151027_gmb3build5
- Username: postgres
- Password: Welcome1
Importing Data into the PostgreSQL Database
- The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
- Selected File > Import UniProt XML...
- Selected File > Import GO OBO-XML...
- Clicked OK to the message asking to process the GO data.
- Selected File > Import GOA...
Exporting a GenMAPP Gene Database (.gdb)
- I selected File > Export to GenMAPP Gene Database... to begin the export process.
- I typed my name in the owner field (Brandon Klein).
- I selectedVibrio cholerae as the gene database species and then clicked Next.
- The database was saved as Vc-Std_BK_20151027.
- I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
- Finally, I clicked the "Next" button to begin the export process.
Gene Database Testing Report
Export Information
Version of GenMAPP Builder: Version 3.0.0 Build 5
Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room
Postgres Database name: V.cholerae_20151027_gmb3build5
UniProt XML filename: uniprot-organism-243277_BK_20151027.xml
- UniProt XML version: UniProt release 2015_10 (version information can be accessed on the UniProt News Page)
- UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
- Time taken to import: 2.92 minutes
- Note: The import time was similar to my homework partner's, and no interruptions occurred during this process.
GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml
- GO OBO-XML version: File last modified on Tuesday, October 27, 2015, 2:24:02 AM (accessed from the file properties window)
- GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
- Time taken to import: 6.88 minutes
- Time taken to process: 4.49 minutes
- Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.
GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa
- GOA version: File last modified on Wednesday, October 14, 2015 (accessed on the UniProt-GOA news page)
- GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
- Time taken to import: 0.06 minutes
- Note: This import process was equally quick for my homework partner and presented no complications.
Name of .gdb file: Vc-Std_BK_20151027.gdb
- Time taken to export: 1 hour, 27 minutes
- Start time: 4:44 PM (restarted export after class)
- End time: 6:11 PM
- Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.
TallyEngine
- I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
Using XMLPipeDB match to Validate the XML Results from the TallyEngine
On the command line ('cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" > OrderedLocusNames.txt
The results of this match are pictured below.
- The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
- I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
- Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.
The second match I ran accounted for the issues above. The command sequence used is as follows:
java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" > OrderedLocusNames.txt
The results of this match are pictured below.
- With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
- Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.
Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
The results of this query are pictured below.
- The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
- As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.
I ran a second SQL query to include the pattern VC_A#### in the count output:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
The results of this query are pictured below.
- With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
- These findings further validated the Tally Engine results.
OriginalRowCounts Comparison
I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.
- A listing named "OrderedLocusNames" was present.
- The table reported 7664 distinct Ordered Locus Names.
- This value is equal to (3831*2)+1
Benchmark .gdb file: File:Vc-Std External 20101022.gdb
As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.
- The "OrderedLocusNames" listing was also present here.
- The table reported the same number of distinct Ordered Locus Names: 7664.
- Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
- Overall, 10 more tables were listed in the 2015 database.
OriginalRowCounts table from the benchmark and new gdb:
Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.
Visual Inspection
I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.
- First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
- Of the 35 gene ID systems, only 11 have dates listed in the date column.
- Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
- The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
- Of the 35 gene ID systems, only 11 have dates listed in the date column.
- Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
- UniProt
- The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website here.
- All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above):
- Thus, there are no issues with these 3789 gene IDs.
- RefSeq
- The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found here.
- "NP_" and "YP_" Prefixes
- Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
- "WP_" Prefixes
- Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
- Overall, every entry in the ID column was an expected value.
- OrderedLocusNames
- 4 different patterns were present for entries in the ID column:
- VC_####
- VC_A####
- VC####
- VCA####
- 4 different patterns were present for entries in the ID column:
- UniProt
Note:
.gdb Use in GenMAPP
Note:
Putting a gene on the MAPP using the GeneFinder window
- Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.
Note:
Creating an Expression Dataset in the Expression Dataset Manager
- How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
Note:
Coloring a MAPP with expression data
Note:
Running MAPPFinder
Note:
Compare Gene Database to Outside Resource
I will get back to this question during the Group Final Project, as was allowed:
This section applies to the Group Final Project and does not need to be completed for the Week 9 assignment. — Kdahlquist (talk) 15:46, 2 November 2015 (PST)
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
Links
- User Page: Brandon Klein
- Team Page: The Class Whoopers
Assignments Pages
- Week 1 Assignment
- Week 2 Assignment
- Week 3 Assignment
- Week 4 Assignment
- Week 5 Assignment
- Week 6 Assignment
- Week 7 Assignment
- Week 8 Assignment
- Week 9 Assignment
- Week 10 Assignment
- Week 11 Assignment
- Week 12 Assignment
- No Week 13 Assignment
- Week 14 Assignment
- Week 15 Assignment
Individual Journal Entries
- Week 1 Individual Journal
- Week 2 Individual Journal
- Week 3 Individual Journal
- Week 4 Individual Journal
- Week 5 Individual Journal
- Week 6 Individual Journal
- Week 7 Individual Journal
- Week 8 Individual Journal
- Week 9 Individual Journal
- Week 10 Individual Journal
- Week 11 Individual Journal
- Week 12 Individual Journal
- No Week 13 Journal
- Week 14 Individual Journal
- Week 15 Individual Journal
- Week 1 Class Journal
- Week 2 Class Journal
- Week 3 Class Journal
- Week 4 Class Journal
- Week 5 Class Journal
- Week 6 Class Journal
- Week 7 Class Journal
- Week 8 Class Journal
- Week 9 Class Journal
- Week 10 Team Journal
- Week 11 Team Journal
- Week 12 Team Journal
- No Week 13 Journal
- Week 14 Team Journal
- Week 15 Team Journal