Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
Jump to: navigation, search
(updated originalrowcounts comparison)
(further updated originalrowcounts comparison)
Line 177: Line 177:
 
===OriginalRowCounts Comparison===
 
===OriginalRowCounts Comparison===
  
I opened the gene database file ''Vc-Std_BK_20151027.gdb'' in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the ''OriginalRowCounts'' table.
+
I opened the gene database file [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]] in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the ''OriginalRowCounts'' table.
 
*A listing named "OrderedLocusNames" was present.
 
*A listing named "OrderedLocusNames" was present.
 
*The table reported 7664 distinct Ordered Locus Names.
 
*The table reported 7664 distinct Ordered Locus Names.
Line 189: Line 189:
 
*Although the above value was the same, row counts for various other tables such as "GeneOntologyTree" dramatically shifted between the 2010 and 2015 databases.
 
*Although the above value was the same, row counts for various other tables such as "GeneOntologyTree" dramatically shifted between the 2010 and 2015 databases.
  
OriginalRowCounts table from the benchmark and new gdb: [[File:OriginalRowCounts ComparisonImage.PNG]]
+
OriginalRowCounts table from the benchmark and new gdb:  
 +
[[File:OriginalRowCounts ComparisonImage.PNG]]
  
 
Note: The ''OriginalRowCounts'' tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the ''OriginalRowCounts'' tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.
 
Note: The ''OriginalRowCounts'' tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the ''OriginalRowCounts'' tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Revision as of 05:52, 3 November 2015

Screenshots

TallyresultsBK1029.png PostgresIDsBK1029.png PostgresIDsBK1029 UPDATED.png XmlpipedbmatchoutputBK1029.png XmlpipedbmatchoutputBK1029 Updated.png

Pre-requisites

The following set of software was used in the creation and testing of the 'Vibrio cholerae gene database:

  1. 7-ziptool that for unpacking .gz and .zip files
  2. PostgreSQL on Windows (version 9.4.x)
  3. GenMAPP Builder
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2
  6. XMLPipeDB match utility for counting IDs in XML files
  7. Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

  • I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio choleraealong with the GenMAPP Builder program.
    • All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
    • Files that required extraction were unzipped using 7-zip.
    • Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

GOA

GO OBO-XML

Downloaded GenMAPP Builder

  1. I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
  2. I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

  • I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
    • On this server, I created a new database: V.cholerae_20151027_gmb3build5.
    • I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
      • I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
      • I clicked on the Execute Query icon to run this command.
      • In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

  • To begin, I launched gmbuilder.bat.
  • I selected the "Configure Database" option and entered the following information into the fields below:
    • Host or address: localhost
    • Port number: 5432
    • Database name: V.cholerae_20151027_gmb3build5
    • Username: postgres
    • Password: Welcome1

Importing Data into the PostgreSQL Database

  • The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
    • Selected File > Import UniProt XML...
    • Selected File > Import GO OBO-XML...
    • Clicked OK to the message asking to process the GO data.
    • Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

  • I selected File > Export to GenMAPP Gene Database... to begin the export process.
  • I typed my name in the owner field (Brandon Klein).
  • I selectedVibrio cholerae as the gene database species and then clicked Next.
  • The database was saved as Vc-Std_BK_20151027.
  • I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

  • GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
  • Time taken to import: 6.88 minutes
  • Time taken to process: 4.49 minutes
    • Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

Name of .gdb file: Vc-Std_BK_20151027.gdb

  • Time taken to export: 1 hour, 27 minutes
    • Start time: 4:44 PM (restarted export after class)
    • End time: 6:11 PM
    • Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

  • I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
    • I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
    • TallyresultsBK1029.png
      • XML and database counts were consistent among all assessed data types.
      • Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line ('cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029.png

  • The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
  • I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
    • Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029 Updated.png

  • With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
    • Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029.png

  • The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
    • As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029 UPDATED.png

  • With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
    • These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.

  • A listing named "OrderedLocusNames" was present.
  • The table reported 7664 distinct Ordered Locus Names.
    • This value is equal to (3831*2)+1

Benchmark .gdb file: File:Vc-Std External 20101022.gdb

As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.

  • The "OrderedLocusNames" listing was also present here.
  • The table reported the same number of distinct Ordered Locus Names: 7664.
  • Although the above value was the same, row counts for various other tables such as "GeneOntologyTree" dramatically shifted between the 2010 and 2015 databases.

OriginalRowCounts table from the benchmark and new gdb: OriginalRowCounts ComparisonImage.PNG

Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?

Note:

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

  • Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:


Compare Gene Database to Outside Resource

I will get back to this question during the Group Final Project, as was allowed:

This section applies to the Group Final Project and does not need to be completed for the 
Week 9 assignment. —  Kdahlquist (talk) 15:46, 2 November 2015 (PST)

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries