Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
Jump to: navigation, search
m (changed image formatting)
(added UniProt information to the visual inspection section)
Line 196: Line 196:
 
===Visual Inspection===
 
===Visual Inspection===
  
Perform visual inspection of individual tables to see if there are any problems.
+
I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.
  
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
+
* First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
+
**Of the 35 gene ID systems, only 11 have dates listed in the date column.
 +
*** Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
 +
*** The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
 +
* Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
 +
**UniProt
 +
***The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website [http://www.uniprot.org/help/accession_numbers here].
 +
***All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): [[File:UniProt Ascension Number info.PNG]]
 +
***Thus, there are no issues with these 3789 gene IDs.
 +
**
  
 
Note:
 
Note:

Revision as of 06:29, 3 November 2015

Screenshots

TallyresultsBK1029.png PostgresIDsBK1029.png PostgresIDsBK1029 UPDATED.png XmlpipedbmatchoutputBK1029.png XmlpipedbmatchoutputBK1029 Updated.png

Pre-requisites

The following set of software was used in the creation and testing of the 'Vibrio cholerae gene database:

  1. 7-ziptool that for unpacking .gz and .zip files
  2. PostgreSQL on Windows (version 9.4.x)
  3. GenMAPP Builder
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2
  6. XMLPipeDB match utility for counting IDs in XML files
  7. Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

  • I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio choleraealong with the GenMAPP Builder program.
    • All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
    • Files that required extraction were unzipped using 7-zip.
    • Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

GOA

GO OBO-XML

Downloaded GenMAPP Builder

  1. I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
  2. I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

  • I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
    • On this server, I created a new database: V.cholerae_20151027_gmb3build5.
    • I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
      • I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
      • I clicked on the Execute Query icon to run this command.
      • In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

  • To begin, I launched gmbuilder.bat.
  • I selected the "Configure Database" option and entered the following information into the fields below:
    • Host or address: localhost
    • Port number: 5432
    • Database name: V.cholerae_20151027_gmb3build5
    • Username: postgres
    • Password: Welcome1

Importing Data into the PostgreSQL Database

  • The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
    • Selected File > Import UniProt XML...
    • Selected File > Import GO OBO-XML...
    • Clicked OK to the message asking to process the GO data.
    • Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

  • I selected File > Export to GenMAPP Gene Database... to begin the export process.
  • I typed my name in the owner field (Brandon Klein).
  • I selectedVibrio cholerae as the gene database species and then clicked Next.
  • The database was saved as Vc-Std_BK_20151027.
  • I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

  • GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
  • Time taken to import: 6.88 minutes
  • Time taken to process: 4.49 minutes
    • Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

Name of .gdb file: Vc-Std_BK_20151027.gdb

  • Time taken to export: 1 hour, 27 minutes
    • Start time: 4:44 PM (restarted export after class)
    • End time: 6:11 PM
    • Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

  • I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
    • I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
    • TallyresultsBK1029.png
      • XML and database counts were consistent among all assessed data types.
      • Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line ('cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029.png

  • The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
  • I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
    • Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029 Updated.png

  • With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
    • Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029.png

  • The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
    • As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029 UPDATED.png

  • With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
    • These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.

  • A listing named "OrderedLocusNames" was present.
  • The table reported 7664 distinct Ordered Locus Names.
    • This value is equal to (3831*2)+1

Benchmark .gdb file: File:Vc-Std External 20101022.gdb

As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.

  • The "OrderedLocusNames" listing was also present here.
  • The table reported the same number of distinct Ordered Locus Names: 7664.
  • Although the above value was the same, row counts for various other tables such as "GeneOntologyTree" dramatically shifted between the 2010 and 2015 databases.

OriginalRowCounts table from the benchmark and new gdb:

  • OriginalRowCounts ComparisonImage.PNG

Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Visual Inspection

I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.

  • First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
    • Of the 35 gene ID systems, only 11 have dates listed in the date column.
      • Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
      • The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
  • Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
    • UniProt
      • The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website here.
      • All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): UniProt Ascension Number info.PNG
      • Thus, there are no issues with these 3789 gene IDs.

Note:

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

  • Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:


Compare Gene Database to Outside Resource

I will get back to this question during the Group Final Project, as was allowed:

This section applies to the Group Final Project and does not need to be completed for the 
Week 9 assignment. —  Kdahlquist (talk) 15:46, 2 November 2015 (PST)

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries