Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
Jump to: navigation, search
(edited xmlpipedb match protocol)
(added data to the sql queries validation section and deleted unnecessary sections)
Line 155: Line 155:
 
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
 
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
  
For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]
+
The ''gene/name'' tags in the XML file land in the ''genenametype'' table, which contains ''ordered locus'' values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:
 
+
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query.  This requires some knowledge of which table received what data.  Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table.  A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
+
  
 
  select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
 
  select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
  
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
+
The results of this query are pictured below.
 +
[[File:PostgresIDsBK1029.png]]
 +
 
 +
*The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
 +
**As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.
 +
 
 +
I ran a second SQL query to include the pattern VC_A#### in the count output:
  
[[Image:Pgadminiii-query.png]]
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
  
Are your results the same as reported by the TallyEngine?  Why or why not?
+
The results of this query are pictured below.
 +
[[File:PostgresIDsBK1029 UPDATED.png]]
 +
 
 +
*With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
 +
**These findings further validated the Tally Engine results.
  
 
===OriginalRowCounts Comparison===
 
===OriginalRowCounts Comparison===
  
Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
+
I opened the gene database file ''Vc-Std_BK_20151027.gdb'' in Microsoft Access to asses the "OriginalRowCounts" present within table. to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
  
Benchmark .gdb file:
+
Benchmark .gdb file: ''Vc-Std_BK_20151027.gdb'' (?)
  
Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:
+
OriginalRowCounts table from the benchmark and new gdb:
  
 
Note:
 
Note:
Line 216: Line 224:
  
 
Note:
 
Note:
 
==Checking the Quality of your Exported Gene Database==
 
 
* It is a good idea to check the quality of your exported Gene Database to make sure that all of the data from the XML files made it into the PostgreSQL database and was then exported to the GenMAPP Gene Database.  We have created a [[Gene Database Testing Report Sample]] to help guide you through this process.
 
== Tally Engine ==
 
 
The first tool, called the Tally Engine, can be used for verifying that certain data from the XML file transferred consistently into the PostgreSQL database upon import.  The Tally Engine can be found in GenMAPP Builder itself.
 
 
# Run PostgreSQL (via pgAdmin III on Windows) and make sure that your database is up and running.
 
# Run GenMAPP Builder and make sure that it is connected to the database (via ''Configure Database...'').
 
# After performing an import, choose ''Run XML and Database Tallies for UniProt and GO...''.
 
# Choose the UniProt and GO files that you imported.
 
# You should see a table for selected data items, and how many of each were found.
 
 
[[Image:Tally-results.png]]
 
 
Under the hood, the Tally Engine bases its XML counts on certain XML tags, and bases its database counts on SQL queries using ''count''.  This tool is thus primarily useful for making sure that the “raw” import worked without any errors or glitches.
 
 
My Tally Results:
 
 
[[File:TallyresultsBK1029.png]]
 
 
== XMLPipeDB Match ==
 
 
XMLPipeDB Match is useful for counting data in files.  Thus, in our context, you would use XMLPipeDB Match to tally stuff in XML files, with greater flexibility than with Tally Engine.
 
 
You will have to use XMLPipeDB Match from the command line.  In addition, you can use this in any platform (as you have seen).  Download the application from the [http://sourceforge.net/projects/xmlpipedb XMLPipeDB SourceForge site] and take note of the location of the ''xmlpipedb-match-1.1.1.jar''.  Then, on the command line (''Terminal'' for Linux and Mac OS X, ''cmd'' on Windows), ''cd'' to the folder containing the XML file that you would like to check.  Use XMLPipeDB Match as follows, with the parts in parentheses varying depending on your specific setup, desired pattern, and file being scanned:
 
 
java -jar (location-of-jar) "(pattern)" < (XML file)
 
 
On a Windows machine, with XMLPipeDB Match and a ''Vibrio cholerae'' XML file located on the Desktop, scanning for IDs of the form '''VC_####''', where '''#''' represents a digit from 0 to 9, one would type, after ''cd''-ing to the Desktop:
 
 
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-taxonomy%3A243277.xml
 
 
As you have seen before, this will give you a list of unique matches, with a total number at the bottom.
 
 
The trick with XMLPipeDB Match is to use the patterns well: with the database project, you will mainly be matching IDs.  A desired count is an XMLPipeDB Match result whose matched ID pattern corresponds to the number of IDs found by the Tally Engine.
 
 
== SQL ==
 
 
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query.  This requires some knowledge of which table received what data.  Here&rsquo;s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table.  A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
 
 
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
 
 
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
 
 
[[Image:pgadminiii-query.png]]
 
 
== Microsoft Access ==
 
 
For the GenMAPP Gene Database, you can open the ''.gdb'' in Microsoft Access and navigate its tables to find counts for various IDs.  Opening the table, noting its size, and doing some sorting may help.  You can also look at the ''OriginalRowCounts'' table for a summary of totals.
 
 
Again, the ideal situation is a correspondence in these numbers with what you found in XML and the relational database.
 
 
== Back to the Command Line ==
 
 
Amidst all this, you can still use ''grep'' and ''wc'' on the command line for some basic counting.  Just remember that these tools work on a line-by-line basis; useful in some cases, but not useful in others.
 
 
You can use ''grep'' and ''wc'' with the various files on the ''my.cs.lmu.edu'' server by using the ''curl -O'' command shown in the [[Week 6]] assignment. Upload your data files to the wiki, place media links to them on your wiki page, then mouse over those live links to capture their URL ([[Media:Movie.sql.txt|like this—look at the source to see the wiki markup]]), then use ''curl -O (whatever-the-url-is)'' while ''ssh''-ed to ''my.cs.lmu.edu'' to bring that file into the server.
 
 
* If the file is a ''.zip'' file, you can use ''unzip'' at the command line to unzip it.
 
* If the file is a ''.gz'' file, you can use ''gunzip'' at the command line to uncompress that one.
 
 
<!-- Future content: advanced techniques for diff'ing grep/match results against psql results, duly processed by sed. -->
 
  
 
==Links==
 
==Links==

Revision as of 08:59, 2 November 2015

Screenshots

TallyresultsBK1029.png PostgresIDsBK1029.png PostgresIDsBK1029 UPDATED.png XmlpipedbmatchoutputBK1029.png XmlpipedbmatchoutputBK1029 Updated.png

Pre-requisites

The following set of software was used in the creation and testing of the 'Vibrio cholerae gene database:

  1. 7-ziptool that for unpacking .gz and .zip files
  2. PostgreSQL on Windows (version 9.4.x)
  3. GenMAPP Builder
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2
  6. XMLPipeDB match utility for counting IDs in XML files
  7. Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

  • I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio choleraealong with the GenMAPP Builder program.
    • All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
    • Files that required extraction were unzipped using 7-zip.
    • Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

GOA

GO OBO-XML

Downloaded GenMAPP Builder

  1. I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
  2. I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

  • I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
    • On this server, I created a new database: V.cholerae_20151027_gmb3build5.
    • I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
      • I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
      • I clicked on the Execute Query icon to run this command.
      • In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

  • To begin, I launched gmbuilder.bat.
  • I selected the "Configure Database" option and entered the following information into the fields below:
    • Host or address: localhost
    • Port number: 5432
    • Database name: V.cholerae_20151027_gmb3build5
    • Username: postgres
    • Password: Welcome1

Importing Data into the PostgreSQL Database

  • The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
    • Selected File > Import UniProt XML...
    • Selected File > Import GO OBO-XML...
    • Clicked OK to the message asking to process the GO data.
    • Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

  • I selected File > Export to GenMAPP Gene Database... to begin the export process.
  • I typed my name in the owner field (Brandon Klein).
  • I selectedVibrio cholerae as the gene database species and then clicked Next.
  • The database was saved as Vc-Std_BK_20151027.
  • I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

  • GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
  • Time taken to import: 6.88 minutes
  • Time taken to process: 4.49 minutes
    • Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

Name of .gdb file: Vc-Std_BK_20151027.gdb

  • Time taken to export: 1 hour, 27 minutes
    • Start time: 4:44 PM (restarted export after class)
    • End time: 6:11 PM
    • Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

  • I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
    • I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
    • TallyresultsBK1029.png
      • XML and database counts were consistent among all assessed data types.
      • Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line ('cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029.png

  • The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
  • I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
    • Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below. XmlpipedbmatchoutputBK1029 Updated.png

  • With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
    • Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029.png

  • The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
    • As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029 UPDATED.png

  • With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
    • These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to asses the "OriginalRowCounts" present within table. to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: Vc-Std_BK_20151027.gdb (?)

OriginalRowCounts table from the benchmark and new gdb:

Note:

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?

Note:

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

  • Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:


Compare Gene Database to Outside Resource

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Note:

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries