Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
(added screenshots)
(included times from notes & made formatting adjustments)
This tutorial will take you through all of the steps for running GenMAPP Builder for the first time.
 
 
 
Revision as of 23:25, 29 October 2015

Screenshots

TallyresultsBK1029.png PostgresIDsBK1029.png PostgresIDsBK1029 UPDATED.png XmlpipedbmatchoutputBK1029.png XmlpipedbmatchoutputBK1029 Updated.png

Pre-requisites

This tutorial assumes that you are working in a Windows environment. While it is possible to run GenMAPP Builder under the Mac or Linux OS, the end product, a GenMAPP-compatible Gene Database (.gdb), can only be used with the GenMAPP program, which can only be run on Windows. This set of software has already been installed on the computers in the Seaver 120 computer lab. If you want to perform this procedure on your own machine, you must set up your working environment with:

  1. Any tool that can unpack .gz and .zip files
    • We use 7-zip
    • Note: we have found that the native Windows utility cannot reliably unpack .gz files or .zip files containing .jar files.
  2. PostgreSQL on Windows (http://www.enterprisedb.com/products-services-training/pgdownload)
    • This tutorial was written using PostgreSQL 9.4.x.
  3. GenMAPP Builder (https://sourceforge.net/projects/xmlpipedb/files/)
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2 can be downloaded here. The file to download is "GenMAPPv2Setup.exe".
  6. XMLPipeDB match utility (https://sourceforge.net/projects/xmlpipedb/files/) for counting IDs in XML files
  7. Microsoft Access or any other tool that can read .mdb files

Download and Extract Data Source Files

  • Download UniProt XML, GOA, and GO OBO-XML files.

UniProt XML

  • Go to the UniProt Complete Proteomes page.
    • Browse to the complete proteome download page for your species of interest. For example, to get to the Vibrio cholerae page, first filter the list by clicking on the link for "Bacteria" under the "Superkingdom" heading.
    • Further filter the results for those species with a "Reference proteome".
    • Scroll through the results (you might need to click through several pages) until you find your organism of interest, e.g. Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
    • Click on the link for "UniProtKB", e.g. Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
    • Click the "Download" button at the top of the page. It will open a small dialog box. You need to select the following options:
      • Select the radio button to "Download all"
      • Choose "XML" from the "Format" drop-down menu.
      • Select the radio button for "Compressed" format.
      • Click the "Go" button.

GOA

  • This is the UniProt-GOA home page.
    • The current and previous UniProt-GOA files can be downloaded from the UniProt-GOA ftp site.
    • In the directory that appears, click the link to the "proteomes" directory.
      • Note that it may take some time to load this page.
    • Find your organism of interest and right-click on the link to download the GO annotations and select "Save target as" or "Save link as" and save the GOA file. For example, this is the link for Vibrio cholerae.
      • Note: Since the GOA file is a text file, your browser will not automatically download it when you left-click on the link. Instead, it will try to open the file in your browser window. Since it is a large file, this could take a long time if your internet connection is slow.
      • The version information is displayed in the FTP file directory under the "Last modified" column. You will need this information for your Gene Database Testing Report.

GO OBO-XML

  • Download the GO OBO-XML formatted file from the Gene Ontology download page. Click on the link for "obo-xml.gz" under the heading "Legacy Downloads."
    • This file is updated daily. You can get the day/time that the file was created from the file properties after you have unzipped the file.

Extract the UniProt XML and GO OBO-XML files

  • Extract the UniProt XML and GO OBO-XML .gz files using 7-zip or other utility.
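If you would rather script the extraction than use 7-zip, the following is a minimal sketch using only the Python standard library. The function name gunzip_file is made up for this illustration and is not part of GenMAPP Builder or XMLPipeDB; it writes the uncompressed file next to the original.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress a .gz file next to the original and return the new path.

    Hypothetical helper for illustration; not part of GenMAPP Builder.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copy in chunks; the XML files can be large
    return out_path
```

You would call it once per downloaded file, passing whatever name your UniProt XML or GO OBO-XML download actually has on disk.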

Download or Update GenMAPP Builder

  1. Visit the XMLPipeDB releases page on GitHub.
  2. Extract the GenMAPP Builder folder using 7-zip or other utility.
    • We suggest that you move the extracted folder to the "T:" drive on the computers in Seaver 120. This folder is a "Thawspace", the contents of which will not be deleted by the program Deep Freeze when the computer is restarted.

Create New Database in PostgreSQL

NOTE: if you have already performed this step and want to use GenMAPP Builder functions with a database you previously created in PostgreSQL, you can skip this step.

  • Launch pgAdmin III.
  • Double-click on PostgreSQL 9.4 (localhost:5432) on the upper left hand side of the window.
    • This is the equivalent of connecting you to the server and you may be asked for a password at this point.
  • Right-click on "Databases" and select "New Database..."
  • Give the database a name in the "Name" field and click OK.
    • Take some care in selecting a meaningful name. It is good practice to at least include the species and today's date in the name.
  • Double-left-click on your new database name in the treeview on the left.
  • Click on the SQL icon in the toolbar at the top of the window.
    • The SQL Editor tab will be open and there may be leftover query text in the upper pane. Delete this text. You are now going to use an XMLPipeDB query to create the tables in the database.
  • Click on the Open File icon in the toolbar (the yellow folder with an arrow).
  • Navigate to the folder in which you unzipped GenMAPP Builder.
  • Open the sql folder and open the file gmbuilder.sql. You should see SQL code appear in the SQL Editor tab.
  • Click the Execute Query icon which looks like a green "Play" triangle button.
  • You should get a series of NOTICE messages in the Messages tab at the bottom of the window, concluding with a message like "Query returned successfully with no result in 15583 ms". This query has now created all of the tables in the database (although there is still no data in them).
  • Close the query window (you don't need to save the query because you have already run it).
  • To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public. Under the Tables section, you should see a count of 167 in parentheses.

Configuring GenMAPP Builder to Connect to your PostgreSQL Database

  • Launch gmbuilder.bat.
    • If the program does not detect a database configuration, you will see a message window to this effect and the configuration dialog will open automatically once you close the message window. Otherwise:
  • Select the menu item File > Configure Database...
  • Under the Database Connections tab the Database Driver defaults to PostgreSQL. Enter information in the following fields:
    • Host or address: localhost
    • Port number: 5432
    • Database name: <enter the name of the PostgreSQL database you created above>
    • Username: <enter the username of the PostgreSQL database you created above>; in S120, this username is "postgres"
    • Password: <enter the password of the PostgreSQL database you created above>; in S120, ask the instructors for the password.
  • Click the OK button.

Importing Data into the PostgreSQL Database

  • Select File > Import UniProt XML...
    • Navigate to the UniProt XML file that you extracted previously and click the Open button.
    • This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process has completed, record the elapsed time from the message window that appears.
      • Import Time: 2.92 minutes
  • Select File > Import GO OBO-XML...
    • Navigate to the GO OBO-XML file that you extracted previously. Click the Open button.
    • This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process has completed, record the elapsed time from the message window that appears.
      • Import Time: 6.88 minutes
  • Click OK to the message asking you to process the GO data.
    • This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process has completed, record the elapsed time from the message window that appears.
      • Processing Time: 4.49 minutes
  • Select File > Import GOA...
    • Navigate to the GOA file that you downloaded previously and click the Import button. This process should only take a minute or so.
      • Import Time: 0.05 minutes

Exporting a GenMAPP Gene Database (.gdb)

  • Select File > Export to GenMAPP Gene Database...
  • Type a name in the Owner field (or else it won't let you export).
    • When doing the individual exercise for the Week 9 assignment, use your own name. When doing this for your team project, use your team's name.
  • GenMAPP Builder scans your PostgreSQL database to see what species are available. Click on the species that you would like to export, then click Next to continue.
  • Create GenMAPP Database: click on the "Save GenMAPP Database File As..." button.
    • In the Save dialog box that appears, navigate to the "T:" drive, and then modify the default file name by appending your initials. Click the "Save" button.
  • Leave the boxes checked for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Click the "Next" button to begin the export process.
    • Record the starting and ending times from the black console window. This will take 1-2 hours for a typical bacterial genome, depending on the size of the database, the processor speed, and available memory. Large eukaryotic genomes (like Arabidopsis thaliana) or genomes with many GO annotations (like Saccharomyces cerevisiae) can take much longer, in the range of 12-24 hours.
    • NOTE: the progress bar is not accurate.
      • Start Time: 4:44 PM (restarted export after class)
      • End Time: 6:11 PM
      • Elapsed Time: 1 hour, 27 minutes
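As a quick check on the elapsed time recorded above, here is a small sketch that subtracts the two clock times. It assumes both times fall on the same day and are written in the "4:44 PM" style used in these notes; the function name elapsed_minutes is invented for this example.

```python
from datetime import datetime


def elapsed_minutes(start, end, fmt="%I:%M %p"):
    """Return whole minutes between two clock times on the same day.

    Assumption: the export does not run past midnight.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


minutes = elapsed_minutes("4:44 PM", "6:11 PM")
print(f"{minutes // 60} hour(s), {minutes % 60} minute(s)")
```

For an export that runs overnight, you would need to include the dates in the timestamps as well.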

Checking the Quality of your Exported Gene Database

  • It is a good idea to check the quality of your exported Gene Database to make sure that all of the data from the XML files made it into the PostgreSQL database and was then exported to the GenMAPP Gene Database. We have created a Gene Database Testing Report Sample to help guide you through this process.

Tally Engine

The first tool, called the Tally Engine, can be used for verifying that certain data from the XML file transferred consistently into the PostgreSQL database upon import. The Tally Engine can be found in GenMAPP Builder itself.

  1. Run PostgreSQL (via pgAdmin III on Windows) and make sure that your database is up and running.
  2. Run GenMAPP Builder and make sure that it is connected to the database (via Configure Database...).
  3. After performing an import, choose Run XML and Database Tallies for UniProt and GO....
  4. Choose the UniProt and GO files that you imported.
  5. You should see a table for selected data items, and how many of each were found.

Tally-results.png

Under the hood, the Tally Engine bases its XML counts on certain XML tags, and bases its database counts on SQL queries using count. This tool is thus primarily useful for making sure that the “raw” import worked without any errors or glitches.

My Tally Results:

TallyresultsBK1029.png

XMLPipeDB Match

XMLPipeDB Match is useful for counting data in files. Thus, in our context, you would use XMLPipeDB Match to tally items in XML files, with greater flexibility than the Tally Engine offers.

You will have to use XMLPipeDB Match from the command line. In addition, you can use it on any platform (as you have seen). Download the application from the XMLPipeDB SourceForge site and take note of the location of the xmlpipedb-match-1.1.1.jar file. Then, on the command line (Terminal for Linux and Mac OS X, cmd on Windows), cd to the folder containing the XML file that you would like to check. Use XMLPipeDB Match as follows, with the parts in parentheses varying depending on your specific setup, desired pattern, and file being scanned:

java -jar (location-of-jar) "(pattern)" < (XML file)

On a Windows machine, with XMLPipeDB Match and a Vibrio cholerae XML file located on the Desktop, scanning for IDs of the form VC_####, where # represents a digit from 0 to 9, one would type, after cd-ing to the Desktop:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-taxonomy%3A243277.xml

As you have seen before, this will give you a list of unique matches, with a total number at the bottom.

The trick with XMLPipeDB Match is to use the patterns well: with the database project, you will mainly be matching IDs. The desired outcome is an XMLPipeDB Match count for your ID pattern that equals the number of IDs found by the Tally Engine.
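To experiment with a pattern before running it against a full XML file, a rough sketch like the one below can help. It only mimics the idea of listing unique matches with a total; it is not how XMLPipeDB Match is implemented, and the sample XML fragment and the unique_matches function are invented for this illustration.

```python
import re


def unique_matches(pattern, text):
    """Return the sorted list of distinct substrings that match pattern."""
    return sorted(set(re.findall(pattern, text)))


# Made-up fragment: VC_0001 appears twice but should be counted once.
sample = (
    '<name type="ordered locus">VC_0001</name>\n'
    '<name type="ordered locus">VC_0002</name>\n'
    '<dbReference id="VC_0001"/>\n'
)
ids = unique_matches(r"VC_[0-9][0-9][0-9][0-9]", sample)
print(ids, len(ids))
```

Because duplicates are collapsed, the total is a count of distinct IDs, which is the figure you would compare against the other tools.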

SQL

You can also look for counts at the SQL level, using some variation of a select count(*) query. This requires some knowledge of which table received what data. Here’s an initial tip: the gene/name tags in the XML file land in the genenametype table. A query that counts the values in this table that were marked as ordered locus in the XML file and that match the pattern VC_[0-9][0-9][0-9][0-9] would look like this:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

In pgAdmin III, you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the SQL Editor tab, then clicking on the green triangular Play button to run.

Pgadminiii-query.png
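If you want to see how the type filter and the pattern filter combine before you run the query on the real database, the sketch below reproduces the shape of the query on a toy table. Several things here are assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL, the table has only the two columns the query touches, the rows are made up, and a registered REGEXP function takes the place of PostgreSQL's ~ operator.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite calls this for "value REGEXP pattern".
    # re.search is unanchored, like PostgreSQL's ~ operator.
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Toy stand-in for the real genenametype table (made-up rows).
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "VC_0001"),
        ("ordered locus", "VC_A0002"),  # does not fit VC_ plus four digits
        ("primary", "dnaA"),            # wrong type, so it is excluded
        ("ordered locus", "VC_0003"),
    ],
)

(count,) = conn.execute(
    "SELECT count(*) FROM genenametype "
    "WHERE type = 'ordered locus' "
    "AND value REGEXP 'VC_[0-9][0-9][0-9][0-9]'"
).fetchone()
print(count)
```

Rows that fail either condition drop out of the count, which is why an ID format you did not anticipate can make the SQL count differ from the XML count.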

Microsoft Access

For the GenMAPP Gene Database, you can open the .gdb in Microsoft Access and navigate its tables to find counts for various IDs. Opening the table, noting its size, and doing some sorting may help. You can also look at the OriginalRowCounts table for a summary of totals.

Again, the ideal situation is a correspondence in these numbers with what you found in XML and the relational database.

Back to the Command Line

Amidst all this, you can still use grep and wc on the command line for some basic counting. Just remember that these tools work on a line-by-line basis; useful in some cases, but not useful in others.
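The line-by-line caveat can be seen in a small sketch. The text below is made up, and the two counts only imitate what a grep piped into wc -l would report versus a tally of every individual match; they are not output from those tools.

```python
import re

# Made-up text: the first line holds two IDs.
text = (
    "VC_0001 VC_0002\n"
    "no identifiers here\n"
    "VC_0003\n"
)
pattern = re.compile(r"VC_[0-9][0-9][0-9][0-9]")

# Line-based count: lines that contain at least one match.
matching_lines = sum(1 for line in text.splitlines() if pattern.search(line))

# Match-based count: every individual match, even several on one line.
total_matches = len(pattern.findall(text))

print(matching_lines, total_matches)
```

When a file places more than one ID on a line, the line-based count comes out lower than the number of IDs.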

You can use grep and wc with the various files on the my.cs.lmu.edu server by using the curl -O command shown in the Week 6 assignment. Upload your data files to the wiki, place media links to them on your wiki page, then mouse over those live links to capture their URL (like this—look at the source to see the wiki markup), then use curl -O (whatever-the-url-is) while ssh-ed to my.cs.lmu.edu to bring that file into the server.

  • If the file is a .zip file, you can use unzip at the command line to unzip it.
  • If the file is a .gz file, you can use gunzip at the command line to uncompress that one.

