Emilysimso Week 9
Contents
- 1 Directions
- 2 Notes
- 3 TallyEngine
- 4 Using XMLPipeDB match to Validate the XML Results from the TallyEngine
- 5 Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
- 6 OriginalRowCounts Comparison
- 7 Visual Inspection
- 8 .gdb Use in GenMAPP
- 9 Compare Gene Database to Outside Resource
Directions
Download and Extract Data Source Files
- Downloaded UniProt XML, GOA, and GO OBO-XML files.
UniProt XML
- Went to the UniProt Complete Proteomes page.
- To get to the Vibrio cholerae page, first filtered the list by clicking on the link for "Bacteria" under the "Superkingdom" heading.
- Further filtered the results for those species with a "Reference proteome".
- Scrolled through the results until I found Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
- Clicked on the "UniProtKB" link for Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961).
- Clicked the "Download" button at the top of the page and selected the following options:
- Selected the radio button to "Download all"
- Chose "XML" from the "Format" drop-down menu.
- Selected the radio button for "Compressed" format.
- Clicked the "Go" button.
GOA
- This is the UniProt-GOA home page.
- The current and previous UniProt-GOA files can be downloaded from the UniProt-GOA ftp site.
- In the directory that appears, clicked the link to the "proteomes" directory.
- Note that it may take some time to load this page.
- Found my organism of interest and right-clicked on the link to download the GO annotations and selected "Save target as" or "Save link as" and saved the GOA file. For example, this is the link for Vibrio cholerae.
- Note: Since the GOA file is a text file, your browser will not automatically download it when you left-click on the link. Instead, it will try to open the file in your browser window. Since it is a large file, this could take a long time if your internet connection is slow.
- The version information is displayed in the FTP file directory under the "Last modified" column (needed for my Gene Database Testing Report).
GO OBO-XML
- Downloaded the GO OBO-XML formatted file from the Gene Ontology download page. Clicked on the link for "obo-xml.gz" under the heading "Legacy Downloads."
- This file is updated daily. You can get the day/time that the file was created from the file properties after you have unzipped the file.
Extract the UniProt XML and GO OBO-XML files
- Extracted the UniProt XML and GO OBO-XML .gz files using 7-Zip.
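If 7-Zip is not available, the same extraction can be scripted. The following is a minimal sketch using only the Python standard library; the function name gunzip is made up for illustration, and it assumes the downloaded files end in ".gz".

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the extracted file.

    Hypothetical helper for illustration; not part of GenMAPP Builder.
    """
    src = Path(src)
    # Default output name: the same path with the trailing ".gz" removed.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Copy in chunks so a large XML file is never held in memory at once.
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, calling gunzip on a file named go_daily-termdb.obo-xml.gz would write go_daily-termdb.obo-xml next to it.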
Download or Update GenMAPP Builder
- Visited the XMLPipeDB releases page on GitHub.
- Extracted the GenMAPP Builder folder using 7-zip or other utility.
- We suggest that you move the extracted folder to the "T:" drive on the computers in Seaver 120. This folder is a "Thawspace", the contents of which will not be deleted by the program Deep Freeze when the computer is restarted.
Create New Database in PostgreSQL
NOTE: if you have already performed this step and want to use GenMAPP Builder functions with a database you previously created in PostgreSQL, you can skip this step.
- Launched pgAdmin III.
- Double-clicked on PostgreSQL 9.4 (localhost:5432) on the upper left hand side of the window.
- Right-clicked on "Databases" and selected "New Database..."
- Gave the database a name in the "Name" field and clicked OK (named Esimso V. cholerae 20151027 GVMBbuild5)
- Double-left-clicked on my new database name in the treeview on the left.
- Clicked on the SQL icon in the toolbar at the top of the window.
- The SQL Editor tab opened and there was leftover query text in the upper pane. Deleted this text. You are now going to use an XMLPipeDB query to create the tables in the database.
- Clicked on the Open File icon in the toolbar (the yellow folder with an arrow).
- Navigated to the folder in which I unzipped GenMAPP Builder.
- Opened the sql folder and opened the file gmbuilder.sql.
- Clicked the Execute Query icon, which looks like a green "Play" triangle button.
- This query created all the tables in the database (although there is no data in them yet).
- Closed the query window.
- To double check that all was OK, clicked the + sign for the database, then the + sign for Schemas, then finally the + sign for public. Under the Tables section, I saw a count of 167 in parentheses.
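As a rough cross-check of the table count, the number of CREATE TABLE statements in gmbuilder.sql can be counted with a short script. This is only a sketch: it assumes every table is created by a CREATE TABLE statement that starts a line, which I have not verified against the layout of gmbuilder.sql, and the function name is made up for illustration.

```python
import re

# Matches "create table" at the start of a line, in any letter case.
# Assumption: gmbuilder.sql creates each table with one such statement.
CREATE_TABLE = re.compile(r"^\s*create\s+table\b", re.IGNORECASE | re.MULTILINE)


def count_create_table(sql_text):
    """Return the number of CREATE TABLE statements found in the SQL text."""
    return len(CREATE_TABLE.findall(sql_text))
```

Reading the file with open("gmbuilder.sql").read() and passing the text to count_create_table should give a number comparable to the count that pgAdmin III shows under Tables, if the assumption above holds.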
Configuring GenMAPP Builder to Connect to your PostgreSQL Database
- Launched gmbuilder.bat.
- Selected the menu item File > Configure Database...
- Under the Database Connections tab the Database Driver defaults to PostgreSQL. Entered information in the following fields:
- Host or address: localhost
- Port number: 5432
- Database name: <enter the name of the PostgreSQL database you created above>
- Username: <enter the username of the PostgreSQL database you created above>; in S120, this username is "postgres"
- Password: <enter the password of the PostgreSQL database you created above>; in S120, ask the instructors for the password.
- Clicked the OK button.
Importing Data into the PostgreSQL Database
- Selected File > Import UniProt XML...
- Navigated to the UniProt XML file that I extracted previously and clicked the Open button.
- This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process completed, recorded the elapsed time from the message window that appeared.
- Selected File > Import GO OBO-XML...
- Navigated to the GO OBO-XML file that I extracted previously. Clicked the Open button.
- This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process completed, recorded the elapsed time from the message window that appeared.
- Clicked OK to the message asking me to process the GO data.
- This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine. When the process completed, recorded the elapsed time from the message window that appeared.
- Selected File > Import GOA...
- Navigated to the GOA file that I downloaded previously and clicked the Import button. This process only took a minute or so.
Notes
Version of GenMAPP Builder: gmb3build5
Computer on which export was run: HP LV2311
Postgres Database name: Esimso V. cholerae 20151027 GVMBbuild5
UniProt XML filename - uniprot-organism%3A243277.xml NEED LINK:
- UniProt XML version (The version information can be found at the UniProt News Page):
- UniProt XML download link:
- Time taken to import: 3.14 minutes
GO OBO-XML filename - go_daily-termdb.obo-xml NEED LINK:
- GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
- GO OBO-XML download link:
- Time taken to import: 7.36 minutes
- Time taken to process: 4.37 minutes
- Note: took a long time to import
GOA filename - 46.V_cholerae_ATCC_39315.goa NEED LINK:
- GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site):
- GOA download link:
- Time taken to import: 0.07 minutes
- Note:
Name of .gdb file - Vc-Std_20151027_ES.gdb NEED LINK:
- Time taken to export:
- Start time: 3:52:08 PM
- End time:
- Note: had to leave due to class ending
TallyEngine
- Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
- Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
- Take a screenshot of the results. Upload the image to the wiki and display it on this page.
- For more information, see this page.
Using XMLPipeDB match to Validate the XML Results from the TallyEngine
Follow the instructions found on this page to run XMLPipeDB match.
Are your results the same as you got for the TallyEngine? Why or why not?
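To see why a raw pattern match on the XML file can differ from the TallyEngine, it helps to count matches directly. The sketch below is not XMLPipeDB match itself, and I am not claiming it reproduces that tool's output. It just counts total and unique occurrences of a regular expression in lines of text; the function name is hypothetical.

```python
import re
from collections import Counter


def count_matches(lines, pattern):
    """Return (total_matches, unique_matches) for a regex over lines of text."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        # finditer + group(0) always yields the whole match,
        # even if the pattern contains capture groups.
        counts.update(m.group(0) for m in regex.finditer(line))
    return sum(counts.values()), len(counts)
```

Passing an open file handle for the UniProt XML file as lines, with the pattern VC_[0-9][0-9][0-9][0-9], counts every place the pattern occurs in the file, not only the occurrences inside ordered locus name tags. That is one possible reason for the total to be larger than the TallyEngine count.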
Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
For more information, see this page.
You can also look for counts at the SQL level, using some variation of a select count(*) query. This requires some knowledge of which table received what data. Here’s an initial tip: the gene/name tags in the XML file land in the genenametype table. A query that counts the values in this table that were marked as ordered locus in the XML file and that match the pattern VC_[0-9][0-9][0-9][0-9] would look like this:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
In pgAdmin III, you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the SQL Editor tab, then clicking on the green triangular Play button to run.
Are your results the same as reported by the TallyEngine? Why or why not?
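One thing to keep in mind when comparing counts: the PostgreSQL ~ operator succeeds if the pattern matches anywhere in the value, so it is not anchored to the start or end of the string. The sketch below imitates the where clause of the query in Python to show the difference between an unanchored and an anchored match. The function name and any sample rows used with it are hypothetical, not data from my database.

```python
import re

PATTERN = "VC_[0-9][0-9][0-9][0-9]"


def count_rows(rows, pattern=PATTERN, anchored=False):
    """Count (type, value) rows the way the SQL where clause would.

    anchored=False imitates PostgreSQL's ~ (match anywhere in the value);
    anchored=True requires the whole value to match the pattern.
    """
    match = re.fullmatch if anchored else re.search
    return sum(
        1
        for kind, value in rows
        if kind == "ordered locus" and match(pattern, value)
    )
```

With made-up rows such as ("ordered locus", "VC_0001"), ("ordered locus", "VC_A0001"), ("ordered locus", "VC_00031"), and ("ORF", "VC_0002"), the unanchored count is 2 and the anchored count is 1, because VC_00031 contains a match without being one. If the SQL count is higher than expected, wrapping the pattern as ^VC_[0-9][0-9][0-9][0-9]$ in the query is one way to test this.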
OriginalRowCounts Comparison
Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
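Once the two OriginalRowCounts tables have been copied out, the comparison can be done mechanically. This is a minimal sketch that assumes each table has been turned into a Python dictionary mapping table name to row count; the function name and that representation are my own choices for illustration.

```python
def compare_row_counts(benchmark, new):
    """Return {table: (benchmark_count, new_count)} for every table that differs.

    A count of None means the table is absent from that .gdb file.
    """
    diffs = {}
    for table in sorted(set(benchmark) | set(new)):
        b, n = benchmark.get(table), new.get(table)
        if b != n:
            diffs[table] = (b, n)
    return diffs
```

An empty result means both files have the same tables with the same number of records. Any entry in the result is a table worth a closer look, although a difference can simply reflect newer source data rather than a problem.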
Benchmark .gdb file:
Copy the OriginalRowCounts tables from the benchmark and new .gdb files and paste them here:
Note:
Visual Inspection
Perform visual inspection of individual tables to see if there are any problems.
- Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
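Scrolling by eye can miss an odd ID, so a pattern check over the exported ID columns can help. The sketch below uses the accession number pattern that UniProt documents for its accessions. The RefSeq and OrderedLocusNames patterns are my own assumptions about the expected forms (two capital letters, an underscore, digits, and an optional version for RefSeq; VC_#### or VC_A#### for V. cholerae), so adjust them if the tables show other legitimate forms.

```python
import re

ID_PATTERNS = {
    # Accession pattern as documented by UniProt.
    "UniProt": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    # Assumption: prefix like NP_ or YP_, digits, optional ".version".
    "RefSeq": re.compile(r"[A-Z]{2}_[0-9]+(?:\.[0-9]+)?"),
    # Assumption: V. cholerae locus names look like VC_0001 or VC_A0001.
    "OrderedLocusNames": re.compile(r"VC_A?[0-9]{4}"),
}


def malformed_ids(system, ids):
    """Return the IDs that do not fully match the pattern for the given system."""
    pattern = ID_PATTERNS[system]
    return [i for i in ids if not pattern.fullmatch(i)]
```

An empty list for each system suggests the IDs take the expected form. Anything returned should be checked by hand before deciding it is an error.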
Note:
.gdb Use in GenMAPP
Note:
Putting a gene on the MAPP using the GeneFinder window
- Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.
Note:
Creating an Expression Dataset in the Expression Dataset Manager
- How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML but were somehow not imported, or were they not present in the UniProt XML?
Note:
Coloring a MAPP with expression data
Note:
Running MAPPFinder
Note:
Compare Gene Database to Outside Resource
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check how complete your list of OrderedLocusNames IDs is against the original source of the data (the sequencing organization, the model organism database (MOD), etc.). Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
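The completeness check comes down to comparing two lists of IDs. A minimal sketch, assuming both lists have already been exported to plain Python lists of strings (the function name and result keys are made up for illustration):

```python
def compare_id_sets(gdb_ids, reference_ids):
    """Compare Gene Database IDs against IDs from the outside resource."""
    gdb, ref = set(gdb_ids), set(reference_ids)
    return {
        # In the outside resource but not in the Gene Database.
        "missing_from_gdb": sorted(ref - gdb),
        # In the Gene Database but not in the outside resource.
        "extra_in_gdb": sorted(gdb - ref),
        # Number of IDs present in both.
        "shared": len(gdb & ref),
    }
```

IDs listed under missing_from_gdb are the ones to investigate. Given the point above, some of them may turn out to be non-protein features such as RNA genes, which would not be expected in a database built from UniProt.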
Note: