Difference between revisions of "Ksherbina Week 9"
(→Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine: Recorded the results.) |
(→Check the Quality of the Exported Database: Added steps on how to use the .gdb file in Excel.) |
||
Line 171: | Line 171: | ||
*Count the number of unique gene IDs using the following query: | *Count the number of unique gene IDs using the following query: | ||
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]' | select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]' | ||
+ | *Open Microsoft Access. | ||
+ | *Open the .gdb file (Vc-Std_KS_20131022.gdb) that was exported using GenMAPP Builder. | ||
+ | *In the Tables window, double-click on the table named ''OriginalRowCounts''. |
Revision as of 06:34, 25 October 2013
Assignment Description | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | Week 10 | Week 11 | Week 12 | Week 13 | Week 15 |
Class Journal | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | |||||
Individual Journal | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Week 9 | Week 10 | Week 11 |
Other | Week 5: Database Wiki |
Final Project | Team H(oo)KD Project Page | Journal Club Presentation | Project Individual Journal |
Gene Database Testing Report
The following records the results of going through the procedure outlined in the Lab Notebook section below.
Export Information
Version of GenMAPP Builder: 2.0b70
Computer on which export was run: Personal computer
Postgres Database name: VC_KS_20131022_gmb2b70
UniProt XML filename: UniProt_V_cholerae_KSTV_20131022.xml
- UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2013_10
- Time taken to import: 4.20 minutes
GO OBO-XML filename: go_daily-termdb_KS_20131022.obo-xml
- GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped):
- Time taken to import: 8.81 minutes
- Time taken to process: 7.33 minutes
GOA filename: 46.V_cholerae_ATCC_39315_KS_20131022.goa
- GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Latest version released 10/16/2013
- Time taken to import: 0.10 minutes
Name of .gdb file: Vc-Std_KS_20131022.gdb
- Time taken to export .gdb:
- Start Time: 6:50 pm
- End Time: 9:22 pm
- Link to file: Vc-Std KS 20131022.gdb
Note:
TallyEngine
Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the PostgreSQL databases (or you can upload and link to a screenshot of the results):
Using XMLPipeDB match to Validate the XML Results from the TallyEngine
Follow the instructions found on this page to run XMLPipeDB match.
Are your results the same as you got for the TallyEngine? Why or why not?
- When searching for gene IDs with the pattern "VC_(A|)[0-9][0-9][0-9][0-9]", the results were the same as those for TallyEngine. In both cases, 3831 matches were found. However, when first searching for the pattern "VC_[0-9][0-9][0-9][0-9]", only 2738 matches were found.
Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine
Follow the instructions on this page to query the PostgreSQL Database.
- There were 3831 matches found using an SQL query (the same as with TallyEngine).
OriginalRowCounts Comparison
Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
Benchmark .gdb file: (for the Week 9 Assignment, use the "Vc-Std_External_20101022.gdb" as your benchmark, downloadable from here.
Copy the OriginalRowCounts table and paste it here:
Note:
Visual Inspection
Perform visual inspection of individual tables to see if there are any problems.
- Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
Note:
.gdb Use in GenMAPP
Note:
Putting a gene on the MAPP using the GeneFinder window
- Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.
Note:
Creating an Expression Dataset in the Expression Dataset Manager
- How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
Note:
Coloring a MAPP with expression data
Note:
Running MAPPFinder
Note:
Compare Gene Database to Outside Resource
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
Note:
Lab Notebook
- Download GenMAPP Builder 2.0b70 from http://sourceforge.net/projects/xmlpipedb/files/GenMAPP%20Builder/
- When the zipped file has downloaded, unzip the folder with 7-zip.
- Download XMLPipeDB Match from http://sourceforge.net/projects/xmlpipedb/files/XMLPipeDB%20Match/
- When the zipped file has downloaded, unzip the folder with 7-zip.
- Download the UniProt XML file for Vibrio cholerae
- Go to the page http://www.uniprot.org/uniprot/?query=organism:243277+keyword:1185
- Click the orange Download link in the upper right-hand corner of the page.
- Download the XML file to the "Downloads" folder on the computer.
- Download the file 46.V_cholerae_ATCC_39315.goa to the "Downloads" folder on the computer.
- Download the GO OBO-XML formatted file from the Gene Ontology download page to the "Downloads" folder on the computer. Click on the link for obo-xml.gz.
- When the zipped file has downloaded, unzip the folder with 7-zip.
Import Data into the PostgreSQL Database
- Launch pgAdmin III.
- Double-click on PostgreSQL 9.2 (localhost:5432) on the upper left hand side of the window. Type in the password to connect.
- Right click on Databases and select New Database...
- Name the new database in the following format and click OK: VC_<your initials>_20131022_gmb2b70 (ex. VC_KS_20131022_gmb2b70).
- Click on your new database name in the treeview on the left.
- Click on the SQL icon in the toolbar at the top of the window.
- Click on the Open File icon in the toolbar (the yellow folder with an arrow).
- Navigate to the folder in which you unzipped GenMAPP Builder.
- Open the sql folder and open the file gmbuilder.sql. You should see SQL code appear in the SQL Editor tab.
- Click the Execute Query icon which looks like a green “Play” triangle button.
- Close the query window.
- To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public. Under the Tables section, you should see a count of 159 in parentheses.
Export a GenMAPP Gene Database Using GenMAPP Builder
- Keep pgAdminIII open. Go to the gmbuilder-2.0b70 folder in the Downloads folder and launch gmbuilder-32bit.bat
- Select the menu item File > Configure Database...
- Under the Database Connections tab,
- The Database Driver defaults to PostgreSQL. Enter the following information into the corresponding fields:
- Host or address: localhost
- Port number: 5432
- Database name: VC_KS_20131022_gmb2b70
- Username: postgres
- Password: <enter the password of the PostgreSQL database you created above>
- Click the OK button.
- Select File > Import UniProt XML...
- Navigate to the UniProt XML file that you extracted previously and click the Open button.
- Select File > Import GO OBO-XML...
- Navigate to the GO OBO-XML file that you extracted previously. Click the Open button.
- Click OK to the message asking you to process the GO data.
- Select File > Import GOA...
- Navigate to the GOA file that you downloaded previously and click the Import button.
- Select File > Export to GenMAPP Gene Database...
- Type a name in the Owner field (ex. LMU_Fall2013_BIOL367_KS).
- GenMAPP Builder scans your PostgreSQL database to see what species are available. Click on the species that you would like to export, then click Next to continue.
- Click on the Save GenMAPP Database File As... button. Make sure that the Downloads folder appears. Modify the default file name to include your initials and then click on Save.
- Click the Next button to start the export process.
Check the Quality of the Exported Database
- After performing an export, choose Run XML and Database Tallies for UniProt and GO....
- Choose the UniProt and GO files that you imported.
- Open the Command Prompt on your computer.
- Change the directory to where the XMLPipeDB match tool is located.
cd C:\Users\Katrina\Downloads
- Use XMLPipeDB Match to count the number of unique gene ideas in the V. cholerae Uniprot XML file.
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022
- Since some of the gene IDs are in the format "VC_A####" rather than "VC_####", try the following command to determine the number of unique gene IDs.
java -jar xmlpipedb-match-1.1.1.jar "VC_(A|)[0-9][0-9][0-9][0-9]" < UniProt_V_cholerae_KSTV_20131022
- Go back to pgAdminIII (it should still be open).
- Click on the button to execute SQL queries.
- Count the number of unique gene IDs using the following query:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_(A|)[0-9][0-9][0-9][0-9]'
- Open Microsoft Access.
- Open the .gdb file (Vc-Std_KS_20131022.gdb) that was exported using GenMAPP Builder.
- In the Tables window, double-click on the table named OriginalRowCounts.