Difference between revisions of "Bklein7 Week 9"

From LMU BioDB 2015
Jump to: navigation, search
(added/edited Downloading Data Source Files and GenMAPP Builder)
(added apostrophe)
 
(31 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Screenshots==
+
==Files Asked for in the Gene Database Testing Report==
[[File:TallyresultsBK1029.png]]
+
For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file: [[File:Bklein7 Week9 Files.zip]].
[[File:PostgresIDsBK1029.png]]
+
[[File:PostgresIDsBK1029 UPDATED.png]]
+
[[File:XmlpipedbmatchoutputBK1029.png]]
+
[[File:XmlpipedbmatchoutputBK1029 Updated.png]]
+
  
 
==Pre-requisites==
 
==Pre-requisites==
The following set of software was used in the creation and testing of the 'Vibrio cholerae'' gene database:
+
The following set of software was used in the creation and testing of the ''Vibrio cholerae'' gene database:
  
 
# [http://www.7-zip.org/ 7-zip]tool that for unpacking .gz and .zip files
 
# [http://www.7-zip.org/ 7-zip]tool that for unpacking .gz and .zip files
Line 20: Line 16:
 
===Downloading Data Source Files and GenMAPP Builder===
 
===Downloading Data Source Files and GenMAPP Builder===
  
*I download the UniProt XML, GOA, and GO OBO-XML files for ''Vibrio cholerae''along with the GenMAPP Builder program.
+
*I download the UniProt XML, GOA, and GO OBO-XML files for ''Vibrio cholerae'' along with the GenMAPP Builder program.
 
**All files were saved to the folder ''Bklein7_Week9'' on my computer's ThawSpace.
 
**All files were saved to the folder ''Bklein7_Week9'' on my computer's ThawSpace.
 
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
 
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
Line 41: Line 37:
 
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.
 
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.
  
===GO OBO-XML===
+
====GO OBO-XML====
  
 
* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
 
* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
Line 51: Line 47:
 
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].
 
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].
  
==Create New Database in PostgreSQL==
+
===Creating the New Database in PostgreSQL===
  
NOTE: if you have already performed this step and want to use GenMAPP Builder functions with a database you previously created in PostgreSQL, you can skip this step.
+
* I launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
 +
** On this server, I created a new database: ''V.cholerae_20151027_gmb3build5''.
 +
** I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
 +
*** I clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
 +
*** I clicked on the Execute Query icon to run this command.
 +
***In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.
  
* Launch pgAdmin III.
+
===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===
* Double-click on PostgreSQL 9.4 (localhost:5432) on the upper left hand side of the window.
+
** This is the equivalent of connecting you to the server and you may be asked for a password at this point.
+
* Right click on "Databases" and Select "New Database..."
+
* Give the database a name in the "Name" field and click OK.
+
** Take some care in selecting a meaningful name.  It is good practice to at least include the species and today's date in the name.
+
* Double-left-click on your new database name in the treeview on the left.
+
* Click on the SQL icon in the toolbar at the top of the window.
+
** The SQL Editor tab will be open and there may be leftover query text in the upper pane.  Delete this text.  You are now going to use an XMLPipeDB query to create the tables in the database.
+
* Click on the Open File icon in the toolbar (the yellow folder with an arrow).
+
* Navigate to the folder in which you unzipped GenMAPP Builder.
+
* Open the ''sql'' folder and open the file ''gmbuilder.sql''.  You should see SQL code appear in the SQL Editor tab.
+
* Click the Execute Query icon which looks like a green "Play" triangle button.
+
* You should get a series of NOTICE messages in the Messages tab at the bottom of the window, concluding with a message like "Query returned successfully with no result in 15583 ms" in the end.  This query now created all the tables in the database (although there is still no data in them).
+
* Close the query window (you don't need to save the query because you have already run it).
+
* To double check that all is OK, click the + sign for the database, then the + sign for Schemas, then finally the + sign for public.  Under the Tables section, you should see a count of 167 in parentheses.
+
  
==Configuring GenMAPP Builder to Connect to your PostgreSQL Database==
+
* To begin, I launched gmbuilder.bat.
 
+
* I selected the "Configure Database" option and entered the following information into the fields below:
* Launch gmbuilder.bat.
+
** If the program does not detect a database configuration, you will see a message window to this effect and the configuation dialog will open automatically once you close the message window.  Otherwise:
+
* Select the menu item File > Configure Database...
+
* Under the Database Connections tab the Database Driver defaults to PostgreSQL.  Enter information in the following fields:
+
 
** Host or address: localhost
 
** Host or address: localhost
 
** Port number: 5432
 
** Port number: 5432
** Database name: <enter the name of the PostgreSQL database you created above>
+
** Database name: V.cholerae_20151027_gmb3build5
** Username: <enter the username of the PostgreSQL database you created above>; in S120, this username is "postgres"
+
** Username: postgres
** Password: <enter the password of the PostgreSQL database you created above>; in S120, ask the instructors for the password.
+
** Password: Welcome1
* Click the OK button.
+
  
==Importing Data into the PostgreSQL Database==
+
===Importing Data into the PostgreSQL Database===
  
* Select File > Import UniProt XML...
+
*The downloaded data files for ''Vibrio cholerea'' were specified and imported into the database by clicking on the following buttons:
** Navigate to the UniProt XML file that you extracted previously and click the Open button.
+
** Selected File > Import UniProt XML...
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
** Selected File > Import GO OBO-XML...
***Import Time: 2.92 minutes
+
** Clicked OK to the message asking to process the GO data.
* Select File > Import GO OBO-XML...
+
** Selected File > Import GOA...
** Navigate to the GO OBO-XML file that you extracted previously. Click the Open button.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Import Time: 6.88 minutes
+
* Click OK to the message asking you to process the GO data.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Processing Time: 4.49 minutes
+
* Select File > Import GOA...
+
** Navigate to the GOA file that you downloaded previously and click the Import button.  This process should only take a minute or so.
+
***Import Time: 0.06 minutes
+
==Exporting a GenMAPP Gene Database (.gdb)==
+
  
* Select File > Export to GenMAPP Gene Database...
+
===Exporting a GenMAPP Gene Database (.gdb)===
* Type a name in the Owner field (or else it won't let you export).
+
** When doing the individual exercise for the [[Week 9]] assignment, use your own name.  When doing this for your team project, use your team's name.
+
* GenMAPP Builder scans your PostgreSQL database to see what species are available.  Click on the species that you would like to export, then click ''Next'' to continue.
+
* Create GenMAPP Database: click on the "Save GenMAPP Database File As..." button.
+
** In the Save dialog box that appears, navigate to the "T:" drive, and then modify the default file name by appending your initials.  Click the "Save" button
+
* Leave the boxes checked for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
+
* Click the "Next" button to begin teh export process.
+
** Record the starting and ending times from the black console window.  This will take 1-2 hours for a typical bacterial genome, depending on the size of the database, the processor speed, and available memory.  Large eukaryotic genomes (like ''Arabidopsis thaliana'') or genomes with many GO annotations (like ''Saccharomyces cerevisiae'') can take much longer, in the range of 12-24 hours.
+
**NOTE:  the progress bar is not accurate.
+
***Start Time: 4:44 PM (restarted export after class)
+
***End Time: 6:11 PM
+
***Elapsed Time: 1 hour, 27 minutes
+
 
+
==Importing Data into the PostgreSQL Database==
+
 
+
* Select File > Import UniProt XML...
+
** Navigate to the UniProt XML file that you extracted previously and click the Open button.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Import Time: 2.92 minutes
+
* Select File > Import GO OBO-XML...
+
** Navigate to the GO OBO-XML file that you extracted previously.  Click the Open button.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Import Time: 6.88 minutes
+
* Click OK to the message asking you to process the GO data.
+
** This should take about 5-10 minutes, but may take longer depending on the size of the file, processor speed, and available memory of the machine.  When the process has completed, record the elapsed time from the message window that appears.
+
***Processing Time: 4.49 minutes
+
* Select File > Import GOA...
+
** Navigate to the GOA file that you downloaded previously and click the Import button.  This process should only take a minute or so.
+
***Import Time: 0.06 minutes
+
==Exporting a GenMAPP Gene Database (.gdb)==
+
 
+
* Select File > Export to GenMAPP Gene Database...
+
* Type a name in the Owner field (or else it won't let you export).
+
** When doing the individual exercise for the [[Week 9]] assignment, use your own name.  When doing this for your team project, use your team's name.
+
* GenMAPP Builder scans your PostgreSQL database to see what species are available.  Click on the species that you would like to export, then click ''Next'' to continue.
+
* Create GenMAPP Database: click on the "Save GenMAPP Database File As..." button.
+
** In the Save dialog box that appears, navigate to the "T:" drive, and then modify the default file name by appending your initials.  Click the "Save" button
+
* Leave the boxes checked for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
+
* Click the "Next" button to begin teh export process.
+
** Record the starting and ending times from the black console window.  This will take 1-2 hours for a typical bacterial genome, depending on the size of the database, the processor speed, and available memory.  Large eukaryotic genomes (like ''Arabidopsis thaliana'') or genomes with many GO annotations (like ''Saccharomyces cerevisiae'') can take much longer, in the range of 12-24 hours.
+
**NOTE:  the progress bar is not accurate.
+
***Start Time: 4:44 PM (restarted export after class)
+
***End Time: 6:11 PM
+
***Elapsed Time: 1 hour, 27 minutes
+
  
 +
* I selected File > Export to GenMAPP Gene Database... to begin the export process.
 +
* I typed my name in the owner field (Brandon Klein).
 +
* I selected''Vibrio cholerae'' as the gene database species and then clicked ''Next''.
 +
* The database was saved as ''Vc-Std_BK_20151027''.
 +
* I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
 +
* Finally, I clicked the "Next" button to begin the export process.
  
 
==Gene Database Testing Report==
 
==Gene Database Testing Report==
Line 158: Line 91:
 
Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room  
 
Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room  
  
Postgres Database name: V.cholerae_20151027_gmb3build5
+
Postgres Database name: ''V.cholerae_20151027_gmb3build5''
  
 
UniProt XML filename: [[Media:Uniprot-organism-243277.zip|uniprot-organism-243277_BK_20151027.xml]]
 
UniProt XML filename: [[Media:Uniprot-organism-243277.zip|uniprot-organism-243277_BK_20151027.xml]]
 
* UniProt XML version: UniProt release 2015_10 (version information can be accessed on [http://uniprot.org/news the UniProt News Page])  
 
* UniProt XML version: UniProt release 2015_10 (version information can be accessed on [http://uniprot.org/news the UniProt News Page])  
* UniProt XML download link: [[http://www.uniprot.org/uniprot/?query=organism:243277]]
+
* UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
 
* Time taken to import: 2.92 minutes
 
* Time taken to import: 2.92 minutes
** Note:  
+
** Note: The import time was similar to my homework partner's, and no interruptions occurred during this process.
  
 
GO OBO-XML filename: [[Media:Go daily-termdb.zip|go_daily-termdb_BK_20151027.obo-xml]]
 
GO OBO-XML filename: [[Media:Go daily-termdb.zip|go_daily-termdb_BK_20151027.obo-xml]]
 
* GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
 
* GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads]
+
* GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
 
* Time taken to import: 6.88 minutes
 
* Time taken to import: 6.88 minutes
 
* Time taken to process: 4.49 minutes
 
* Time taken to process: 4.49 minutes
** Note:
+
** Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.
 
+
 
GOA filename: [[Media:46.V cholerae ATCC 39315.zip|46.V_cholerae_ATCC_39315_BK_20151027.goa]]
 
GOA filename: [[Media:46.V cholerae ATCC 39315.zip|46.V_cholerae_ATCC_39315_BK_20151027.goa]]
 
* GOA version: File last modified on Wednesday, October 14, 2015 (accessed on the [http://www.ebi.ac.uk/GOA/ UniProt-GOA news page])
 
* GOA version: File last modified on Wednesday, October 14, 2015 (accessed on the [http://www.ebi.ac.uk/GOA/ UniProt-GOA news page])
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa]
+
* GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
 
* Time taken to import: 0.06 minutes
 
* Time taken to import: 0.06 minutes
** Note:
+
** Note: This import process was equally quick for my homework partner and presented no complications.
 
+
 
Name of .gdb file: [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]]
 
Name of .gdb file: [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]]
 
* Time taken to export: 1 hour, 27 minutes  
 
* Time taken to export: 1 hour, 27 minutes  
 
** Start time: 4:44 PM (restarted export after class)
 
** Start time: 4:44 PM (restarted export after class)
 
** End time: 6:11 PM
 
** End time: 6:11 PM
 
+
**Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.
Note:
+
  
 
===TallyEngine===
 
===TallyEngine===
  
* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
+
* I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
+
** I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
** Take a screenshot of the results.  Upload the image to the wiki and display it on this page.
+
**[[File:TallyresultsBK1029.png]]
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]
+
***XML and database counts were consistent among all assessed data types.
 +
***Note: the "Ordered Locus" count was 3831.
  
 
===Using XMLPipeDB match to Validate the XML Results from the TallyEngine===
 
===Using XMLPipeDB match to Validate the XML Results from the TallyEngine===
  
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
+
On the command line (''cmd'' on Windows), I entered the folder ''Bklein7_Week9'' to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:
  
Are your results the same as you got for the TallyEngine? Why or why not?
+
  java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >     
 +
OrderedLocusNames.txt
  
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
+
The results of this match are pictured below.
 +
*[[File:XmlpipedbmatchoutputBK1029.png]]
  
For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]
+
*The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
 +
*I investigated this discrepancy by opening the gene database file ''Vc-Std_BK_20151027.gdb'' in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
 +
**Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.
  
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data.  Here&rsquo;s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table.  A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
+
The second match I ran accounted for the issues above. The command sequence used is as follows:
  
  select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
+
  java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" 
 +
> OrderedLocusNames.txt
  
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
+
The results of this match are pictured below.
 +
*[[File:XmlpipedbmatchoutputBK1029 Updated.png]]
  
[[Image:Pgadminiii-query.png]]
+
*With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
 +
**Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.
  
Are your results the same as reported by the TallyEngine?  Why or why not?
+
===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
  
===OriginalRowCounts Comparison===
+
The ''gene/name'' tags in the XML file land in the ''genenametype'' table, which contains ''ordered locus'' values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:
  
Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
  
Benchmark .gdb file:
+
The results of this query are pictured below.
 +
[[File:PostgresIDsBK1029.png]]
  
Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:
+
*The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
 +
**As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.
  
Note:
+
I ran a second SQL query to include the pattern VC_A#### in the count output:
  
===Visual Inspection===
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
  
Perform visual inspection of individual tables to see if there are any problems.
+
The results of this query are pictured below.
 +
[[File:PostgresIDsBK1029 UPDATED.png]]
  
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
+
*With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
+
**These findings further validated the Tally Engine results.
  
Note:
+
===OriginalRowCounts Comparison===
  
===.gdb Use in GenMAPP===
+
I opened the gene database file [[Media:Vc-Std 20151027.zip|Vc-Std_BK_20151027.gdb]] in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the ''OriginalRowCounts'' table.
 +
*A listing named "OrderedLocusNames" was present.
 +
*The table reported 7664 distinct Ordered Locus Names.
 +
**This value is equal to (3831*2)+1
  
Note:
+
Benchmark .gdb file: [[File:Vc-Std External 20101022.gdb]]
  
===Putting a gene on the MAPP using the GeneFinder window===
+
As a sanity check, I compared the information in this table to the ''OriginalRowCounts'' table present in the 2010 Vibrio cholerae gene database that we worked with last week.
 +
*The "OrderedLocusNames" listing was also present here.
 +
*The table reported the same number of distinct Ordered Locus Names: 7664.
 +
*Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
 +
*Overall, 10 more tables were listed in the 2015 database.
  
* Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.
+
OriginalRowCounts table from the benchmark and new gdb:
 +
*[[File:OriginalRowCounts ComparisonImage.PNG]]
  
Note:
+
Note: The ''OriginalRowCounts'' tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the ''OriginalRowCounts'' tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.
  
===Creating an Expression Dataset in the Expression Dataset Manager===
+
===Visual Inspection===
  
* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
+
I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.
  
Note:
+
* First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
 +
**Of the 35 gene ID systems, only 11 have dates listed in the date column.
 +
*** Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
 +
*** The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
 +
* Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
 +
**UniProt
 +
***The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website [http://www.uniprot.org/help/accession_numbers here].
 +
***All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): [[File:UniProt Ascension Number info.PNG]]
 +
***Thus, there are no issues with these 3789 gene IDs.
 +
**RefSeq
 +
***The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
 +
***"NP_" and "YP_" Prefixes
 +
****Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
 +
***"WP_" Prefixes
 +
****Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
 +
***Overall, every entry in the ID column was an expected value.
 +
**OrderedLocusNames
 +
***7664 ID entries were present that fell into 4 different patterns:
 +
****VC_####
 +
****VC_A####
 +
****VC####
 +
****VCA####
 +
***This ID column was analyzed in Microsoft Excel to investigate why there were 7664 entries present (as opposed to the consistent count of 3831 in SQL and XMLPipeDB Match).
 +
****The analysis in Excel revealed that the variations of the the gene IDs with and without underscores were replicates of one another. Presumably, these gene IDs were duplicated during the process of exporting the gene database to account for a formatting issue pertaining to underscores. Regardless, this discovery means that only half of the 7664 entries were unique gene IDs. This leaves us with '''3832''' unique IDs, which is still one more than was expected.
 +
****Further analysis in Excel revealed that both the XML and SQL gene ID's were off by one due to errors, resulting in the erroneous count of 3831. The XML entries did not include the gene ID '''VC_A0360.1''' (the period presented a formatting issue). The SQL entries did include VC_A0360.1, but an error occurred where two gene IDs were merged into one: '''VC_1738/VC_1739'''. The overall number of unique IDs is 3832 when these errors errors are taken into consideration.
  
===Coloring a MAPP with expression data===
+
Note: The Excel analysis performed above was done following the procedure present on this page- [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Using_Microsoft_Excel_to_Compare_ID_Lists Using Microsoft Excel to Compare ID Lists]. The Excel file I used for the analysis itself can be accessed here: [[File:OrderedLocusNames InvestigationBK1029.xlsx]].
  
Note:
+
==Vc-Std_BK_20151027.gdb Use in GenMAPP==
  
===Running MAPPFinder===
+
Note: The ability of ''Vc-Std_BK_20151027.gdb'' to be processed was tested in GenMAPP Version 2.1. Before conducting any of the testing below, ''Vc-Std_BK_20151027.gdb'' was selected as the "Gene Database" withing GenMAPP.
  
Note:
+
===Putting a gene on the MAPP using the GeneFinder window===
 +
To figure out how to perform this function, I referenced the "Gene Finder" tab within the "GenMAPP Help" window.
  
 +
*The following sample IDs were taken from different gene ID systems and tested:
 +
**UniProt- Q9KQM0
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:UniProt Gene Backpage.png]]
 +
**RefSeq- NP_230721
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:Refseq Gene Backpage.png]]
 +
**OrderedLocusNames- VC_0274
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:Orderedlocusnames Gene Backpage.png]]
 +
**GeneID- 2612551
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:GeneID Gene Backpage.png]]
 +
**EnsemblBacteria- AAF96503
 +
***The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
 +
***All expected cross-referenced IDs were present.
 +
***[[File:EnsemblBacteria Gene Backpage.png]]
 +
*Screenshot of all of the sample ID's on a MAPP:
 +
**[[File:All genes on MAPP.png]]
  
===Compare Gene Database to Outside Resource===
+
Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.
  
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML.  It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.)  Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
+
===Creating an Expression Dataset in the Expression Dataset Manager===
 +
The file [[File:Merrell Compiled Raw Data Vibrio BK 20151015- Tab Delimited.txt]] from my Week 8 assignment page was used to create a new Expression Dataset.
  
Note:
+
* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset.  Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
  
==Checking the Quality of your Exported Gene Database==
+
*5100 of the 5221 gene IDs from the original text file were imported into the Expression Dataset.
 +
*There were 121 exceptions.
 +
*Within the EX.txt file, the error message "Gene not found in OrderedLocusNames or any related system." was listed 121 times.
 +
*Two examples of a genes that encountered the above error message were the genes with the following OrderedLocusNames (in the EX.txt file): VC2209 and VCA1031.
 +
**When both of these IDs were searched in the Microsoft Access "OrderedLocusNames" table, no results were retrieved.
 +
**When both of these IDs were searched in the XMLPipeDB Match results file created earlier in this protocol, no results were retrieved.
 +
**Consequentially, it does not appear that the genes that retrieved error messages were present in the original UniProt XML file.
 +
***This finding would be explain the observation that individuals using the 2010 V. cholerae gene database last week, such as my Week 8 partner Veronica, also retrieved 121 errors during this step.
  
* It is a good idea to check the quality of your exported Gene Database to make sure that all of the data from the XML files made it into the PostgreSQL database and was then exported to the GenMAPP Gene Database. We have created a [[Gene Database Testing Report Sample]] to help guide you through this process.
+
Note: 121 errors were encountered during the creation of this new Expression Dataset. A screenshot of the error message is shown here:
== Tally Engine ==
+
*[[File:GenMAPP Errors 20151102.png]]
  
The first tool, called the Tally Engine, can be used for verifying that certain data from the XML file transferred consistently into the PostgreSQL database upon import.  The Tally Engine can be found in GenMAPP Builder itself.
+
===Coloring a MAPP with expression data===
  
# Run PostgreSQL (via pgAdmin III on Windows) and make sure that your database is up and running.
+
* I customized the new Expression Dataset by creating a Color Set. The new Color Set was entitled "LogFoldChange".
# Run GenMAPP Builder and make sure that it is connected to the database (via ''Configure Database...'').
+
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
# After performing an import, choose ''Run XML and Database Tallies for UniProt and GO...''.
+
***I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
# Choose the UniProt and GO files that you imported.
+
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
# You should see a table for selected data items, and how many of each were found.
+
***I selected the color for this criterion as red using the color box.
 +
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>
 +
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
 +
***I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
 +
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
 +
***I selected the color for this criterion as green using the color box.
 +
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
 +
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.
  
[[Image:Tally-results.png]]
+
Screenshot of Color Set criteria:
 +
*[[File:Colorset.png]]
  
Under the hood, the Tally Engine bases its XML counts on certain XML tags, and bases its database counts on SQL queries using ''count''.  This tool is thus primarily useful for making sure that the &ldquo;raw&rdquo; import worked without any errors or glitches.
+
Note: No errors were encountered in the creation of the Color Set.
  
My Tally Results:
+
===Running MAPPFinder===
  
[[File:TallyresultsBK1029.png]]
+
* I launched the MAPPFinder program from within GenMAPP and ensured that the ''Vc-Std_BK_20151027.gdb'' Gene Database was still loaded into GenMAPP.
 +
* I clicked on the button "Calculate New Results" followed by "Find File", at which point I the .gex file updated during the creation of the "LogFoldChange" color set.
 +
* I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange Color Set to the data.
 +
* I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
 +
**This analysis took several minutes to complete.
 +
*MAPPFinder analysis results
 +
**Initial result in MAPPFinder Browser:
 +
***[[File:MAPPFinder Result BK20151103.png]]
 +
**I clicked on the menu item "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
 +
**[[File:MAPPFinder RankedList BK20151103.png]]
 +
***As an aside, it is interesting to note that few of the top 10 gene ontology results from this MAPPFinder analysis were conserved from my analysis using the 2009 ''Vibrio cholerae'' gene database. This suggests the high degree of change and progress made in genomics in the last six years.
  
== XMLPipeDB Match ==
+
Note: The MAPPFinder analysis took approximately 16 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the gene database created this week.
  
XMLPipeDB Match is useful for counting data in files.  Thus, in our context, you would use XMLPipeDB Match to tally stuff in XML files, with greater flexibility than with Tally Engine.
+
===Compare Gene Database to Outside Resource===
 
+
I will get back to this question during the Group Final Project, as was allowed:
You will have to use XMLPipeDB Match from the command line.  In addition, you can use this in any platform (as you have seen).  Download the application from the [http://sourceforge.net/projects/xmlpipedb XMLPipeDB SourceForge site] and take note of the location of the ''xmlpipedb-match-1.1.1.jar''. Then, on the command line (''Terminal'' for Linux and Mac OS X, ''cmd'' on Windows), ''cd'' to the folder containing the XML file that you would like to check.  Use XMLPipeDB Match as follows, with the parts in parentheses varying depending on your specific setup, desired pattern, and file being scanned:
+
  This section applies to the Group Final Project and does not need to be completed for the  
 
+
  Week 9 assignment. Kdahlquist (talk) 15:46, 2 November 2015 (PST)
java -jar (location-of-jar) "(pattern)" < (XML file)
+
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XMLIt is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
 
+
On a Windows machine, with XMLPipeDB Match and a ''Vibrio cholerae'' XML file located on the Desktop, scanning for IDs of the form '''VC_####''', where '''#''' represents a digit from 0 to 9, one would type, after ''cd''-ing to the Desktop:
+
 
+
  java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-taxonomy%3A243277.xml
+
 
+
As you have seen before, this will give you a list of unique matches, with a total number at the bottom.
+
 
+
The trick with XMLPipeDB Match is to use the patterns well: with the database project, you will mainly be matching IDsA desired count is an XMLPipeDB Match result whose matched ID pattern corresponds to the number of IDs found by the Tally Engine.
+
 
+
== SQL ==
+
 
+
You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query.  This requires some knowledge of which table received what data.  Here&rsquo;s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table.  A query on this table counting values from this table that were marked as ''ordered locus'' in the XML file matching the pattern ''VC_[0-9][0-9][0-9][0-9]'' would look like this:
+
 
+
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';
+
 
+
In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.
+
 
+
[[Image:pgadminiii-query.png]]
+
 
+
== Microsoft Access ==
+
 
+
For the GenMAPP Gene Database, you can open the ''.gdb'' in Microsoft Access and navigate its tables to find counts for various IDsOpening the table, noting its size, and doing some sorting may help.  You can also look at the ''OriginalRowCounts'' table for a summary of totals.
+
 
+
Again, the ideal situation is a correspondence in these numbers with what you found in XML and the relational database.
+
 
+
== Back to the Command Line ==
+
 
+
Amidst all this, you can still use ''grep'' and ''wc'' on the command line for some basic counting.  Just remember that these tools work on a line-by-line basis; useful in some cases, but not useful in others.
+
 
+
You can use ''grep'' and ''wc'' with the various files on the ''my.cs.lmu.edu'' server by using the ''curl -O'' command shown in the [[Week 6]] assignment. Upload your data files to the wiki, place media links to them on your wiki page, then mouse over those live links to capture their URL ([[Media:Movie.sql.txt|like this—look at the source to see the wiki markup]]), then use ''curl -O (whatever-the-url-is)'' while ''ssh''-ed to ''my.cs.lmu.edu'' to bring that file into the server.
+
 
+
* If the file is a ''.zip'' file, you can use ''unzip'' at the command line to unzip it.
+
* If the file is a ''.gz'' file, you can use ''gunzip'' at the command line to uncompress that one.
+
 
+
<!-- Future content: advanced techniques for diff'ing grep/match results against psql results, duly processed by sed. -->
+
  
 
==Links==
 
==Links==

Latest revision as of 22:31, 23 November 2015

Files Asked for in the Gene Database Testing Report

For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file: File:Bklein7 Week9 Files.zip.

Pre-requisites

The following set of software was used in the creation and testing of the Vibrio cholerae gene database:

  1. 7-ziptool that for unpacking .gz and .zip files
  2. PostgreSQL on Windows (version 9.4.x)
  3. GenMAPP Builder
  4. Java JDK 1.8 64-bit
  5. GenMAPP 2
  6. XMLPipeDB match utility for counting IDs in XML files
  7. Microsoft Access for reading .mdb files

Gene Database Creation

Downloading Data Source Files and GenMAPP Builder

  • I download the UniProt XML, GOA, and GO OBO-XML files for Vibrio cholerae along with the GenMAPP Builder program.
    • All files were saved to the folder Bklein7_Week9 on my computer's ThawSpace.
    • Files that required extraction were unzipped using 7-zip.
    • Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

UniProt XML

GOA

GO OBO-XML

Downloaded GenMAPP Builder

  1. I downloaded the GenMAPP Builder zip folder: Download gmbuilder-3.0.0-build-5.zip.
  2. I extracted the GenMAPP Builder folder using 7-zip.

Creating the New Database in PostgreSQL

  • I launched pgAdmin III and connected to the PostgreSQL 9.4 server (localhost:5432).
    • On this server, I created a new database: V.cholerae_20151027_gmb3build5.
    • I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
      • I clicked on the Open File icon and selected the file gmbuilder.sql. This imported a series of SQL commands into the editor tab.
      • I clicked on the Execute Query icon to run this command.
      • In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

Configuring GenMAPP Builder to Connect to the PostgreSQL Database

  • To begin, I launched gmbuilder.bat.
  • I selected the "Configure Database" option and entered the following information into the fields below:
    • Host or address: localhost
    • Port number: 5432
    • Database name: V.cholerae_20151027_gmb3build5
    • Username: postgres
    • Password: Welcome1

Importing Data into the PostgreSQL Database

  • The downloaded data files for Vibrio cholerea were specified and imported into the database by clicking on the following buttons:
    • Selected File > Import UniProt XML...
    • Selected File > Import GO OBO-XML...
    • Clicked OK to the message asking to process the GO data.
    • Selected File > Import GOA...

Exporting a GenMAPP Gene Database (.gdb)

  • I selected File > Export to GenMAPP Gene Database... to begin the export process.
  • I typed my name in the owner field (Brandon Klein).
  • I selectedVibrio cholerae as the gene database species and then clicked Next.
  • The database was saved as Vc-Std_BK_20151027.
  • I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
  • Finally, I clicked the "Next" button to begin the export process.

Gene Database Testing Report

Export Information

Version of GenMAPP Builder: Version 3.0.0 Build 5

Computer on which export was run: Seaver 120- Last computer on the right in the row closest to the front of the room

Postgres Database name: V.cholerae_20151027_gmb3build5

UniProt XML filename: uniprot-organism-243277_BK_20151027.xml

GO OBO-XML filename: go_daily-termdb_BK_20151027.obo-xml

  • GO OBO-XML version: File last modified on ‎Tuesday, ‎October ‎27, ‎2015, ‏‎2:24:02 AM (accessed from the file properties window)
  • GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
  • Time taken to import: 6.88 minutes
  • Time taken to process: 4.49 minutes
    • Note: The import and processing times were once again similar to my homework partner's, and no interruptions occurred during these processes.

GOA filename: 46.V_cholerae_ATCC_39315_BK_20151027.goa

Name of .gdb file: Vc-Std_BK_20151027.gdb

  • Time taken to export: 1 hour, 27 minutes
    • Start time: 4:44 PM (restarted export after class)
    • End time: 6:11 PM
    • Note: Upon checking on my export progress after class, I found out that the export had prematurely stopped. Therefore, I restarted the process. When I returned to the S120 computer lab on 10/29, the export had completed successfully and all windows remained open.

TallyEngine

  • I ran the Tally Engine in GenMAPP Builder to process the number of records for UniProt and GO in the XML data and in the Postgres databases. This was done to verify that the XML data transferred consistently into the PostgreSQL database.
    • I selected the menu item Tallies > Run XML and Database Tallies for UniProt and GO... to run the Tally Engine. The results are pictured below:
    • TallyresultsBK1029.png
      • XML and database counts were consistent among all assessed data types.
      • Note: the "Ordered Locus" count was 3831.

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

On the command line (cmd on Windows), I entered the folder Bklein7_Week9 to access the XML data files. The first match I ran was designed to identify matches of any ordered locus name following the pattern VC_#### in the UniProt XML file. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml" >       
OrderedLocusNames.txt

The results of this match are pictured below.

  • XmlpipedbmatchoutputBK1029.png
  • The number of unique matches yielded by XMLPipeDB Match, 2738, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries.
  • I investigated this discrepancy by opening the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access. Within Access, I opened the "OrdereLocusNames" table.
    • Quickly skimming through this tablerevealed a large number of Ordered Locus entries that did not follow the pattern VC_####. Instead, these entries followed the pattern VC_A####. Therefore, I edited the pattern I input into XMLPipeDB Match.

The second match I ran accounted for the issues above. The command sequence used is as follows:

java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < "uniprot-taxonomy%3A243277.xml"   
> OrderedLocusNames.txt

The results of this match are pictured below.

  • XmlpipedbmatchoutputBK1029 Updated.png
  • With this updated pattern, XMLPipeDB yielded 3831 unique matches. This matched the Tally Engine results.
    • Upon accounting for the alternate Ordered Locus entry pattern, XMLPipeDB validated the results of Tally Engine by calculating the same number of entries in the source data files used in the SQL database.

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

The gene/name tags in the XML file land in the genenametype table, which contains ordered locus values. As in XMLPipeDB Match, I started with a SQL query designed to match the pattern VC_####:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029.png

  • The number of unique matches yielded by this SQL query, 2737, was significantly lower than the Tally Engine output of 3831 Ordered Locus entries. Further, it had one less entry than the original XMLPipeDB query.
    • As with the first XMLPipeDB match, the main issue here was the fact that I did not account for the occasional presence of the letter "A" in the Ordered Locus entries.

I ran a second SQL query to include the pattern VC_A#### in the count output:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results of this query are pictured below. PostgresIDsBK1029 UPDATED.png

  • With the secondary pattern included, this SQL query yielded 3831 unique matches. This matched the Tally Engine results as well as the updated XMLPipeDB Match results.
    • These findings further validated the Tally Engine results.

OriginalRowCounts Comparison

I opened the gene database file Vc-Std_BK_20151027.gdb in Microsoft Access to see if the the expected tables were listed with the expected number of records. This information was accessed in the OriginalRowCounts table.

  • A listing named "OrderedLocusNames" was present.
  • The table reported 7664 distinct Ordered Locus Names.
    • This value is equal to (3831*2)+1

Benchmark .gdb file: File:Vc-Std External 20101022.gdb

As a sanity check, I compared the information in this table to the OriginalRowCounts table present in the 2010 Vibrio cholerae gene database that we worked with last week.

  • The "OrderedLocusNames" listing was also present here.
  • The table reported the same number of distinct Ordered Locus Names: 7664.
  • Although the above value was the same, row counts for various most of the other tables such as "GeneOntologyTree" dramatically increased between the 2010 and 2015 databases.
  • Overall, 10 more tables were listed in the 2015 database.

OriginalRowCounts table from the benchmark and new gdb:

  • OriginalRowCounts ComparisonImage.PNG

Note: The OriginalRowCounts tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the OriginalRowCounts tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.

Visual Inspection

I visually inspected individual tables within the gene database using Microsoft Access to see if there were any problems.

  • First, I looked at the Systems table. This table includes columns detailing 35 different gene ID systems from which data was accessed.
    • Of the 35 gene ID systems, only 11 have dates listed in the date column.
      • Those that do have values entered here all share the same date: 10/27/2015. This was the day on which the gene database was created, suggesting that these entries serve as markers for when the raw data was accessed from the gene ID systems' sites.
      • The fact that the other 24 gene ID systems do not have dates listed is problematic, as older versions of the data stored on these systems may have been accessed without this verification stamp.
  • Next, I opened the UniProt, RefSeq, and OrderedLocusNames tables to see if the listed IDs were listed in the correct forms.
    • UniProt
      • The UniProt ID column includes accession numbers from the UniProt gene entries. The different possibilities for UniProt accession numbers are detailed on the UniProt website here.
      • All ID's in the UniProt table conform to the following pattern (originally found on the page linked to above): UniProt Ascension Number info.PNG
      • Thus, there are no issues with these 3789 gene IDs.
    • RefSeq
      • The RefSeq ID's are series of numbers with one of three prefixes: "NP_", "YP_", and "WP_". The meanings of these prefixes can be found in the RefSeq documentation found here.
      • "NP_" and "YP_" Prefixes
        • Refer to proteins. There are 3329 ID's with these prefixes (3328 for NP_ and 1 for YP_).
      • "WP_" Prefixes
        • Refer to " autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3328 ID's with these prefixes.
      • Overall, every entry in the ID column was an expected value.
    • OrderedLocusNames
      • 7664 ID entries were present that fell into 4 different patterns:
        • VC_####
        • VC_A####
        • VC####
        • VCA####
      • This ID column was analyzed in Microsoft Excel to investigate why there were 7664 entries present (as opposed to the consistent count of 3831 in SQL and XMLPipeDB Match).
        • The analysis in Excel revealed that the variations of the the gene IDs with and without underscores were replicates of one another. Presumably, these gene IDs were duplicated during the process of exporting the gene database to account for a formatting issue pertaining to underscores. Regardless, this discovery means that only half of the 7664 entries were unique gene IDs. This leaves us with 3832 unique IDs, which is still one more than was expected.
        • Further analysis in Excel revealed that both the XML and SQL gene ID's were off by one due to errors, resulting in the erroneous count of 3831. The XML entries did not include the gene ID VC_A0360.1 (the period presented a formatting issue). The SQL entries did include VC_A0360.1, but an error occurred where two gene IDs were merged into one: VC_1738/VC_1739. The overall number of unique IDs is 3832 when these errors errors are taken into consideration.

Note: The Excel analysis performed above was done following the procedure present on this page- Using Microsoft Excel to Compare ID Lists. The Excel file I used for the analysis itself can be accessed here: File:OrderedLocusNames InvestigationBK1029.xlsx.

Vc-Std_BK_20151027.gdb Use in GenMAPP

Note: The ability of Vc-Std_BK_20151027.gdb to be processed was tested in GenMAPP Version 2.1. Before conducting any of the testing below, Vc-Std_BK_20151027.gdb was selected as the "Gene Database" withing GenMAPP.

Putting a gene on the MAPP using the GeneFinder window

To figure out how to perform this function, I referenced the "Gene Finder" tab within the "GenMAPP Help" window.

  • The following sample IDs were taken from different gene ID systems and tested:
    • UniProt- Q9KQM0
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • UniProt Gene Backpage.png
    • RefSeq- NP_230721
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • Refseq Gene Backpage.png
    • OrderedLocusNames- VC_0274
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • Orderedlocusnames Gene Backpage.png
    • GeneID- 2612551
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • GeneID Gene Backpage.png
    • EnsemblBacteria- AAF96503
      • The Backpage for this gene included links to the following gene ID systems: UniProt, GeneOntology, OrderedLocusNames, EnsemblBacteria, EMBL, InterPro, GeneID, Pfam, and RefSeq.
      • All expected cross-referenced IDs were present.
      • EnsemblBacteria Gene Backpage.png
  • Screenshot of all of the sample ID's on a MAPP:
    • All genes on MAPP.png

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

Creating an Expression Dataset in the Expression Dataset Manager

The file File:Merrell Compiled Raw Data Vibrio BK 20151015- Tab Delimited.txt from my Week 8 assignment page was used to create a new Expression Dataset.

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
  • 5100 of the 5221 gene IDs from the original text file were imported into the Expression Dataset.
  • There were 121 exceptions.
  • Within the EX.txt file, the error message "Gene not found in OrderedLocusNames or any related system." was listed 121 times.
  • Two examples of a genes that encountered the above error message were the genes with the following OrderedLocusNames (in the EX.txt file): VC2209 and VCA1031.
    • When both of these IDs were searched in the Microsoft Access "OrderedLocusNames" table, no results were retrieved.
    • When both of these IDs were searched in the XMLPipeDB Match results file created earlier in this protocol, no results were retrieved.
    • Consequentially, it does not appear that the genes that retrieved error messages were present in the original UniProt XML file.
      • This finding would be explain the observation that individuals using the 2010 V. cholerae gene database last week, such as my Week 8 partner Veronica, also retrieved 121 errors during this step.

Note: 121 errors were encountered during the creation of this new Expression Dataset. A screenshot of the error message is shown here:

  • GenMAPP Errors 20151102.png

Coloring a MAPP with expression data

  • I customized the new Expression Dataset by creating a Color Set. The new Color Set was entitled "LogFoldChange".
    • First, I created a criterion for this color set to label genes that demonstrated a significant increase in their expression.
      • I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
      • I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
      • I selected the color for this criterion as red using the color box.
      • I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    • Second, I created a criterion for this color set to label genes that demonstrated a significant decrease in their expression.
      • I specified the Gene value as "Avg_LogFC_all" for the Vibrio dataset.
      • I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
      • I selected the color for this criterion as green using the color box.
      • I stated the criterion as follows and added it to the Criteria List: [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
  • Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:

  • Colorset.png

Note: No errors were encountered in the creation of the Color Set.

Running MAPPFinder

  • I launched the MAPPFinder program from within GenMAPP and ensured that the Vc-Std_BK_20151027.gdb Gene Database was still loaded into GenMAPP.
  • I clicked on the button "Calculate New Results" followed by "Find File", at which point I the .gex file updated during the creation of the "LogFoldChange" color set.
  • I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange Color Set to the data.
  • I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
    • This analysis took several minutes to complete.
  • MAPPFinder analysis results
    • Initial result in MAPPFinder Browser:
      • MAPPFinder Result BK20151103.png
    • I clicked on the menu item "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
    • MAPPFinder RankedList BK20151103.png
      • As an aside, it is interesting to note that few of the top 10 gene ontology results from this MAPPFinder analysis were conserved from my analysis using the 2009 Vibrio cholerae gene database. This suggests the high degree of change and progress made in genomics in the last six years.

Note: The MAPPFinder analysis took approximately 16 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the gene database created this week.

Compare Gene Database to Outside Resource

I will get back to this question during the Group Final Project, as was allowed:

This section applies to the Group Final Project and does not need to be completed for the 
Week 9 assignment. —  Kdahlquist (talk) 15:46, 2 November 2015 (PST)

The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.) Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries