Difference between revisions of "Troque Week 9"

From LMU BioDB 2015
(OriginalRowCounts Comparison: Finished this section)
(Visual Inspection: Finished this section)

Revision as of 06:23, 3 November 2015

User Page        Bio Databases Main Page       


Export Information

Version of GenMAPP Builder:

  • gmbuilder-3.0.0-build-5

Computer on which export was run:

  • The right-most computer at the very back of the room when facing the wide whiteboard/the computer at the back row that is closest to the door.

Postgres Database name:

  • V_cholerae_20151027_gmb-5_TR

UniProt XML filename (give filename and upload and link to compressed file):

GO OBO-XML filename (give filename and upload and link to compressed file):

GOA filename (give filename and upload and link to compressed file):

Name of .gdb file (give filename and upload and link to compressed file):

  • Time taken to export: 5 hours, 33 minutes and 40 seconds
    • Start time: 3:52:13 PM PDT
    • End time: 9:25:53 PM PDT
    • Note: I was out of town (at a research conference in the D.C./Maryland area), so I couldn't check when the export finished. Fortunately, I had logged in as myself (instead of using the general student account) on the computer, which preserved the windows I had open before I left.
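
As a quick sanity check on the export duration, the elapsed time can be recomputed from the recorded start and end times. This is only a minimal sketch that assumes both timestamps fall on the same day and in the same time zone.

```python
from datetime import datetime

# Start and end times as recorded above (assumed same day, same time zone).
start = datetime.strptime("3:52:13 PM", "%I:%M:%S %p")
end = datetime.strptime("9:25:53 PM", "%I:%M:%S %p")

# Break the difference down into hours, minutes and seconds.
elapsed = end - start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours} hours, {minutes} minutes and {seconds} seconds")
```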

TallyEngine

  • Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
    • Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
    • Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
    • Here is the screenshot of the tally result:

Tally results TR 20151102.jpg

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

  • Note: When I tried downloading the jar file, it wasn't in the zip file that I downloaded when I clicked the link provided in the protocol. However, thanks to Kevin Wyllie who gave me the file, I managed to acquire it.
  • Note: I created my own directory (called TrixRoq) in order to store all the files I downloaded into one single folder to differentiate it from the rest of the files in the ThawSpace disk. I also made a directory (called Working Data) in the ThawSpace0 disk in order to separate the source data from the data I worked with (per class suggestions from Dr. Dahlquist).
  • As a result, I had to cd into these directories before running Match.
    • In order to change into the ThawSpace0\TrixRoq\Working Data directory, use the following commands on the command prompt window:
T: && cd "TrixRoq\Working Data"
  • The command I used once inside that directory was:
java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml
  • The results are as follows:

Command prompt results TR 20151102.jpg

These results did not match up with what the TallyEngine gave (TallyEngine: 3831 vs. Match: 2738)

  • The command therefore had to be modified so that the numbers match. The original pattern only accounts for IDs of the form VC_####, so IDs of the form VC_A#### were missed. The pattern VC_A?#### matches both forms. The new command is as follows:
   java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml > command_prompt_results_TR.txt

which gave the results:

Command prompt results updated TR 20151102.jpg

The output was saved to the file "command_prompt_results_TR.txt".
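
To illustrate why the first pattern undercounts, here is a rough Python sketch (not the Match utility itself, and not its output format). The sample text is hypothetical; the real input is the UniProt XML file. The helper name count_matches is my own.

```python
import re

# Hypothetical sample text standing in for the UniProt XML file.
sample = "VC_0001 VC_0002 VC_A0001 VC_0001 VC_A0002"

def count_matches(pattern, text):
    """Return (total matches, unique matches) for a regular expression."""
    hits = re.findall(pattern, text)
    return len(hits), len(set(hits))

# The narrow pattern requires a digit right after "VC_", so it skips VC_A#### IDs.
narrow = count_matches(r"VC_[0-9][0-9][0-9][0-9]", sample)
# "A?" makes the letter optional, so both ID forms are matched.
broad = count_matches(r"VC_A?[0-9][0-9][0-9][0-9]", sample)

print("narrow:", narrow)
print("broad:", broad)
```

On this sample, the narrow pattern finds only the VC_#### IDs, while the broad pattern also picks up the VC_A#### IDs.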

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Since I already knew to look for the pattern VC_A?#### instead of just VC_####, I ran the following query in the pgAdmin window:

 select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results are as follows:

Postgresql results TR 20151102.jpg

The result matches both the TallyEngine and Match results (i.e., a count of 3831).
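
The logic of the query can be tried out without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption on my part: SQLite has no `~` operator, so a REGEXP function is registered to play the same role as an unanchored regular expression match. The table and column names come from the query above; the rows are hypothetical.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
# re.search is unanchored, similar to the PostgreSQL ~ operator.
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)

conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
rows = [  # hypothetical rows for illustration only
    ("ordered locus", "VC_0001"),
    ("ordered locus", "VC_A0001"),
    ("ORF", "VC_0002"),
    ("ordered locus", "other"),
]
conn.executemany("INSERT INTO genenametype VALUES (?, ?)", rows)

count = conn.execute(
    "SELECT count(*) FROM genenametype "
    "WHERE type = 'ordered locus' AND value REGEXP 'VC_A?[0-9][0-9][0-9][0-9]'"
).fetchone()[0]
print(count)
```

Only rows that are of type 'ordered locus' and contain a matching ID are counted.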

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file. Note: I used the Microsoft Access 2010 program in order to look at the GDB files.

Benchmark .gdb file (I used the 2010 file because it was the more recent and more up-to-date benchmark): Vc-Std_External_20101022

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

  • Benchmark GDB file:

Gdb originalrowcounts results TR 20151102.jpg

  • New GDB file:

Gdb originalrowcounts new results TR 20151102.jpg

Note:

  • There are noticeably more rows in the new GDB file than in the benchmark GDB file.
  • For the rows that exist in both files, the counts in the new GDB file are either the same as or higher than those in the 2010 benchmark GDB file.

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

  • Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
    • Some of the gene ID systems don't have a date in the Date field in the new GDB file.

System table result new TR 20151102.jpg

  • The 2010 benchmark also did not have a date in the Date field for every gene ID system.

System table result benchmark TR 20151102.jpg

  • Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
    • In the UniProt table, there are too many variations in the ID formats to keep track of all of them. However, they all appear to take the correct form for the IDs they represent.
    • In the RefSeq table, the gene IDs are in the following formats:
      • NP_######
      • WP_#########
      • YP_#########
    • In the OrderedLocusNames table, the gene IDs are in the formats:
      • VC_####
      • VC_A####
      • VC####
      • VCA####
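
Instead of scrolling through the tables by eye, the formats listed above could be checked with anchored regular expressions. This is a minimal sketch under the assumption that each # stands for exactly one digit; the function name classify is hypothetical, and the IDs would first have to be exported from the .gdb tables.

```python
import re

# Formats listed above, one "#" per digit (an assumption about the notation).
REFSEQ_FORMATS = {
    "NP_######": re.compile(r"NP_\d{6}"),
    "WP_#########": re.compile(r"WP_\d{9}"),
    "YP_#########": re.compile(r"YP_\d{9}"),
}
OLN_FORMATS = {
    "VC_####": re.compile(r"VC_\d{4}"),
    "VC_A####": re.compile(r"VC_A\d{4}"),
    "VC####": re.compile(r"VC\d{4}"),
    "VCA####": re.compile(r"VCA\d{4}"),
}

def classify(gene_id, formats):
    """Return the label of the first format the whole ID matches, else None."""
    for label, regex in formats.items():
        if regex.fullmatch(gene_id):
            return label
    return None

print(classify("YP_008474262", REFSEQ_FORMATS))
print(classify("VCA0001", OLN_FORMATS))
print(classify("VC_A0360.1", OLN_FORMATS))  # None: not one of the listed formats
```

Any ID for which classify returns None is one that deserves a closer look.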

Note:

  • The UniProt table contains 3789 entries.
  • The IDs in the RefSeq table seem to have the appropriate formats, but there are 6697 entries. YP_008474262 is the only ID with the YP_ format.
  • The OrderedLocusNames table has a total of 7664 entries, in contrast to the count of 3831 given by TallyEngine, Match, and SQL. However, there are twice as many entries here because each ID with an underscore has a duplicate without the underscore (e.g., a gene with ID VC_1234 also appears as VC1234, and both represent the same gene). Halving this number gives 3832 entries (1 more than what TallyEngine, Match, and SQL gave us).
  • This extra entry is due to one gene (technically two entries, since it has both an underscored and a non-underscored form) whose format had not been accounted for: VC_A0360.1 (and its corresponding non-underscored ID VCA0360.1).
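
The arithmetic in the notes above can be written out as a short sketch. The totals are the ones recorded in this section; the sample ID list is hypothetical and only shows how underscored and non-underscored forms collapse to one gene each.

```python
# Counts recorded above.
total_entries = 7664   # rows in the OrderedLocusNames table
tally_count = 3831     # count from TallyEngine, Match, and SQL

# Each gene appears twice (with and without the underscore).
genes = total_entries // 2
extra = genes - tally_count
print(genes, extra)

# Hypothetical sample: removing the underscore maps both forms to one gene.
sample_ids = ["VC_0001", "VC0001", "VC_A0360", "VCA0360", "VC_A0360.1", "VCA0360.1"]
distinct_genes = {gene_id.replace("_", "") for gene_id in sample_ids}
print(sorted(distinct_genes))
```

In the sample, VCA0360.1 remains a separate gene from VCA0360, which mirrors the single extra entry found in the real table.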

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

  • Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

  • How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries