Difference between revisions of "Troque Week 9"

Revision as of 06:23, 3 November 2015

User Page Bio Databases Main Page

Export Information

Version of GenMAPP Builder:

gmbuilder-3.0.0-build-5

Computer on which export was run:

The right-most computer at the very back of the room when facing the wide whiteboard/the computer at the back row that is closest to the door.

Postgres Database name:

V_cholerae_20151027_gmb-5_TR

UniProt XML filename (give filename and upload and link to compressed file):

UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_10
UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
Time taken to import: 2.91 minutes
- Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 10/27/2015 (at 2:24 AM)
GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
Time taken to import: 6.93 minutes
Time taken to process: 4.26 minutes
- Note:

GOA filename (give filename and upload and link to compressed file):

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on 10/14/15.
GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
Time taken to import: 0.06 minutes
- Note:

Name of .gdb file (give filename and upload and link to compressed file):

Time taken to export: 5 hours, 33 minutes and 40 seconds
- Start time: 3:52:13 PM PDT
- End time: 9:25:53 PM PDT
- Note: I was out of town (went to D.C./Maryland for a research conference) so I couldn't check when the export finished, but it was a good thing that I logged in as myself (instead of the general student account) on the computer since it preserved the windows I had open from before I was left.

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
- Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
- Choose the UniProt and GO OBO XML files that was uploaded from the previous sections of this assignment.
- Here is the screenshot of the tally result:

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Note: When I tried downloading the jar file, it wasn't in the zip file that I downloaded when I clicked the link provided in the protocol. However, thanks to Kevin Wyllie who gave me the file, I managed to acquire it.
Note: I created my own directory (called TrixRoq) in order to store all the files I downloaded into one single folder to differentiate it from the rest of the files in the ThawSpace disk. I also made a directory (called Working Data) in the ThawSpace0 disk in order to separate the source data from the data I worked with (per class suggestions from Dr. Dahlquist).
As a result, I had to cd to these directories first before using the command for using Match.
- In order to change into the ThawSpace0\TrixRoq\Working Data directory, use the following commands on the command prompt window:

T: && cd "TrixRoq\Working Data"

The command I used once inside the directory I want is:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml

The results are as follows:

These results did not match up with what the TallyEngine gave (TallyEngine: 3831 vs. Match: 2738)

As a result, the commands would have to be modified somehow so that the numbers match. Since the original command only accounted for the pattern with VC_####, we also have to account for the pattern for VC_A?####. The new command is as follows:

   java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml > command_prompt_results_TR.txt

which gave the results:

and saved it to the file "command_prompt_results_TR.txt"

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Since I already know to look for the pattern VC_A?#### instead of just VC_####, I would type in the following command in the pgAdmin window:

 select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results are as follows:

The result is the same for both TallyEngine and Match (i.e., 3831 counts).

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file. Note: I used the Microsoft Access 2010 program in order to look at the GDB files.

Benchmark .gdb file (since the 2010 file was the more recent one, I figured that I should use this one because it was more updated): Vc-Std_External_20101022

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

Benchmark GDB file:

New GDB file:

Note:

There are noticeably more rows in the new GDB than the benchmark GDB file.
The counts are either the same or higher in the new GDB file than they are on the 2010 benchmark GDB file (for the rows that existed for both).

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Some of the genes don't have a date in the Date field on the new GDB file.

The 2010 benchmark also did not have all the dates for each gene in the Date field.

Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
- In the UniProt table, there are too many variations in the gene ID formats to actually keep track of all of them. However, they all seem to look like they take the correct form the the IDs they represent
- In the RefSeg table, the gene IDs are in the following formats:
  - NP_######
  - WP_#########
  - YP_#########
- In the OrderedLocusNames table, the gene IDs are in the formats:
  - VC_####
  - VC_A####
  - VC####
  - VCA####

Note:

The UniProt table contains 3789 entries.
The IDs in the RefSeq table seem to have the appropriate formats for the genes, but there are 6697 entries. Only YP_008474262 is the gene with this format.
The OrderedLocusNames table has a total of 7664 entries in contrast to the other ones which only have 3831 entries. However, there is actually twice as much entries here than the other tables since the ID formats with the underscore has a duplicate where the ID does not have an underscore (i.e., if there is a gene with ID VC_1234, there is another similar gene with ID VC1234, but represent the same gene). Halving this number, we get 3832 entries (1 more than what TallyEngine, Match, and SQL gave us).
This extra entry is due to one gene (technically 2 since this gene has a duplicate - one has an underscore and one doesn't) with a format that has not been accounted for. There is a gene with ID VC_A0360.1 (and its corresponding non-underscored ID VCA0360.1).

.gdb Use in GenMAPP

Note:

Putting a gene on the MAPP using the GeneFinder window

Try a sample ID from each of the gene ID systems. Open the Backpage and see if all of the cross-referenced IDs that are supposed to be there are there.

Note:

Creating an Expression Dataset in the Expression Dataset Manager

How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?

Note:

Coloring a MAPP with expression data

Note:

Running MAPPFinder

Note:

Assignment Links

Weekly Assignments

Individual Journal Entries

Week 1 - This is technically the user page.
Week 2
Week 3
Week 4
Week 5
Week 6
Week 7
Week 8
Week 9
Week 10
Week 11
Week 12
No Week 13 Assignment
Week 14
Week 15

Shared Journal Entries

@@ Line 102: / Line 102: @@
 * Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
+** Some of the genes don't have a date in the Date field on the '''new''' GDB file.
+<div class="center">
+[[Image:System table result new TR 20151102.jpg]]
+</div>
+* The '''2010 benchmark''' also did not have all the dates for each gene in the Date field.
+<div class="center">
+[[Image:System table result benchmark TR 20151102.jpg]]
+</div>
 * Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
+** In the UniProt table, there are too many variations in the gene ID formats to actually keep track of all of them. However, they all seem to look like they take the correct form the the IDs they represent
+** In the RefSeg table, the gene IDs are in the following formats:
+*** NP_######
+*** WP_#########
+*** YP_#########
+** In the OrderedLocusNames table, the gene IDs are in the formats:
+*** VC_####
+*** VC_A####
+*** VC####
+*** VCA####
 Note:
+* The UniProt table contains 3789 entries.
+* The IDs in the RefSeq table seem to have the appropriate formats for the genes, but there are 6697 entries. Only YP_008474262 is the gene with this format.
+* The OrderedLocusNames table has a total of 7664 entries in contrast to the other ones which only have 3831 entries. However, there is actually twice as much entries here than the other tables since the ID formats with the underscore has a duplicate where the ID does not have an underscore (i.e., if there is a gene with ID VC_1234, there is another similar gene with ID VC1234, but represent the same gene). Halving this number, we get 3832 entries (1 more than what TallyEngine, Match, and SQL gave us).
+* This extra entry is due to one gene (technically 2 since this gene has a duplicate - one has an underscore and one doesn't) with a format that has not been accounted for. There is a gene with ID VC_A0360.1 (and its corresponding non-underscored ID VCA0360.1).
 ==.gdb Use in GenMAPP==

Difference between revisions of "Troque Week 9"

Revision as of 06:23, 3 November 2015

Contents

Export Information

TallyEngine

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

OriginalRowCounts Comparison

Visual Inspection

.gdb Use in GenMAPP

Putting a gene on the MAPP using the GeneFinder window

Creating an Expression Dataset in the Expression Dataset Manager

Coloring a MAPP with expression data

Running MAPPFinder

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools