Troque Week 9

User Page Bio Databases Main Page

Files created

Here is the zipped file with all the files I processed for this assignment (may contain extraneous files).

Click here to download the zip file.

Export Information

Version of GenMAPP Builder:

gmbuilder-3.0.0-build-5

Computer on which export was run:

The right-most computer at the very back of the room when facing the wide whiteboard/the computer at the back row that is closest to the door.

Postgres Database name:

V_cholerae_20151027_gmb-5_TR

UniProt XML filename (give filename and upload and link to compressed file):

UniProt XML version (The version information can be found at the UniProt News Page): UniProt release 2015_10
UniProt XML download link: http://www.uniprot.org/uniprot/?query=organism:243277
Time taken to import: 2.91 minutes
- Note:

GO OBO-XML filename (give filename and upload and link to compressed file):

GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the GO Download page has been unzipped): Version created on 10/27/2015 (at 2:24 AM)
GO OBO-XML download link: http://geneontology.org/page/download-ontology#Legacy_Downloads
Time taken to import: 6.93 minutes
Time taken to process: 4.26 minutes
- Note:

GOA filename (give filename and upload and link to compressed file):

GOA version (News on this page records past releases; current information can be found in the Last modified field on the FTP site): Version released on 10/14/15.
GOA download link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/46.V_cholerae_ATCC_39315.goa
Time taken to import: 0.06 minutes
- Note:

Name of .gdb file (give filename and upload and link to compressed file):

Time taken to export: 5 hours, 33 minutes and 40 seconds
- Start time: 3:52:13 PM PDT
- End time: 9:25:53 PM PDT
- Note: I was out of town (went to D.C./Maryland for a research conference) so I couldn't check when the export finished, but it was a good thing that I logged in as myself (instead of the general student account) on the computer since it preserved the windows I had open from before I was left.

TallyEngine

Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
- Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
- Choose the UniProt and GO OBO XML files that was uploaded from the previous sections of this assignment.
- Here is the screenshot of the tally result:

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Follow the instructions found on this page to run XMLPipeDB match.

Note: When I tried downloading the jar file, it wasn't in the zip file that I downloaded when I clicked the link provided in the protocol. However, thanks to Kevin Wyllie who gave me the file, I managed to acquire it.
Note: I created my own directory (called TrixRoq) in order to store all the files I downloaded into one single folder to differentiate it from the rest of the files in the ThawSpace disk. I also made a directory (called Working Data) in the ThawSpace0 disk in order to separate the source data from the data I worked with (per class suggestions from Dr. Dahlquist).
As a result, I had to cd to these directories first before using the command for using Match.
- In order to change into the ThawSpace0\TrixRoq\Working Data directory, use the following commands on the command prompt window:

T: && cd "TrixRoq\Working Data"

The command I used once inside the directory I want is:

java -jar xmlpipedb-match-1.1.1.jar "VC_[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml

The results are as follows:

These results did not match up with what the TallyEngine gave (TallyEngine: 3831 vs. Match: 2738)

As a result, the commands would have to be modified somehow so that the numbers match. Since the original command only accounted for the pattern with VC_####, we also have to account for the pattern for VC_A?####. The new command is as follows:

   java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml > command_prompt_results_TR.txt

which gave the results:

and saved it to the file "command_prompt_results_TR.txt"

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

Since I already know to look for the pattern VC_A?#### instead of just VC_####, I would type in the following command in the pgAdmin window:

 select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';

The results are as follows:

The result is the same for both TallyEngine and Match (i.e., 3831 counts).

OriginalRowCounts Comparison

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file. Note: I used the Microsoft Access 2010 program in order to look at the GDB files.

Benchmark .gdb file (since the 2010 file was the more recent one, I figured that I should use this one because it was more updated): Vc-Std_External_20101022

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

Benchmark GDB file:

New GDB file:

Note:

There are noticeably more rows in the new GDB than the benchmark GDB file.
The counts are either the same or higher in the new GDB file than they are on the 2010 benchmark GDB file (for the rows that existed for both).

Visual Inspection

Perform visual inspection of individual tables to see if there are any problems.

Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
- Some of the genes don't have a date in the Date field on the new GDB file.

The 2010 benchmark also did not have all the dates for each gene in the Date field.

Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
- For the NEW GDB file:
  - In the UniProt table, there are too many variations in the gene ID formats to actually keep track of all of them. However, they all seem to look like they take the correct form the the IDs they represent
  - In the RefSeq table, the gene IDs are in the following formats:
    - NP_######
    - WP_#########
    - YP_#########
  - In the OrderedLocusNames table, the gene IDs are in the formats:
    - VC_####
    - VC_A####
    - VC####
    - VCA####
- For the 2010 Benchmark GDB file:
  - In the UniProt table, there are too many variations in the gene ID formats to actually keep track of all of them. However, they all seem to look like they take the correct form the the IDs they represent
  - In the RefSeq table, the gene IDs are in the following formats:
    - NP_######
  - In the OrderedLocusNames table, the gene IDs are in the formats:
    - VC_####
    - VC_A####
    - VC####
    - VCA####

Note:

The UniProt table contains 3789 entries in the NEW GDB file. The Benchmark seems to actually be more consistent in terms of ID formatting than the NEW. The Benchmark contains only 3784 entries.
In both the NEW and the Benchmark, the IDs in the RefSeq table seem to have the appropriate formats for the genes.
- There are 6697 entries in the NEW, but there are 3827 in the Benchmark.
- Note: In the NEW file, only YP_008474262 is the gene with this format.
- The Benchmark does not have this ID.
The OrderedLocusNames table has a total of 7664 entries for both the NEW and the Benchmark files (in contrast to TallyEngine, SQL, and Match which only have 3831 entries). However, there is actually twice as much entries here than the other tables since the ID formats with the underscore has a duplicate where the ID does not have an underscore (i.e., if there is a gene with ID VC_1234, there is another similar gene with ID VC1234, but represent the same gene). Halving this number, we get 3832 entries (1 more than what TallyEngine, Match, and SQL gave us).
This extra entry is due to one gene (technically 2 since this gene has a duplicate - one has an underscore and one doesn't) with a format that has not been accounted for. There is a gene with ID VC_A0360.1 (and its corresponding non-underscored ID VCA0360.1).

.gdb Use in GenMAPP

Note: The following steps that I did in order to do this part of the assignment is similar to the protocol I did for the Week 8 assignment. Here are screenshots of the windows that appeared when I did the protocol.

File conversion of raw Merrell data:

The resulting legend window once I've inputted the criterions:

Window for calculating new results (you will need to be patient since you'll be seeing this window for at least 10 minutes - i.e., give it time to load):

Putting a gene on the MAPP using the MAPPFinder window

It was not possible for me to do this section since I always get a run-time error every time I double-clicked on the terms on the MAPPFinder Browser window. Normally, if the error was not present, I would get a small window at the upper left of my screen which shows me the MAPP file of the term. If I had this window, I would be able to click on the terms inside the rectangles and see the backpages. Here's the screenshot of the error:

Note: Because of this error, it was not possible for me to complete the "Coloring a MAPP with expression data" section of this assignment as well.

Creating an Expression Dataset in the Expression Dataset Manager

The total number of IDs that were imported was 5221, the same number of genes that we were processing for the Merrell data for Week 8.
There were 121 exceptions (which can be seen from the screenshot above), the same number of exceptions that the students who worked on the 2010 data received during the Week 8 assignment. The only exception was "Gene not found in OrderedLocusNames or any related system." when I looked at the EX.txt file.
In order to view the XML file, I had to import it to a web browser. I tried using Google Chrome at first, but it always gave me an error so I switched to using Firefox. I dragged the UniProt XML file from my TrixRoq\Working Data folder into a tab on Firefox in order to view what was in it (I had to Google search how to view an XML file and the first thing I found was to simply drag the file into a browser). The link on the search bar on Firefox then became:

file:///T:/TrixRoq/Working%20Data/uniprot-organism%253A243277.xml

since the file came locally on the computer I was using and was located in the T: disk.

I then used the keyboard shortcut for finding specific terms on a page using CTRL + F. This opened a mini search field at the bottom left corner of the browser window. I then copied one of the IDs on the EX.txt file into this search field in order to see if the ID really was in the UniProt XML file. It turns out that these files were not present in the XML file to begin with the same way that the genes were not present in the gene database that we were using for this week's assignment.

Coloring a MAPP with expression data

Note: I could not perform the actions to color a MAPP with expression data (thus, I was also unable to save a MAPP file) since it was giving me runtime errors for every single item I double-click on the MAPPFinder Browser window. The screenshot below is the same for any of the terms I clicked (this also is the same error window screenshot as what is shown in the "Putting a gene on the MAPP using the GeneFinder window" section of this assignment).

The run-time error code is saying that some file is not found although I have no idea what file it is looking for.

Running MAPPFinder

Note: Below are the results of when I entered the gene ID "VC0028" into the "Search for a specific Gene ID" search bar, selected "OrderedLocusNames" in the drop-down that is next to it, and clicked "Gene ID Search":

Clicking on the "Show Ranked List" Menu item in the toolbar, I get the following:

Assignment Links

Weekly Assignments

Individual Journal Entries

Week 1 - This is technically the user page.
Week 2
Week 3
Week 4
Week 5
Week 6
Week 7
Week 8
Week 9
Week 10
Week 11
Week 12
No Week 13 Assignment
Week 14
Week 15

Shared Journal Entries

Troque Week 9

Contents

Files created

Export Information

TallyEngine

Using XMLPipeDB match to Validate the XML Results from the TallyEngine

Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine

OriginalRowCounts Comparison

Visual Inspection

.gdb Use in GenMAPP

Putting a gene on the MAPP using the MAPPFinder window

Creating an Expression Dataset in the Expression Dataset Manager

Coloring a MAPP with expression data

Running MAPPFinder

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools