Msaeedi23 Week 15
TallyEngine Customization (cw20151203)
In GenMAPP Builder version 3.0.0 Build 5 - cw20151203, the Bordetella pertussis species profile was customized to import 11 ORF gene IDs that were not exported in previous versions. To account for this change, the TallyEngine was customized for Bordetella pertussis to count "ORF" gene listings separately from "Ordered Locus Names". To do this, I followed the procedure documented below:
- First, we decided to count the "ordered locus" IDs and "ORF" IDs found in the gene/name tag of the UniProt XML file.
- In the relational database bpertussis_cw20151203_gmb3build5, gene IDs were defined by the type "ordered locus" or "ORF" in the table "genenametype".
- Next, Brandon opened our team's branch of GenMAPP Builder in Eclipse.
- Under edu.lmu.xmlpipedb.gmbuilder.resource.properties, he opened gmbuilder.properties.
- I located the block of text below (it was near the bottom).
    #
    # wizard.properties
    #
- Brandon added the necessary customizations above this block of text. The resulting code was as follows:
    # Bordetella pertussis
    bordetellapertussis_level_amount=1
    bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF
    bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF';
    bordetellapertussis_table_name_level0=ORF

    #
    # wizard.properties
    #
- Brandon then committed and pushed the code changes to GitHub and created a new distribution of GenMAPP Builder.
- Using the updated build of GenMAPP Builder in the distribution folder, we connected to the relational database bpertussis_cw20151203_gmb3build5 and ran TallyEngine. The results are pictured below:
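As a rough illustration of the comparison that the customized TallyEngine entry sets up (a count of gene/name elements of a given type in the XML against a count of matching rows in the genenametype table), here is a minimal Python sketch. It is not GenMAPP Builder code. The sample XML and rows are made up, an in-memory SQLite database stands in for the PostgreSQL database, and the `value` column name is an assumption; only the table name, the `type` column, and the element path come from the properties entry above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up sample in the UniProt XML layout (not data from the real proteome file).
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <gene>
      <name type="primary">geneA</name>
      <name type="ordered locus">BP0001</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="ordered locus">BP0002</name>
      <name type="ORF">BP0002A</name>
    </gene>
  </entry>
</uniprot>"""

NS = {"up": "http://uniprot.org/uniprot"}


def count_in_xml(xml_text, name_type):
    """Count uniprot/entry/gene/name elements whose type attribute equals name_type."""
    root = ET.fromstring(xml_text)
    names = root.findall("up:entry/up:gene/up:name", NS)
    return sum(1 for name in names if name.get("type") == name_type)


def count_in_db(conn, name_type):
    """Run the same kind of count query as the properties entry, parameterized by type."""
    row = conn.execute(
        "select count(*) from genenametype where type = ?", (name_type,)
    ).fetchone()
    return row[0]


# In-memory SQLite stands in for PostgreSQL; the "value" column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("create table genenametype (value text, type text)")
conn.executemany(
    "insert into genenametype values (?, ?)",
    [
        ("geneA", "primary"),
        ("BP0001", "ordered locus"),
        ("BP0002", "ordered locus"),
        ("BP0002A", "ORF"),
    ],
)

# The XML count and the database count should agree for each gene name type.
for name_type in ("ordered locus", "ORF"):
    print(name_type, count_in_xml(SAMPLE_XML, name_type), count_in_db(conn, name_type))
```

If the two counts differ for a type, some IDs of that type were lost between the XML file and the relational database.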
Testing the Bordetella pertussis Gene Database (cw20151203)
The full Gene Database Testing Report for the .gdb file tagged cw20151203 can be found here: Gene Database Testing Report- cw20151203. While assessing this gene database, Brandon and I found one gene ID that was not successfully exported into the .gdb file. A summary of this issue and the steps taken to investigate it is presented below:
- TallyEngine Count
- As described in the "TallyEngine Customization (cw20151203)" section of this page, the expected gene ID count including "Ordered Locus Names" and "ORF" listings was 3446.
- This count was confirmed using the customized TallyEngine.
- XMLPipeDB Match Count
- With the help of Dr. Dionisio, a new regex was crafted to retrieve all possible "ordered locus" and "ORF" gene ID patterns that we identified. The XMLPipeDB Match query and result are pictured below:
- XMLPipeDB Match vs. "Ordered Locus Names" from File:Bpertussis-std cw20151203.zip
- In order to identify the missing gene ID, we compared the XMLPipeDB Match output to the gene IDs listed in the "Ordered Locus Names" table of the file File:Bpertussis-std cw20151203.zip (retrieved using Microsoft Access).
- In Excel, the missing gene ID was identified as BP3167A.
- Interestingly, this gene ID had another unusual variant that we previously documented: BP3167.1.
- Although this gene ID's pattern (BP####A) matched that of the ORF values, it was not present in the list of ORF genes retrieved in PostgreSQL (see Bklein7_Week_14).
- Identifying "BP3167A" in the Original XML File
- Based on our TallyEngine and PostgreSQL results, it appeared as though the gene ID "BP3167A" was not listed under the type "ordered locus" or "ORF". To determine its gene type, we opened the original XML file (File:Uniprot-proteome-UP000002676 cw20151201.zip) and searched for "BP3167A":
- In the XML file, "BP3167A" was listed with the general gene type "gene ID". This specific designation had not been observed as a standalone gene type before.
- Nevertheless, the manner in which "BP3167A" was listed in the XML file indicated that it was in fact a proper gene ID and not an artifactual finding. This necessitated further research.
- Researching the Different Forms of "BP3167"
- UniProt
- Searching for "BP3167", "BP3167.1", or "BP3167A" all linked to the following gene page: http://www.uniprot.org/uniprot/Q7VUD4
- The above page specifies that the gene ID "BP3167.1" refers to the gene ureE, which codes for the urease accessory protein UreE.
- EnsemblBacteria
- Searching for "BP3167" and "BP3167A" retrieves two different results:
- Therefore, the gene ID "BP3167A" is a valid ID that refers to the same gene as "BP3167.1" in the UniProt database.
- Conclusion: "BP3167A" is a reference ID from EnsemblBacteria that is valid and must be exported.
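The comparison described above (collect every unique gene ID that matches a pattern, then check which of them are absent from the exported table) can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expression is a hypothetical pattern covering the forms mentioned on this page (BP####, BP####A, and BP####.1), not the regex actually used with XMLPipeDB Match, and the sample text and exported list are made up apart from the IDs "BP3167A" and "BP3167.1".

```python
import re

# Hypothetical pattern: "BP", four digits, then optionally a capital letter
# (BP3167A) or a dot followed by digits (BP3167.1).
GENE_ID_PATTERN = re.compile(r"\bBP\d{4}(?:[A-Z]|\.\d+)?\b")


def unique_matches(text):
    """Return the set of unique gene IDs found in the text."""
    return set(GENE_ID_PATTERN.findall(text))


def missing_ids(matched, exported):
    """Return the IDs that were matched in the source but never exported."""
    return sorted(set(matched) - set(exported))


# Made-up stand-ins for the XML file contents and the exported ID table.
sample_text = "BP3166 BP3167.1 BP3167A BP3168 BP3166"
exported = ["BP3166", "BP3167.1", "BP3168"]

matched = unique_matches(sample_text)
print(len(matched))                    # prints 4
print(missing_ids(matched, exported))  # prints ['BP3167A']
```

Using sets means repeated occurrences of an ID in the source are counted once, which matches the goal of comparing unique ID counts rather than raw occurrences.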
Gene Database Testing Report 12/10
The Gene Database Testing Report for this new gene database can be found here: Gene Database Testing Report- cw20151210.