Difference between revisions of "Msaeedi23 Week 15"
From LMU BioDB 2015
Line 1: | Line 1: | ||
− | == | + | ==TallyEngine Customization (cw20151203)== |
+ | In GenMAPP Builder version 3.0.0 Build 5 - cw20151203, the ''Bordetella pertussis'' species profile was customized to import 11 ORF gene IDs that were not exported in previous versions. To account for this change, the TallyEngine was customized for ''Bordetella pertussis'' to count "ORF" gene listings separately from "Ordered Locus Names". To do this, I followed the procedure documented below: | ||
+ | |||
+ | * First, we decided to count the "ordered locus" IDs and "ORF" IDs from the ''gene/name'' tag in the UniProt XML file. | |
+ | ** In the relational database ''bpertussis_cw20151203_gmb3build5'', gene IDs were defined by the type "ordered locus" or "ORF" in the table "genenametype". | ||
+ | * Next, Brandon opened our team's branch of GenMAPP Builder in Eclipse. | ||
+ | * Under ''edu.lmu.xmlpipedb.gmbuilder.resource.properties'', he opened ''gmbuilder.properties''. | ||
+ | * I located the block of text below (it was near the bottom). | ||
+ | # | ||
+ | # wizard.properties | ||
+ | # | ||
+ | * Brandon added the necessary customizations above this block of text. The resulting code was as follows: | ||
+ | # Bordetella pertussis | ||
+ | bordetellapertussis_level_amount=1 | ||
+ | |||
+ | bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF | ||
+ | |||
+ | bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF'; | ||
+ | |||
+ | bordetellapertussis_table_name_level0=ORF | ||
+ | |||
+ | # | ||
+ | # wizard.properties | ||
+ | # | ||
+ | * Brandon then committed and pushed the changes in the code to Github and created a new distribution of GenMAPP Builder. | ||
+ | * Using the updated build of GenMAPP Builder present in the distribution folder, we connected to the relational database ''bpertussis_cw20151203_gmb3build5'' and ran TallyEngine. The results are pictured below: | |
+ | **[[File: Tallyenginecustomization_cw20151203.png]] | ||
+ | ***The TallyEngine results successfully reflected the customizations we made to the TallyEngine, listing all 11 ORF genes in addition to the 3435 "Ordered Locus Names" gene IDs present in the ''Bordetella pertussis'' gene database. | ||
+ | |||
+ | ==Testing the Bordetella Pertussis Gene Database (cw20151203)== | ||
+ | The full Gene Database Testing Report for the .gdb file tagged cw20151203 can be found here: [[Gene Database Testing Report- cw20151203]]. In assessing this gene database with [[User:Bklein7 | Brandon]], we found one gene ID that was not successfully exported into the .gdb file. A summary of this issue and the steps taken to investigate it is presented below: | |
+ | |||
+ | *TallyEngine Count | ||
+ | **As described in the "TallyEngine Customization (cw20151203)" section of this page, the expected gene ID count including "Ordered Locus Names" and "ORF" listings was 3446. | ||
+ | **This count was confirmed using the customized TallyEngine. | ||
+ | *XMLPipeDB Match Count | ||
+ | **With the help of [[User:Dondi|Dr. Dionisio]], a new regex was crafted to retrieve all possible "ordered locus" and "ORF" gene ID patterns that we identified. The XMLPipeDB Match query and result are pictured below: | ||
+ | ***[[File:Xmlpipedbmatch cw20151203.png]] | ||
+ | ****To our surprise, XMLPipeDB Match returned a result of 3447 gene IDs that matched our updated regex. | ||
+ | ****This revealed that one ID matching our regex was not successfully exported to the cw20151203 .gdb file. Further investigation was necessary. | |
+ | *XMLPipeDB Match vs. "Ordered Locus Names" from [[File:Bpertussis-std cw20151203.zip]] | ||
+ | **In order to identify the missing gene ID, we compared the XMLPipeDB Match output to the gene IDs listed in the "Ordered Locus Names" table of the file [[File:Bpertussis-std cw20151203.zip]] (retrieved using Microsoft Access). | ||
+ | ** In Excel, the missing gene ID was identified to be ''BP3167A'': | ||
+ | ***[[File:Xmlpipedbmatch vs gdb cw20151203.PNG]] | ||
+ | ****Interestingly, this gene ID had another unusual variant that we previously documented: BP3167.1. | |
+ | ****Although this gene ID's pattern (BP####A) matched that of the ORF values, it was not present in the list of ORF genes retrieved in PostgreSQL (see [[Bklein7_Week_14]]). | ||
+ | *Identifying "BP3167A" in the Original XML File | ||
+ | **Based on our TallyEngine and PostgreSQL results, it appeared as though the gene ID "BP3167A" was not listed under the type "ordered locus" or "ORF". To determine its gene type, we opened the original XML file ([[File:Uniprot-proteome-UP000002676 cw20151201.zip]]) and searched for "BP3167A": | ||
+ | ***[[File:MissedID cw20151203.png]] | ||
+ | ****In the XML file, "BP3167A" was listed with the general gene type "gene ID". This specific designation had not been observed as a standalone gene type before. | |
+ | ****Nevertheless, the manner in which "BP3167A" was listed in the XML file indicated that it was in fact a proper gene ID and not an artifactual finding. This necessitated further research. | ||
+ | *Researching the Different Forms of "BP3167" | ||
+ | **UniProt | ||
+ | ***Searching for "BP3167", "BP3167.1", or "BP3167A" all linked to the following gene page: http://www.uniprot.org/uniprot/Q7VUD4 | ||
+ | ****The above page specifies that the gene ID "BP3167.1" refers to the gene ureE that codes for Urease accessory protein UreE. | ||
+ | **EnsemblBacteria | |
+ | ***Searching for "BP3167" and "BP3167A" retrieves two different results: | |
+ | ****[http://bacteria.ensembl.org/Multi/Search/Results?species=all;idx=;q=bp3167;site=ensemblunit BP3167]: gene ureF, a pseudogene. | |
+ | ****[http://bacteria.ensembl.org/Multi/Search/Results?species=all;idx=;q=bp3167A;site=ensemblunit BP3167A]: gene ureE, which codes for urease accessory protein (as in UniProt). | |
+ | ***Therefore, the gene ID "BP3167A" is a valid ID that refers to the same gene (ureE) as "BP3167.1" in the UniProt database. | |
+ | **Conclusion: "BP3167A" is a reference ID from EnsemblBacteria that is valid and must be exported. | ||
+ | |||
+ | {{template: msaeedi23}} |
Revision as of 07:37, 14 December 2015
TallyEngine Customization (cw20151203)
In GenMAPP Builder version 3.0.0 Build 5 - cw20151203, the Bordetella pertussis species profile was customized to import 11 ORF gene IDs that were not exported in previous versions. To account for this change, the TallyEngine was customized for Bordetella pertussis to count "ORF" gene listings separately from "Ordered Locus Names". To do this, I followed the procedure documented below:
- First, we decided to count the "ordered locus" IDs and "ORF" IDs from the gene/name tag in the UniProt XML file.
- In the relational database bpertussis_cw20151203_gmb3build5, gene IDs were defined by the type "ordered locus" or "ORF" in the table "genenametype".
- Next, Brandon opened our team's branch of GenMAPP Builder in Eclipse.
- Under edu.lmu.xmlpipedb.gmbuilder.resource.properties, he opened gmbuilder.properties.
- I located the block of text below (it was near the bottom).
 #
 # wizard.properties
 #
- Brandon added the necessary customizations above this block of text. The resulting code was as follows:
 # Bordetella pertussis
 bordetellapertussis_level_amount=1
 bordetellapertussis_element_level0=uniprot/entry/gene/name&type&ORF
 bordetellapertussis_query_level0=select count(*) from genenametype where type = 'ORF';
 bordetellapertussis_table_name_level0=ORF
 #
 # wizard.properties
 #
- Brandon then committed and pushed the changes in the code to Github and created a new distribution of GenMAPP Builder.
- Using the updated build of GenMAPP Builder present in the distribution folder, we connected to the relational database bpertussis_cw20151203_gmb3build5 and ran TallyEngine. The results are pictured below:
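The customized count can also be sanity-checked outside of TallyEngine. The sketch below runs the same kind of `select count(*)` query as the properties entry above against a tiny in-memory SQLite table. The SQLite stand-in, the two-column table layout, the sample rows, and the `count_by_type` helper are all assumptions made for illustration; the real data lives in the PostgreSQL database bpertussis_cw20151203_gmb3build5, whose `genenametype` table may have a different layout.

```python
import sqlite3


def count_by_type(conn, gene_type):
    """Return the number of genenametype rows with the given type value."""
    cur = conn.execute(
        "select count(*) from genenametype where type = ?", (gene_type,)
    )
    return cur.fetchone()[0]


# Hypothetical stand-in for the real database: made-up IDs, assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute("create table genenametype (value text, type text)")
conn.executemany(
    "insert into genenametype values (?, ?)",
    [
        ("BP0001", "ordered locus"),
        ("BP0002", "ordered locus"),
        ("BP0003", "ordered locus"),
        ("BP0101A", "ORF"),
        ("dnaA", "primary"),
    ],
)

orf_count = count_by_type(conn, "ORF")
locus_count = count_by_type(conn, "ordered locus")

# The expected gene ID total is the sum of the two separately tallied types.
print(orf_count, locus_count, orf_count + locus_count)
```

Counting each type with its own query mirrors how the properties file defines one query per level, so the two tallies can be compared with the TallyEngine output independently.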
Testing the Bordetella Pertussis Gene Database (cw20151203)
The full Gene Database Testing Report for the .gdb file tagged cw20151203 can be found here: Gene Database Testing Report- cw20151203. In assessing this gene database with Brandon, we found one gene ID that was not successfully exported into the .gdb file. A summary of this issue and the steps taken to investigate it is presented below:
- TallyEngine Count
- As described in the "TallyEngine Customization (cw20151203)" section of this page, the expected gene ID count including "Ordered Locus Names" and "ORF" listings was 3446.
- This count was confirmed using the customized TallyEngine.
- XMLPipeDB Match Count
- With the help of Dr. Dionisio, a new regex was crafted to retrieve all possible "ordered locus" and "ORF" gene ID patterns that we identified. The XMLPipeDB Match query and result are pictured below:
- XMLPipeDB Match vs. "Ordered Locus Names" from File:Bpertussis-std cw20151203.zip
- In order to identify the missing gene ID, we compared the XMLPipeDB Match output to the gene IDs listed in the "Ordered Locus Names" table of the file File:Bpertussis-std cw20151203.zip (retrieved using Microsoft Access).
- In Excel, the missing gene ID was identified to be BP3167A:
- Interestingly, this gene ID had another unusual variant that we previously documented: BP3167.1.
- Although this gene ID's pattern (BP####A) matched that of the ORF values, it was not present in the list of ORF genes retrieved in PostgreSQL (see Bklein7_Week_14).
- Identifying "BP3167A" in the Original XML File
- Based on our TallyEngine and PostgreSQL results, it appeared as though the gene ID "BP3167A" was not listed under the type "ordered locus" or "ORF". To determine its gene type, we opened the original XML file (File:Uniprot-proteome-UP000002676 cw20151201.zip) and searched for "BP3167A":
- In the XML file, "BP3167A" was listed with the general gene type "gene ID". This specific designation had not been observed as a standalone gene type before.
- Nevertheless, the manner in which "BP3167A" was listed in the XML file indicated that it was in fact a proper gene ID and not an artifactual finding. This necessitated further research.
- Researching the Different Forms of "BP3167"
- UniProt
- Searching for "BP3167", "BP3167.1", or "BP3167A" all linked to the following gene page: http://www.uniprot.org/uniprot/Q7VUD4
- The above page specifies that the gene ID "BP3167.1" refers to the gene ureE that codes for Urease accessory protein UreE.
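- EnsemblBacteria
- Searching for "BP3167" and "BP3167A" retrieves two different results:
- BP3167: gene ureF, a pseudogene.
- BP3167A: gene ureE, which codes for urease accessory protein (as in UniProt).
- Therefore, the gene ID "BP3167A" is a valid ID that refers to the same gene (ureE) as "BP3167.1" in the UniProt database.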
- Conclusion: "BP3167A" is a reference ID from EnsemblBacteria that is valid and must be exported.
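The comparison used to find the missing ID amounts to a set difference between the IDs matched in the XML file and the IDs exported to the gene database. A minimal sketch of that idea follows. The regex, the sample text, and the exported list are illustrative assumptions, not the actual regex written with Dr. Dionisio or the data from the cw20151203 files; `find_unexported` is a hypothetical helper name.

```python
import re

# Hypothetical pattern: "BP" plus four digits, optionally followed by one
# uppercase letter (the BP####A form described above). The real regex used
# with XMLPipeDB Match may differ.
ID_PATTERN = re.compile(r"\bBP\d{4}[A-Z]?\b")


def find_unexported(xml_text, exported_ids):
    """Return IDs that match the pattern in xml_text but were not exported."""
    matched = set(ID_PATTERN.findall(xml_text))
    return sorted(matched - set(exported_ids))


# Hand-written sample text, not copied from the UniProt XML file.
sample_xml = (
    "<name>BP0001</name>"
    "<name>BP0002</name>"
    "<name>BP0101A</name>"
    "<name>BP3167A</name>"
)

# Made-up stand-in for the ID list retrieved from the .gdb file.
exported = ["BP0001", "BP0002", "BP0101A"]

missing = find_unexported(sample_xml, exported)
print(missing)  # prints ['BP3167A']
```

Working with sets means a count mismatch such as 3447 matches against 3446 exported IDs can be traced to the specific ID directly, rather than by scanning two sorted columns by eye in Excel.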