Blitvak Week 14
From LMU BioDB 2015
Revision as of 19:55, 7 December 2015 by Blitvak (Talk | contribs) (added initial analysis/steps, 12/01 work)
Goals for Week 14
- Consult with Anu to make modifications to TallyEngine/GenMAPP (share initial export results)
- Use Excel to track discrepant IDs (reference: Using Microsoft Excel to Compare ID Lists)
- Conduct gene database exports for any modified versions of GenMAPP builder that are created
- Analyze any conducted exports and perform Q&A work
Initial Export Analysis
Overview of Week 12 findings
- Using XMLPipeDB Match, 7127 unique matches were found that correlated with the OrderedLocusNames IDs outlined at the end of the week 12 assignment
- TallyEngine reported that 337 OrderedLocusNames were present in the XML and within the PSQL database
- The following query verified that 337 OrderedLocusNames entries were present in B.cenocepacia_J2315_20151119_gmb3build5:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';
- By looking at the data present in the genenametype table, it was found that the OrderedLocusName data was in the format BceJ2315_#####
Steps taken for further analysis
- The UniProt XML file was opened in the First Object XML Editor in order to investigate and verify the location and nature of the OrderedLocusName data.
- A data entry was selected and the related data was looked into:
- In this entry, and in numerous others, only the gene name in the format BceJ2315_##### was tagged as being of the "ordered locus" type. The format that was focused upon in previous work, p?BCA[M,S,L]###?[A,a]#[A-Z]?, was labeled as being of the type "ORF". Every entry that contained an "ordered locus" gene name also contained an "ORF" name for the same gene; most entries, however, lacked an "ordered locus" name and contained only an "ORF" name.
- GenMAPP Builder, by default, picks up and uses the ordered locus data within the XML; with respect to the initial export, it was functioning properly. Since the XML data contained only 337 OrderedLocus names, only 337 made it to the database. Since XMLPipeDB Match found 7127 matches that correlated to an "ORF" name, it is assumed that most of the gene data is ignored when the export focuses on OrderedLocus names.
- UniProt KB was referenced in order to further verify that all BceJ2315_##### gene names were coupled with one that was considered an "ORF" name
- A search query was conducted that consisted of bcej2315 NOT gene:bca*; this query, it was hoped, would show the number of gene entries that contained just an OrderedLocusName ID.
- UniProt yielded 0 results for this query, which further suggests that the "ORF" gene name should be focused upon; all entries in UniProt contained an "ORF" name, and all of the entries found in the Model Organism Database for B. cenocepacia used gene names in the format p?BCA[M,S,L]###?[A,a]#[A-Z]?.
- It was decided that GenMAPP Builder should be modified so that it focuses solely on the "ORF" names within the XML file.
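The manual inspection described in the steps above can also be approximated with a script. The following is a minimal sketch, not the procedure actually used: it assumes the standard UniProt XML layout (gene names stored as name elements with a type attribute inside each entry's gene element), and the two-entry sample, its accessions, and its IDs are hypothetical placeholders rather than real records.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace declared by UniProt XML downloads.
NS = {"up": "http://uniprot.org/uniprot"}

# Hypothetical sample: accessions and IDs are placeholders, not real records.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <gene>
      <name type="ordered locus">BceJ2315_00001</name>
      <name type="ORF">BCAL0001</name>
    </gene>
  </entry>
  <entry>
    <accession>X00002</accession>
    <gene>
      <name type="ORF">BCAM0002</name>
    </gene>
  </entry>
</uniprot>"""


def tally_gene_name_types(xml_text):
    """Count entries per gene-name type, plus entries that have only an ORF name."""
    root = ET.fromstring(xml_text)
    entries_per_type = Counter()
    orf_only = 0
    for entry in root.findall("up:entry", NS):
        # Collect the distinct name types present in this entry.
        types = {name.get("type") for name in entry.findall("up:gene/up:name", NS)}
        entries_per_type.update(types)
        if "ORF" in types and "ordered locus" not in types:
            orf_only += 1
    return entries_per_type, orf_only


entries_per_type, orf_only = tally_gene_name_types(SAMPLE)
print("entries with an ORF name:", entries_per_type["ORF"])
print("entries with an ordered locus name:", entries_per_type["ordered locus"])
print("entries with only an ORF name:", orf_only)
```

For a full proteome file, reading the whole document with ET.fromstring may use a lot of memory; ET.iterparse would be the usual alternative.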
12/1
- Goals: Consult with Anu to make modifications to TallyEngine/GenMAPP (share initial export results)
- Reacquaint with using Excel to track IDs/discrepant IDs (Using Microsoft Excel to Compare ID Lists)
- pgAdmin III will be used, in conjunction with Excel, for the new modified .gdb. The following query will be run and its results exported for use with Excel:
select value from genenametype where type = 'ordered locus';
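The ID comparison planned here for Excel can also be sketched in a few lines of Python. This is only an illustration of the idea, assuming the IDs from the XML and from the exported query results have already been read into two lists; the function name and the sample IDs are hypothetical.

```python
def compare_id_lists(xml_ids, db_ids):
    """Return the IDs unique to each list and the IDs shared by both."""
    xml_set, db_set = set(xml_ids), set(db_ids)
    return {
        "only_in_xml": sorted(xml_set - db_set),
        "only_in_db": sorted(db_set - xml_set),
        "in_both": sorted(xml_set & db_set),
    }


# Hypothetical IDs, for illustration only.
xml_ids = ["BCAL0001", "BCAL0002", "BCAM0003"]
db_ids = ["BCAL0001", "BCAM0003", "BCAS0004"]

result = compare_id_lists(xml_ids, db_ids)
for label, ids in result.items():
    print(label, ids)
```

The comparison is case-sensitive, so IDs that differ only in letter case would be reported as discrepant.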
B.cenocepacia_J2315_20151201_gmbuilder-genialomics-20151201
ORF not ORDEREDLOCUS
- XML - 3.72 minutes
- OBO/XML - 5.25 minutes
- Processing: 3.91 minutes
- GOA - 0.04 minutes
EXPORT START: 10:27 PM
END: 2:49 AM
12/3
'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';
select count(*) from genenametype where type = 'ordered locus' and value ~ 'p?BCA[M,S,L][0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';
MODIFIED: 'p?BCA[M,S,L]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';
first object xml editor
6 IDs with problems
bca199f
select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';
- BUILD 3
- XML - 3.99 minutes
- OBO/XML - 5.77 minutes (1), 4.06 minutes (2)
- GOA - 0.05 minutes
START: 7:24 PM END: 12:01 AM
- BUILD 4
- XML - 3.46 minutes
- OBO/XML - 5.05 minutes, 3.75 minutes (2)
- GOA - 0.04 minutes
START: 7:51 AM END: 6:57 PM
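The start and end times recorded for the exports can be turned into durations with a small helper. This sketch assumes each export finished less than 24 hours after it started, so an end time that is not later than the start time is treated as falling on the next day; the times are written in AM/PM notation for parsing.

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Duration between two clock times, assuming the run lasted under 24 hours."""
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        # The end time is on the following day.
        t1 += timedelta(days=1)
    return t1 - t0


# Start and end times as recorded in the notes above.
runs = {
    "12/1 export": ("10:27 PM", "2:49 AM"),
    "BUILD 3": ("7:24 PM", "12:01 AM"),
    "BUILD 4": ("7:51 AM", "6:57 PM"),
}
for label, (start, end) in runs.items():
    print(label, elapsed(start, end))
```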
'[pBCA,BCAL,BCAS,BCAM][0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';
'(pBCA)?(BCAL)?(BCAS)?(BCAM)?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';
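The two patterns above behave differently from what their shape suggests, which the sketch below illustrates with Python's re module. It assumes that re.search is a fair stand-in for the unanchored matching of the PostgreSQL ~ operator for these simple patterns. The third pattern, STRICT, is a hypothetical anchored alternative and is not recorded as having been used; the test IDs are likewise made up for illustration.

```python
import re

# Suffix shared by both patterns recorded above, copied unchanged.
SUFFIX = r"[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?"

# Square brackets form a character class: this matches ONE character out of
# p, B, C, A, L, S, M or a comma, not one of the four prefixes.
CHAR_CLASS = re.compile(r"[pBCA,BCAL,BCAS,BCAM]" + SUFFIX)

# Every group is optional, so any run of three digits satisfies the pattern.
OPTIONAL_GROUPS = re.compile(r"(pBCA)?(BCAL)?(BCAS)?(BCAM)?" + SUFFIX)

# Hypothetical alternative: exactly one prefix is required and the whole
# value must match, because of the ^ and $ anchors.
STRICT = re.compile(r"^(pBCA|BCA[LMS])[0-9]{3}[Aa]?[0-9]?[A-Za-z]?$")


def matches(pattern, value):
    """True if the pattern matches anywhere in value (unanchored search)."""
    return pattern.search(value) is not None


# Made-up test values: two ORF-style IDs, one ordered-locus-style ID, one non-ID.
for value in ["BCAL0001", "pBCA001", "BceJ2315_00010", "XYZA123"]:
    print(value,
          matches(CHAR_CLASS, value),
          matches(OPTIONAL_GROUPS, value),
          matches(STRICT, value))
```

In this sketch the character-class pattern accepts XYZA123, and the optional-group pattern accepts both BceJ2315_00010 and XYZA123, while the anchored alternation accepts only the two ORF-style IDs.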