LMU BioDB 2015 - User contributions [en]

Blitvak Individual Assessment and Reflection

2015-12-18T23:36:02Z

Blitvak: big -> bug (typo)

==Statement of Work==
#Describe exactly what you did on the project.
#*I contributed to the Gene Database project by working out the gene ID patterns for our species (''B. cenocepacia'' str. J2315), finding the MOD, running the gene database exports for each modified version of GenMAPP Builder, giving input on the creation of the modified builds, and conducting quality assurance on every exported gene database. I figured out what was going wrong with the initial and second exports of the gene database by inspecting the UniProt XML file in an XML editor; these findings contributed to the final build of GenMAPP Builder by pinpointing a fault in the version we had been using (it was grabbing "ordered locus" type gene IDs instead of "ORF" type, which led to exported databases that accounted for only 337 genes). I also designed the final commands that were used with Match and Postgres (<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code> for Match, and <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code> for Postgres). I worked with Anu, via Q/A, to figure out what the goals should be for each subsequent modified build of GenMAPP Builder. I also used Excel to figure out why the Match count was 5 off from the number of IDs in the final exported database (the 5 discrepant counts were accidental matches of text that was unrelated to gene IDs). I think that my most valuable contribution was the export and validation of all of the databases created in this project (this provided the information used to fix problems with GenMAPP Builder and led to a Gene Database that accounted for all of the desired genes).
#*I also worked with Anu to develop the genome paper presentation, and with the group on the final presentation and final paper (where I focused on the database exports, the database validation procedure, and the IDs).
#Provide references or links to artifacts of your work, such as: Wiki pages, Other files or documents, Code or scripts
===Journals===
*[[Blitvak Week 11|Week 11 Individual Journal]] - Exploration of the MOD and establishment of the gene IDs for J2315
*[[Blitvak Week 12|Week 12 Individual Journal]] - Initial Database Export, Background
*[[Blitvak Week 14|Week 14 Individual Journal]] - Discrepant Match ID analysis with Excel, UniProt XML file exploration that determined the data that should be captured by GenMAPP Builder, Exports of Builds 2, 3, and 4 Gene Databases
*[[Blitvak Week 15|Week 15 Individual Journal]] - Final project work and exploration of the 6993 UniProt entries, compared to the 7121 gene IDs, via PSQL
===Testing Reports===
*[[GÉNialOMICS Gene Database Testing Report (Initial Export)|Initial Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
===Files===
*[[Media:Bc-Std GEN BL12 20151119.zip | Compressed Initial Export .gdb]] - (revealed that only 337 genes ended up in the exported database, all of "ordered locus" type)
*[[Media:Bc-Std Build2 GEN BL14 20151201.zip| Compressed Build 2 Export .gdb]] - (by Anu, build 2 added a species profile for J2315)
*[[Media:Bc-Std GEN BL12 20151119.zip | Compressed Build 3 Export .gdb]] - (by Anu, modifications that allowed the capture of ORF data)
*[[Media:Bc-Std GEN Build4 20151204.zip| Compressed Build 4 Export .gdb]] - (by Anu, fixed a bug with TallyEngine that was not representing the ORF genes)
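
The gene ID pattern quoted in the Statement of Work can be explored with a minimal Python sketch. It assumes that Python's <code>re</code> module treats this particular pattern the same way Match and Postgres do (it uses only literals, character classes, and <code>?</code>). The sample text and the reference ID set are hypothetical and are not taken from the UniProt XML file; the helper names <code>unique_matches</code> and <code>discrepant</code> are also made up for illustration. The sketch shows how an unanchored pattern can pick up text that is not a gene ID, which is the kind of accidental match described above.

```python
import re

# Pattern copied verbatim from the Match and Postgres commands quoted above.
ID_PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")


def unique_matches(text):
    """Return the set of distinct substrings matched by the ID pattern."""
    return set(ID_PATTERN.findall(text))


def discrepant(matches, known_ids):
    """Return the matches that are absent from the reference ID set, sorted."""
    return sorted(set(matches) - set(known_ids))


# Hypothetical toy input (assumption): illustrative strings only.
sample_text = "BCAL0001 BCAM0002 pBCA001 xBCA123, unrelated"
known_ids = {"BCAL0001", "BCAM0002", "pBCA001"}

found = unique_matches(sample_text)
extra = discrepant(found, known_ids)

print(len(found), "distinct matches;", "not in reference set:", extra)
```

In this toy input the extra match is <code>BCA123,</code>: the pattern is not anchored, so it matches inside <code>xBCA123</code>, and the final class <code>[A-Z,a-z]</code> contains a literal comma, so a trailing comma is accepted as part of the match. Whether the 5 discrepant matches in the project arose in the same way is not something this sketch can establish.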
==Assessment of Project==
#What worked and what didn't work?
#*I think that all aspects of this project worked very well, and I feel very good about our final exported database and our biological conclusions! However, it was fairly difficult to both begin and finish the final paper during the last week of class. In retrospect, it would have been easier to start some of the earlier sections (like the introduction and methods) sooner.
#What would you do differently if you could do it all over again?
#*I would have begun work on the final deliverables earlier. Additionally, I realize that it would have been better to open up the UniProt XML immediately after the initial export, when most of the gene IDs were missing from the exported database and from TallyEngine; this would have identified the problem earlier in the project (and would have led to a more functional build sooner).
#Content: What is the quality of the work?
#*I think that the final exported gene database for J2315 is comprehensive and well-built. It exhibits a really good level of quality; however, the TallyEngine results could be slightly tweaked to remove the output of "ordered locus" type gene ID counts (since we focused solely on ORF data, the reference to ordered locus data is not necessary). In short, no major issues exist with the gene database. I also feel that our final presentation was well put together. Regarding the final paper, I feel that it is pretty good, but it was written under significant time constraints; I think that it would have shown a higher level of polish if it had been started earlier or if more time had been available. I feel that only minor issues exist with the final paper.
#Organization: Comment on the organization of the project and of your group's wiki pages.
#*I think that our pages are well organized, but I do feel that the testing reports should be placed under their own section/heading. Having weekly summary tables really helped in organizing our workflow and in planning future assignments. With respect to the organization of the project, I would say that we were pretty well organized; all files are accounted for and (from what I saw) all work was documented.
#Completeness: Did your team achieve all of the project objectives? Why or why not?
#*We achieved all of our objectives thanks to a lot of collaboration and some luck (we found some pre-existing GenMAPP Builder code that dealt with the same issue we were encountering; modifying it saved us a lot of time and left more time for Q/A and for biological analysis via GenMAPP). I really do feel that the whole team was very motivated and excited about this project; that made a lot of difference when we were completing goals/objectives.
==Reflection on the Process==
===What did you learn?===
#With your head (biological or computer science principles)
#*I learned a lot about the functioning, maintenance, and development of biological databases; I also learned a lot about the peer review process through our review of the NAR paper (which was a really exciting project and a new experience). I learned many CS concepts related to text analysis/modification and database creation, and much about how code works and how computers behave. I also learned about the importance of reproducibility and documentation in research through the Baggerly and Coombes example regarding the Duke case (it really drove home the point that data needs to be properly maintained, formatted, and checked; the steps that lead to the conclusions of research are as important as the conclusions themselves). The Duke case was the first severe case of research fabrication, with serious effects on the health of numerous people, that I became aware of. Through the use of GenMAPP, I came to understand more about bioinformatics and about the value of analyzing gene data with such a program (it made the biological meaning of the data much clearer and easier to visualize).
#With your heart (personal qualities and teamwork qualities that make things work or not work)?
#*I came to appreciate biology in light of computer science principles (DNA as biological code). I also learned to communicate and collaborate better with teammates; I feel that teamwork was really crucial in this project (much more so than in most "class" projects). Having defined roles for each team member made collaboration a necessity and, through this project, I feel that I have become a better team member. I have learned more about the importance of good communication and the value of dividing work based on skill set. This class also made me a bit more determined and keener on independent exploration (through the weekly assignments and the somewhat open-ended final project). It also made me realize that I can constructively criticize the work of researchers (with respect to content, statistics, and reproducibility). Seeing the weird statistics (strange significance criteria) in our microarray paper made me realize that a significant amount of research isn't flawless.
#With your hands (technical skills)?
#*I learned a lot of skills related to manipulating text via the command line and to creating and quality-assuring databases, and I feel that I became a lot more fluent in Excel. I have also learned how to process microarray data and how to analyze it, biologically, using a program like GenMAPP (and GO terms). I also learned how to manage and manipulate data via Postgres tables.
#What lesson will you take away from this project that you will still use a year from now?
#*I really learned the importance of documentation and of research reproducibility (and of good habits related to the management of data). A year from now, I feel that I will still be reading and working with research papers and, using the skills and insights that this class provided, I think I will be able to critically evaluate the work that was conducted (especially with respect to the provided "workflow"). I think that I will also continue to apply the data management skills that this class taught (keeping earlier versions, noting dates, and using clear labels). Regarding the group project, I feel that I learned the importance of group communication and of collaboration. A year from now, I will continue to work with other people, and I feel that I will still use effective and clear group communication in those situations.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

Blitvak Individual Assessment and Reflection

2015-12-18T23:35:29Z

Blitvak: added templates


GENialOMICS Deliverables

2015-12-18T23:34:53Z

Blitvak: added reflection

[[Image: Genialomics-banner.jpg | center | 1055px]]
 

= Group Files and Datasets =

* [[Media:Bc-Std GEN Build4 20151204.zip|GenMAPP Gene Database for assigned species (''.gdb'') (compressed)]]
* [[Media:ReadMe_Bc-Std_GEN_Build4_20151214_final.pdf|ReadMe file to accompany the Gene Database (''.pdf'')]]
** [[Media: Genialomics-DatabaseSchema-20151211.pdf|Gene Database Schema Diagram]]
* [[media:GÉNialOMICS_Gene_Database_Testing_Report_(Build_4_Export)_-_LMU_BioDB_2015.pdf|Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)]]
* [[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')]]
* [[media:For_genMAPP_KWVP20151205.txt|Data file used for import into GenMAPP (''.txt'' or ''.csv'')]]
* [[media:KWVP20151205.gex|GenMAPP Expression Dataset file (''.gex'')]]
* [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file of data imported into GenMAPP (''.EX.txt'')]]
* Raw MAPPFinder results files (''-GO.txt'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|Increase]]
** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|Decrease]]
* [[media:KWVP20151205.gmf|''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.xlsx|Increase]]
** [[media:Vpkwmappfinder20151205-Criterion1-GO-decreased.xlsx|Decrease]]
* [[media:Oxphosmappkwvp20151212.mapp|Sample MAPP file of a relevant biological pathway for your species (''.mapp'')]]
* [[Media:BioDBFinalReport-Genialomics.pdf | Group Report]]
* [[Media:Genialomics-BioDBFinalPresentation.pdf|Final PowerPoint presentation]]

==Individual Assessments and Reflections==
*[[Media:ReflectionForFinalProject-AV.pdf | Anindita Varshneya]]
*[[Media:KWreflection.pdf | Kevin Wyllie]]
*[[Blitvak Individual Assessment and Reflection| Brandon Litvak]]

 

{{Template:GÉNialOMICS}}

Blitvak Individual Assessment and Reflection

2015-12-18T23:32:26Z

Blitvak: final version


Blitvak Individual Assessment and Reflection

2015-12-18T23:23:18Z

Blitvak: draft 4

==Statement of Work==
#Describe exactly what you did on the project.
#*I contributed to the Gene Database project by figuring out the gene ID patterns related to our species (''B. cenocepacia'' str. J2315), by finding the MOD, by conducting gene database exports for any modified versions of GenMAPP Builder, by providing my input towards the creation of modified builds of GenMAPP Builder, and by conducting quality assurance on any exported gene databases. I figured out what was going wrong with the initial and 2nd export of the gene database by looking into the UniProt XML file via an XML editor; these findings contributed to the creation of the final build of GenMAPP Builder by pinpointing a fault with the previously utilized version of GenMAPP Builder (it was grabbing "ordered locus" type gene IDs instead of "ORF" type, which led to exported databases that only accounted for 337 genes). I also designed the final commands that were used with Postgres and Match (<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code> for Match, and <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code> for Postgres). I worked with Anu, via Q/A, to figure out what the goals should be for the next modified build of GenMAPP Builder. I also used Excel to figure out why the Match results were giving a result that was 5 off from the number of IDs that was represented by the final export database (5 discrepant counts that were accidental matches of text that was unrelated to gene IDs). I think that my most valuable contribution was the export and validation of all of the databases that were created with this project (provided information that was used to fix problems with GenMAPP Builder and led to the creation of a Gene Database that accounted for all of the desired genes). I also worked with Anu to develop the genome paper presentation and with the group in the creation of the final presentation/final paper (focused on the database exports, database validation procedure, and on the IDs).
#Provide references or links to artifacts of your work, such as: Wiki pages, Other files or documents, Code or scripts
===Journals===
*[[Blitvak Week 11|Week 11 Individual Journal]] - Exploration of the MOD and establishment of the gene IDs for J2315
*[[Blitvak Week 12|Week 12 Individual Journal]] - Initial Database Export, Background
*[[Blitvak Week 14|Week 14 Individual Journal]] - Discrepant Match ID analysis with Excel, UniProt XML file exploration that determined the data that should be captured by GenMAPP Builder, Exports of Builds 2, 3, and 4 Gene Databases
*[[Blitvak Week 15|Week 15 Individual Journal]] - Final project work and exploration of the 6993 UniProt entries, compared to the 7121 gene IDs, via PSQL
===Testing Reports===
*[[GÉNialOMICS Gene Database Testing Report (Initial Export)|Initial Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
===Files===
*[[Media:Bc-Std GEN BL12 20151119.zip | Compressed Initial Export .gdb]] - (revealed that only 337 genes ended up in the exported database, all of "ordered locus" type)
*[[Media:Bc-Std Build2 GEN BL14 20151201.zip| Compressed Build 2 Export .gdb]] - (by Anu, build 2 added a species profile for J2315)
*[[Media:Bc-Std_GEN_Build3_20151203.zip| Compressed Build 3 Export .gdb]] - (by Anu, modifications that allowed the capture of ORF data)
*[[Media:Bc-Std GEN Build4 20151204.zip| Compressed Build 4 Export .gdb]] - (by Anu, fixed a bug with TallyEngine that was not representing the ORF genes)
==Assessment of Project==
#What worked and what didn't work?
#*I think that all aspects of this project worked very well and I feel very good about our final exported database and about our biological conclusions! However, it was fairly difficult to begin/finish the final paper during the last week of class. In retrospect, it would have been easier to start some of the earlier sections (like the introduction and methods) sooner rather than later.
#What would you do differently if you could do it all over again?
#*I would have begun work on the final deliverables earlier. Additionally, I realize that it would have been better to open the UniProt XML file immediately after the initial export, where most of the gene IDs were not present in the final database/in TallyEngine; this would have led to the identification of the problem earlier in the project (and to a more functional build earlier).
#Content: What is the quality of the work?
#*I think that the final exported gene database for J2315 is comprehensive and well-built. It exhibits a really good level of quality; however, the TallyEngine results could be slightly tweaked in order to remove the output of "ordered locus" type gene ID counts (since the project focused solely on ORF data, the reference to ordered locus data is not necessary). In short, no major issues exist with the gene database. I also feel that our final presentation was well put together. Regarding the final paper, I feel that it is pretty good but it was written under significant time constraints; I think that it would have shown a higher level of polish if it had been started earlier/if more time had been available. I feel that only minor issues exist with the final paper.
#Organization: Comment on the organization of the project and of your group's wiki pages.
#*I think that our pages are well organized, but I do feel that the testing reports should be placed with their own section/heading. Having weekly summary tables really helped in organizing our workflow and in planning future assignments. With respect to the organization of the project, I would say that we were pretty well organized; all files are accounted for and (from what I saw) all work was documented.
#Completeness: Did your team achieve all of the project objectives? Why or why not?
#*We achieved all of our objectives due to a lot of collaboration and due to some luck (we found some pre-existing GenMAPP Builder code that involved the same issue that we were encountering; its modification saved us a lot of time and gave us more time for Q/A and biological analysis via GenMAPP). I really do feel that the whole team was very motivated and excited about this project; that made a lot of difference when we were completing goals/objectives.
==Reflection on the Process==
===What did you learn?===
#With your head (biological or computer science principles)
#*I learned a lot about the functioning, maintenance, and development of biological databases; I also learned a lot about the peer review process through the review of the NAR paper (which was a really exciting project and a new experience). I learned a lot of CS concepts related to text analysis/modification (and database creation) and much about how code works/computers behave. I also learned a lot about the importance of reproducibility and documentation in research through the Baggerly and Coombes example regarding the Duke case (it really drove home the point that data needs to be properly maintained, formatted, and checked; the conclusions of research are as important as the steps that led to them). The Duke case was the first severe case of research fabrication, with serious effects on the health of numerous people, that I became aware of. Through the use of GenMAPP, I came to understand more about bioinformatics and about the value of analyzing gene data with a program like GenMAPP (it made the biological meaning of data much clearer and easier to visualize).
#With your heart (personal qualities and teamwork qualities that make things work or not work)?
#*I came to appreciate biology in light of computer science principles (DNA as biological code). I also learned to communicate and collaborate better with teammates; I feel that teamwork was really crucial in this project (much more so than most "class" projects). Having defined roles for each team member made collaboration a necessity and, through this project, I feel that I have become a better team member. I have learned more about the importance of good communication and the value of dividing work (based on skill-set). This class also made me a bit more determined and more keen on independent exploration (through the weekly assignments and the somewhat open-ended final project). This class also made me realize that I can constructively criticize the work of researchers (with respect to content, statistics, and reproducibility). Seeing the weird statistics (strange significance criteria) related to our microarray paper made me realize that there is a significant amount of research that isn't flawless.
#With your hands (technical skills)?
#*I learned a lot of skills related to the manipulation of text via the command-line, the process of creation and quality assurance tied to databases, and I feel that I became a lot more fluent in Excel. I have also learned how to process microarray data and how to analyze it, biologically, using a program like GenMAPP (and GO terms). I also learned how to manage and manipulate data via postgres tables.
#What lesson will you take away from this project that you will still use a year from now?
#*I really learned the importance of documentation and of research reproducibility (and of good habits related to the management of data). A year from now, I feel that I will still be reading/working with research papers and, using the skills and insights that this class provided, I think I will be able to consciously evaluate the work that was conducted (especially with respect to the provided "workflow").

Blitvak Individual Assessment and Reflection

2015-12-18T22:15:20Z

Blitvak: draft 2

==Statement of Work==
#Describe exactly what you did on the project.
#*I contributed to the Gene Database project by figuring out the gene ID patterns related to our species (''B. cenocepacia'' str. J2315), finding the MOD, by conducting gene database exports for any modified versions of GenMAPP Builder, by providing my input towards the creation of modified builds of GenMAPP Builder, and by conducting quality assurance on any exported gene databases. I figured out what was going wrong with the initial and 2nd export of the gene database by looking into the UniProt XML file via an XML editor; these findings contributed to the creation of the final, comprehensive, build of GenMAPP Builder by pinpointing a fault with the utilized version of GenMAPP Builder (was grabbing "ordered locus" type gene IDs instead of "ORF" type, which led to exported databases that only accounted for 337 genes). I also designed the final commands that were used with Postgres and Match (<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code> for Match, and <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code> for Postgres). I also used Excel to figure out why the Match results were giving a result that was 5 off from the number of IDs that was represented by the final export database (5 discrepant counts that were accidental matches of text that was unrelated to gene IDs). I think that my most valuable contribution was the export and validation of all of the databases that were created with this project (provided information that was used to fix problems with GenMAPP Builder and led to the creation of a Gene Database that accounted for all of the desired genes).
#Provide references or links to artifacts of your work, such as: Wiki pages, Other files or documents, Code or scripts
===Journals===
*[[Blitvak Week 11|Week 11 Individual Journal]] - Exploration of the MOD and establishment of the gene IDs for J2315
*[[Blitvak Week 12|Week 12 Individual Journal]] - Initial Database Export, Background
*[[Blitvak Week 14|Week 14 Individual Journal]] - Discrepant Match ID analysis with Excel, UniProt XML file exploration that determined the data that should be captured by GenMAPP Builder, Exports of Builds 2, 3, and 4 Gene Databases
*[[Blitvak Week 15|Week 15 Individual Journal]] - Final project work and exploration of the 6993 UniProt entries, compared to the 7121 gene IDs, via PSQL
===Testing Reports===
*[[GÉNialOMICS Gene Database Testing Report (Initial Export)|Initial Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
===Files===
*[[Media:Bc-Std GEN BL12 20151119.zip | Compressed Initial Export .gdb]] - (revealed that only 337 genes ended up in the exported database, all of "ordered locus" type)
*[[Media:Bc-Std Build2 GEN BL14 20151201.zip| Compressed Build 2 Export .gdb]] - (by Anu, build 2 added a species profile for J2315)
*[[Media:Bc-Std_GEN_Build3_20151203.zip| Compressed Build 3 Export .gdb]] - (by Anu, modifications that allowed the capture of ORF data)
*[[Media:Bc-Std GEN Build4 20151204.zip| Compressed Build 4 Export .gdb]] - (by Anu, fixed a bug with TallyEngine that was not representing the ORF genes)
==Assessment of Project==
#What worked and what didn't work?
#*I think that all aspects of this project worked very well and I feel very good about our final exported database and about our biological conclusions! However, it was fairly difficult to begin/finish the final paper during the last week of class. In retrospect, it would have been easier to start some of the earlier sections (like the introduction and methods) sooner rather than later.
#What would you do differently if you could do it all over again?
#*I would have begun work on the final deliverables earlier. Additionally, I realize that it would have been better to open the UniProt XML file immediately after the initial export, where most of the gene IDs were not present in the final database/in TallyEngine; this would have led to the identification of the problem earlier in the project (and to a more functional build earlier).
#Content: What is the quality of the work?
#*I think that the final exported gene database for J2315 is comprehensive and well-built. It exhibits a really good level of quality; however, the TallyEngine results could be slightly tweaked in order to remove the output of "ordered locus" type gene ID counts (since the project focused solely on ORF data, the reference to ordered locus data is not necessary). In short, no major issues exist with the gene database. I also feel that our final presentation was well put together. Regarding the final paper, I feel that it is pretty good but it was written under significant time constraints; I think that it would have shown a higher level of polish if it had been started earlier/if more time had been available. I feel that only minor issues exist with the final paper.
#Organization: Comment on the organization of the project and of your group's wiki pages.
#*I think that our pages are well organized, but I do feel that the testing reports should be placed with their own section/heading. Having weekly summary tables really helped in organizing our workflow and in planning future assignments. With respect to the organization of the project, I would say that we were pretty well organized; all files are accounted for and (from what I saw) all work was documented.
#Completeness: Did your team achieve all of the project objectives? Why or why not?
#*We achieved all of our objectives due to a lot of collaboration and due to some luck (we found some pre-existing GenMAPP Builder code that involved the same issue that we were encountering; its modification saved us a lot of time and gave us more time for Q/A and biological analysis via GenMAPP). I really do feel that the whole team was very motivated and excited about this project; that made a lot of difference when we were completing goals/objectives.
==Reflection on the Process==
===What did you learn?===
#With your head (biological or computer science principles)
#*I learned a lot about the functioning, maintenance, and development of biological databases; I also learned a lot about the peer review process through the review of the NAR paper (which was a really exciting project and a new experience). I learned a lot of CS concepts related to text analysis/modification (and database creation) and much about how code works/computers behave.
#With your heart (personal qualities and teamwork qualities that make things work or not work)?
#*I came to appreciate biology in light of computer science principles (DNA as biological code). I also learned to communicate and collaborate better with teammates; I feel that teamwork was really crucial in this project (much more so than most "class" projects). Having defined roles for each team member made collaboration a necessity and, through this project, I feel that I have become a better team member.
#With your hands (technical skills)?
#*I learned a lot of skills related to the manipulation of text via the command-line, the process of creation and quality assurance tied to databases, and I feel that I became a lot more fluent in Excel.
#What lesson will you take away from this project that you will still use a year from now?
#*I really learned the importance of documentation and of research reproducibility. This course taught me a lot about proper data management and

Blitvak Individual Assessment and Reflection

2015-12-18T07:03:02Z

Blitvak: draft 1 of reflection deliverable

File:GÉNialOMICS Gene Database Testing Report (Build 4 Export) - LMU BioDB 2015.pdf

2015-12-17T21:29:06Z

Blitvak: Blitvak uploaded a new version of File:GÉNialOMICS Gene Database Testing Report (Build 4 Export) - LMU BioDB 2015.pdf

testing report for build 4, printed as pdf

File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png

2015-12-17T21:26:45Z

Blitvak: Blitvak uploaded a new version of File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png

output of match, week 14, bl

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-17T21:18:19Z

Blitvak: cleaned up the testing report in preparation for the final paper

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: The file was exported without any major issues; however, the export took significantly longer than the previous exports. This is likely because the workstation had, for some period of time, entered "sleep" mode (the export was delayed until the computer was taken out of "sleep").

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*The following command was entered into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs
*The discrepant IDs, previously identified, are: bca199f, bca5253f, bca636c, bcad837b, bcal0235a, and bcal0239a
*bca199f, bca5253f, bca636c, and bcad837b were found to be a part of a sequence of letters and numbers under the label of "checksum"; these appeared to have been accidentally captured by the utilized Match command.
*bcal0235a and bcal0239a follow the previously identified gene name patterns; however, they both show up as database reference IDs (database references to STRING, which is a database of known and predicted protein interactions); these data will be ignored as they do not refer to a UniProt entry.
*Excluding these 5 accidental matches, the results found using the Match utility are the same as what was found using TallyEngine
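
The way a checksum string can be captured by the Match pattern can be illustrated with a short Python sketch. This is not how XMLPipeDB match is implemented; it only assumes case-insensitive matching with unique results (which the lowercase IDs above suggest), and the XML fragment, including the checksum value, is hypothetical.

```python
import re

# Pattern copied from the Match command above. The comma inside [A-Z,a-z]
# is a literal comma in the character class, not a separator.
GENE_ID = re.compile(
    r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?", re.IGNORECASE
)

# Assumed element layout for ORF gene names in the XML (illustrative only).
ORF_NAME = re.compile(r'<name type="ORF">([^<]+)</name>')


def unique_matches(text):
    """Return every distinct match anywhere in the text, lowercased."""
    return {m.group(0).lower() for m in GENE_ID.finditer(text)}


def orf_ids(text):
    """Return only the IDs that appear as the full text of an ORF name element."""
    return {v.lower() for v in ORF_NAME.findall(text) if GENE_ID.fullmatch(v)}


# Hypothetical fragment, not taken from the real UniProt XML file.
sample = """
<gene><name type="ORF">BCAL0001</name></gene>
<gene><name type="ORF">BCAM0005</name></gene>
<sequence checksum="7BCA199F00000000"/>
"""

found = unique_matches(sample)
accidental = found - orf_ids(sample)
print(sorted(found))
print(sorted(accidental))
```

Matches that appear in the whole-file search but not inside a gene name element are the candidates for accidental matches, which is the comparison that was done by hand in Excel.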
==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.
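
The logic of the two count queries can be sketched in Python without a database connection. The rows below are hypothetical stand-ins for the <code>genenametype</code> table, and <code>count_matching</code> is an illustrative helper, not part of GenMAPP Builder or pgAdmin III.

```python
import re

# Hypothetical (type, value) rows standing in for the genenametype table.
rows = [
    ("ordered locus", "BceJ2315_0001"),
    ("ordered locus", "BceJ2315_0002"),
    ("ORF", "BCAL0001"),
    ("ORF", "BCAM0005"),
    ("ORF", "pBCA001"),
]


def count_matching(rows, name_type, pattern):
    """Count rows of one type whose value contains a match for pattern.

    PostgreSQL's ~ operator is a case-sensitive, unanchored regular
    expression match; re.search behaves the same way for these patterns.
    """
    regex = re.compile(pattern)
    return sum(1 for t, v in rows if t == name_type and regex.search(v))


ordered_locus_count = count_matching(
    rows, "ordered locus", r"BceJ2315_[0-9][0-9][0-9][0-9]"
)
orf_count = count_matching(
    rows, "ORF", r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?"
)
print(ordered_locus_count, orf_count)
```

Because the match is unanchored, a value only needs to contain the pattern somewhere; filtering on <code>type</code> is what keeps the two counts separate.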

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should bring into light any issues or differences that could be the result of utilizing an updated version of the modified GenMAPP builder.
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: It was noticed that the OriginalRowCounts table in this export is identical to the one that came from the Build 3 export. This seems to suggest that the only fundamental difference between the two builds of GenMAPP Builder lies with TallyEngine (this makes sense, considering that build 4 focused upon fixing problems with TallyEngine and improper code).
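
The table-by-table comparison against the benchmark can be expressed as a small Python sketch. The comparison in this report was done visually; the dictionaries below are hypothetical placeholders (only the 7121 OrderedLocusNames figure comes from this report), and <code>diff_row_counts</code> is an illustrative name.

```python
def diff_row_counts(benchmark, candidate):
    """Return {table: (benchmark_count, candidate_count)} for every table
    whose row count differs, including tables present in only one export."""
    tables = sorted(set(benchmark) | set(candidate))
    return {
        t: (benchmark.get(t), candidate.get(t))
        for t in tables
        if benchmark.get(t) != candidate.get(t)
    }


# Placeholder counts; the real values are in the OriginalRowCounts screenshots.
build3 = {"OrderedLocusNames": 7121, "UniProt": 100, "RefSeq": 50}
build4 = {"OrderedLocusNames": 7121, "UniProt": 100, "RefSeq": 50}

differences = diff_row_counts(build3, build4)
print(differences if differences else "OriginalRowCounts tables are identical")
```

An empty result corresponds to the observation above that the two exports have identical OriginalRowCounts tables.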

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table, like in Build 3, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build4_20151204.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset'' > ''GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened in Excel and the exceptions were found to be identical to those seen with the Build 3 export.
====Exceptions Analysis====
*'''Note''': This analysis is sourced from the [[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export]]
*The EX.txt file was opened in Excel and the error code for all of the exceptions was found to be: ''Gene not found in OrderedLocusNames or any related system.'' The Gene IDs were sorted by error and the problematic IDs were analyzed. It was found, through the find function, that 101 of the exceptions were due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions were found (via UniProt KB searches) to represent genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
**BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
**BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
**BCAM0787: No results in UniProt KB. No results in MOD.
**BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
**BCASr0743a: No results in UniProt KB. No results in MOD.
*Note: The exceptions file contained error inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names that contained unusual formatting (BCAL0563_J_0, and BCAL0563_J_1, for example) were found to represent genes that were covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching the "fixed" gene names).
*[[Media:Exceptions_analysis_from_GenMAPP_builder_GEN_BL15_20151211.xlsx|Excel Workbook utilized in visualizing the GenMAPP exceptions]]
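The "fixing" of unusually formatted gene names described above can be sketched in Python. This is only an illustration, not the procedure that was actually used (the analysis was done by hand in Excel). It assumes that the unusual suffix always begins at the first underscore, as in BCAL0563_J_0; the function names are hypothetical.

```python
def normalize_gene_id(raw_id):
    """Strip an unusual suffix such as "_J_0" from a gene ID.

    Assumption: the suffix always starts at the first underscore.
    """
    return raw_id.split("_", 1)[0]


def split_exceptions(ids):
    """Separate IDs with unusual formatting from plainly formatted ones."""
    reformatted = {}  # original ID -> "fixed" ID to search in the MOD/UniProt
    plain = []        # IDs that already follow the usual format
    for raw in ids:
        fixed = normalize_gene_id(raw)
        if fixed != raw:
            reformatted[raw] = fixed
        else:
            plain.append(raw)
    return reformatted, plain


# Sample IDs taken from the exceptions discussed above
sample = ["BCAL0563_J_0", "BCAL0563_J_1", "BCAL2591", "BCALr0080"]
reformatted, plain = split_exceptions(sample)
```

The "fixed" names in <code>reformatted</code> would still need to be searched in the MOD or UniProt KB; the sketch only automates the renaming step.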
===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6994 entries corresponding to J2315 proteins were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. The count of 7121 genes in the exported database appears reasonable in light of the MOD and UniProt data. Since UniProt is protein-centric, its count of 6994 covers only protein entries; it is likely that some protein entries have several related gene names, which would explain why more gene names than proteins were found. The MOD was found, earlier, to be manually curated, and it is possible that the difference between the MOD count of 7114 and the found count of 7121 is due to the MOD missing a few genes that are present in other databases, like UniProt.
*Note: The IDs and counts covered by this export appear to be consistent with outside resources.
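The size of the differences between the three counts can be made explicit with a short calculation. The numbers are the ones reported above; the variable names are only illustrative.

```python
# Counts reported in the comparison above
gdb_ordered_locus_names = 7121   # IDs in the exported gene database
uniprot_protein_entries = 6994   # J2315 protein entries in UniProt KB
mod_coding_sequences = 7114      # coding sequences listed in the MOD

# Gene names in the database beyond the number of UniProt protein entries
extra_vs_uniprot = gdb_ordered_locus_names - uniprot_protein_entries

# Gene names in the database beyond the number of MOD coding sequences
extra_vs_mod = gdb_ordered_locus_names - mod_coding_sequences

print(extra_vs_uniprot, extra_vs_mod)
```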
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GENialOMICS

2015-12-15T07:01:45Z

Blitvak: added files for the week

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
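The naming convention above can be sketched as a small Python helper. This is only an illustration of the rules as written; the function names are hypothetical, and two details are assumptions: the tag is placed before the file extension (as in uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml), and the zip of originals gets a ".zip" extension.

```python
from datetime import date
from pathlib import PurePath


def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd from an original filename.

    version=0 is the first version used that day; additional versions
    get "(1)", "(2)", ... appended. Assumption: the tag goes before the
    file extension.
    """
    path = PurePath(original)
    name = f"{path.stem}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name + path.suffix


def originals_zip_name(initials, week):
    """Name of the compressed zip holding all original, unmodified files."""
    return f"ORIG_GEN_{initials}{week}.zip"


first = project_filename("uniprot-taxonomy%3A216591.xml", "BL", 12, date(2015, 11, 19))
third = project_filename("notes.txt", "BL", 11, date(2015, 11, 12), version=2)
originals = originals_zip_name("BL", 12)
```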
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis using the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. I think the only idea I have moving forward is a little bit more planning in regards to how we plan to attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed to work, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al's paper. However, they had the same directions and generally saw the same relative trends (ie the relatively higher fold changes in the paper were among the higher in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that results in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al's paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and if there is no source of error on our part, we continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason had to do with the fact that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that happened to work fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Week 15=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Fix any problems with the build
* Finish presentation
* Start paper
|
* Work with Anu to fix any issues with the final gene database
* Finish presentation
* Prepare for presentation
* Work on final paper
* Conduct final checks on last export (check testing report, make any adjustments)
|
''Work with Kevin Wyllie''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|
''Work with Veronica Pacheco''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|-
!scope="row"|'''Progress'''
|
* Created a new build to fix problem in TallyEngine.
* Contributed to final presentation and practiced speaking.
* Began drafting final report of findings from this project.
* Created ReadMe file.
* Modified Gene Database Schema for B. cenocepacia.
|
* Checked final export: everything appears in order
* Checked new build of GenMAPP builder
* Realized why the UniProt count of 6994 is so different from the one reported by our database (difference between protein entry/gene name)
* Finalized readme
* Helped in finalizing the slides
* Practiced presentation
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 15]]
|
*[[Blitvak Week 15]]
|
*[[Vpachec3 Week 15]]
|
*[[Kevin Wyllie Week 15]]
|-
!scope="row"|'''Files Used/Created'''
|
*[[Media:ReadMe_Bc-Std_GEN_Build4_20151204.doc.zip | readMe_Bc-Std_GEN_Build4_20151204.doc.zip]] 
*[[Media:gmbuilder-genialomics-20151210-build-5.zip | gmbuilder-genialomics-20151210-build-5.zip]] 
*[[Media:Genialomics-DatabaseSchema-20151211.pdf | Genialomics-DatabaseSchema-20151211.pdf]]
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
*[[Media:ReadMe Bc-Std GEN Build4 20151214 final.pdf| Final Readme]]
*[[Media:GÉNialOMICS Gene Database Testing Report (Build 4 Export) - LMU BioDB 2015.pdf| Final Testing Report, as PDF]]
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
* Generated sanity check table.
* Upload deliverables.
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|}

==Other Progress==
* On December 14, 2015, GENialOMICS completed a presentation summarizing their methods and findings.
** [[Media:Genialomics-BioDBFinalPresentation.pdf | GENialOMICS Final Presentation]]

=Deliverables=
*See [[GENialOMICS Deliverables|GENialOMICS Deliverables]]
=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''Well I first looked at this article through web of science which LMU does pay for but looking at the article through PubMed, PubMed central and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? Yes.
* Do the authors own the copyright? No.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Blitvak Week 15

2015-12-15T06:52:54Z

Blitvak: added slides

==Work conducted on 12/8==
*The last gene database [[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|testing report]] was reviewed
*It was noticed, once again, that UniProtKB reported 6994 distinct entries while 7121 gene names were found to be in the XML/final database
**This represents a discrepancy of 127
===PSQL investigation of the 127 count discrepancy===
*It was decided that this discrepancy would be investigated through PSQL queries on the initial export Postgres database (which contained all of the 7121 ORF entries of interest) <----Thank you Dondi for the commands/investigation!---->
*The initial export PSQL database, B.cenocepacia_J2315_20151119_gmb3build5, was booted up
*The previously utilized command <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]';</code> was condensed down to <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code>; the modified command was executed to confirm the count of 7121
*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid;</code> was executed to verify the number of UniProt protein entries that are covered by the ORF names; it was found that 6993 entries were present, which is one fewer than the 6994 reported by UniProt.
**[http://www.uniprot.org/ UniProtKB was checked in order to find this missing entry], using a search [http://www.uniprot.org/uniprot/?query=taxonomy%3A%22Burkholderia+cenocepacia+%28strain+ATCC+BAA-245+%2F+DSM+16553+%2F+LMG+16656+%2F+NCTC+13227+%2F+J2315+%2F+CF5610%29+%28Burkholderia+cepacia+%28strain+J2315%29%29+%5B216591%5D%22+NOT+gene%3Abca*+NOT+gene%3Apbca*&sort=score query] that looked for entries that lacked gene names which are represented by <code>BCA*</code> and <code>pBCA*</code>. It was found that one UniProt entry lacked these usual gene name IDs
**This entry was described as being a '''Proteolysis tag peptide encoded by tmRNA Burkh_cenoc_J2315'''; it appears to be encoded by a transfer-messenger RNA gene, and given that, it lacks a proper gene ID. This peptide will be ignored since it is associated with a functional RNA gene.
*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1;</code> was executed in order to see if any UniProt protein entries were represented by multiple gene names; it was found that numerous entries had a count greater than 1 for corresponding gene names.
*The following command was executed in order to find the total count of gene names represented by the protein entries that had a greater than 1 count:
<code>select sum(dupe_count) from (select genetype_name_hjid, count(value) as dupe_count</code>
<code>from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'</code>
<code>group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>
*This led to a sum of 205, which also included the first gene names covered by each entry (not just "extras")
*The following command was executed in order to count the protein entries that have more than one gene name; each of these entries contributes exactly one non-"extra" (first) gene name to the sum above:
<code>select count(genetype_name_hjid) from (select genetype_name_hjid, count(value) as dupe_count</code>
<code>from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'</code>
<code>group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>
*This command led to a count of 77 records
**The PSQL database represents 6993 protein entries but 7121 gene names, a difference of 128. The 77 entries that have more than one gene name account for 205 gene names in total; subtracting one "first" gene name per entry (77) leaves 205 - 77 = 128 "extra" gene names, which matches that difference exactly (6993 + 128 = 7121). It is now apparent that the difference between the number of UniProt entries and the number of gene name IDs is due to the fact that some proteins are covered by several different gene name IDs.
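The bookkeeping behind the three queries above can be sketched in Python. This is only an illustration of the counting logic, not what was run against Postgres: the function name <code>tally</code> and the toy rows are hypothetical, and only the four numbers at the bottom (6993, 7121, 205, 77) come from the journal.

```python
from collections import Counter

def tally(rows):
    """Summarize (entry_id, gene_name) pairs, mirroring the GROUP BY queries.

    entry_id plays the role of genetype_name_hjid and gene_name the role of value.
    """
    per_entry = Counter(entry_id for entry_id, _ in rows)
    # Entries with more than one gene name (the HAVING count(value) > 1 filter)
    multi = {e: n for e, n in per_entry.items() if n > 1}
    return {
        "gene_names": sum(per_entry.values()),           # like count(*)
        "entries": len(per_entry),                       # rows returned by GROUP BY
        "multi_entries": len(multi),                     # like count(genetype_name_hjid)
        "multi_gene_names": sum(multi.values()),         # like sum(dupe_count)
        "extras": sum(multi.values()) - len(multi),      # names beyond the first per entry
    }

# Hypothetical toy rows: entry 1 has one name, entry 2 has two, entry 3 has three.
toy_rows = [
    (1, "BCAL0001"),
    (2, "BCAL0002"), (2, "BCAL0002a"),
    (3, "BCAM0003"), (3, "BCAM0004"), (3, "BCAM0005"),
]
toy = tally(toy_rows)

# Numbers reported in the journal entry above.
entries, gene_names = 6993, 7121
multi_gene_names, multi_entries = 205, 77
extras = multi_gene_names - multi_entries
print(extras, entries + extras == gene_names)
```

In both the toy data and the reported numbers, the entry count plus the "extras" equals the gene name count, which is the relationship the investigation relied on.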
==Work conducted on 12/10==
*The presentation for the project was worked on
*I checked in with Anu, Kevin, and Veronica regarding the future deliverables
*I helped in making the readme file deliverable
==Work conducted on 12/11==
*The final [[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|testing report]] was reviewed again, cleaned up, and slightly modified with the findings on 12/8 in mind.
*Further work was done on the presentation slides
*It was settled upon that the final commands that will be utilized for the presentation are:
**<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>, and <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code>.
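A rough way to see how the final pattern behaves is to try it with Python's <code>re</code> module. This is a sketch under assumptions: Python's regex engine stands in for the engines used by Match and Postgres (details may differ), the helper name <code>matches</code> is hypothetical, and the sample strings are made-up illustrations rather than IDs taken from the database. Like the Postgres <code>~</code> operator, <code>finditer</code> is unanchored, so the pattern can match anywhere inside a longer string.

```python
import re

# Final pattern from the journal, unchanged.
FINAL = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?"
pattern = re.compile(FINAL)

def matches(text):
    """Return every non-overlapping match of the pattern found in text."""
    return [m.group(0) for m in pattern.finditer(text)]

# Made-up ID-like strings that follow the general shape of the pattern.
print(matches("BCAL0001 BCAM0012a pBCA055"))

# Inside a character class a comma is a literal, so [A-Z,a-z] also accepts ",".
print(re.fullmatch(r"[A-Z,a-z]", ",") is not None)   # comma is accepted
print(re.fullmatch(r"[A-Za-z]", ",") is not None)    # comma is rejected

# Because the match is unanchored and the last class accepts a comma,
# surrounding text can be pulled into a match.
print(matches("id=BCAL0001,next"))
```

Unanchored matching and the literal comma in the last character class are two ways a pattern like this can pick up text beyond the ID itself, which is worth keeping in mind when a match count is compared against a database count.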
==Work conducted on 12/14==
*Met with the group and reviewed the presentation slides (made some minor alterations) and practiced the presentation
*Checked the final readme and finalized it
===Final Presentation Slides===
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GENialOMICS

2015-12-15T06:52:07Z

Blitvak: added slides

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline was prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to any additional versions (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
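The naming rules above can be sketched as a small Python helper. This is a minimal sketch, not a tool the group used: the function names <code>tagged_name</code> and <code>originals_archive_name</code> are hypothetical, and keeping the file extension at the very end (after the date and any version number) is an assumption based on filenames such as <code>uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml</code>.

```python
from datetime import date
from pathlib import PurePath

def tagged_name(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, keeping the original extension.

    version=0 is the first file of the day; 1, 2, ... mark additional versions.
    Placing "(n)" before the extension is an assumption, not a stated rule.
    """
    p = PurePath(original)
    suffix = f"({version})" if version > 0 else ""
    return f"{p.stem}_GEN_{initials}{week}_{day:%Y%m%d}{suffix}{p.suffix}"

def originals_archive_name(initials, week):
    """Name of the zip that holds all original, unmodified files for a week."""
    return f"ORIG_GEN_{initials}{week}.zip"

print(tagged_name("uniprot-taxonomy%3A216591.xml", "BL", 12, date(2015, 11, 19)))
print(tagged_name("data.txt", "BL", 11, date(2015, 11, 10), version=2))
print(originals_archive_name("BL", 12))
```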
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be done to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized TallyEngine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an Excel MATCH analysis was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. I think the only idea I have moving forward is a little bit more planning in regard to how we plan to attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has failed to work, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al's paper. However, they had the same directions and generally saw the same relative trends (ie the relatively higher fold changes in the paper were among the higher in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that results in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al's paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, we continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason had to do with the fact that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that happened to work fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Week 15=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Fix any problems with the build
* Finish presentation
* Start paper
|
* Work with Anu to fix any issues with the final gene database
* Finish presentation
* Prepare for presentation
* Work on final paper
* Conduct final checks on last export (check testing report, make any adjustments)
|
''Work with Kevin Wyllie''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|
''Work with Veronica Pacheco''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|-
!scope="row"|'''Progress'''
|
* Created a new build to fix problem in TallyEngine.
* Contributed to final presentation and practiced speaking.
* Began drafting final report of findings from this project.
* Created ReadMe file.
* Modified Gene Database Schema for B. cenocepacia.
|
* Checked final export: everything appears in order
* Checked new build of GenMAPP builder
* Realized why the UniProt count of 6994 is so different from the one reported by our database (difference between protein entry/gene name)
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 15]]
|
*[[Blitvak Week 15]]
|
*[[Vpachec3 Week 15]]
|
*[[Kevin Wyllie Week 15]]
|-
!scope="row"|'''Files Used/Created'''
|
[[Media:ReadMe_Bc-Std_GEN_Build4_20151204.doc.zip | readMe_Bc-Std_GEN_Build4_20151204.doc.zip]] 
[[Media:gmbuilder-genialomics-20151210-build-5.zip | gmbuilder-genialomics-20151210-build-5.zip]] 
[[Media:Genialomics-DatabaseSchema-20151211.pdf | Genialomics-DatabaseSchema-20151211.pdf]]
[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
* Generated sanity check table.
* Upload deliverables.
[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|}

==Other Progress==
* On December 14, 2015, GENialOMICS completed a presentation summarizing their methods and findings.
** [[Media:Genialomics-BioDBFinalPresentation.pdf | GENialOMICS Final Presentation]]

=Deliverables=
*See [[GENialOMICS Deliverables|GENialOMICS Deliverables]]
=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''Well I first looked at this article through web of science which LMU does pay for but looking at the article through PubMed, PubMed central and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

File:BioDB Final Presentation GEN 20151214.pdf

2015-12-15T06:50:40Z

Blitvak: slides for final presentation as pdf, genialomics

slides for final presentation as pdf, genialomics

GENialOMICS Deliverables

2015-12-15T06:33:25Z

Blitvak: added links to my stuff (readme, testing report)

== Group Files and Datasets ==

* [[Media:Bc-Std GEN Build4 20151204.zip|GenMAPP Gene Database for assigned species (''.gdb'') (compressed)]]
* [[Media:ReadMe_Bc-Std_GEN_Build4_20151214_final.pdf|ReadMe file to accompany the Gene Database (''.pdf'')]]
** ReadMe includes Gene Database Schema Diagram
* [[media:GÉNialOMICS_Gene_Database_Testing_Report_(Build_4_Export)_-_LMU_BioDB_2015.pdf|Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)]]
* [[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')]]
* [[media:For_genMAPP_KWVP20151205.txt|Data file used for import into GenMAPP (''.txt'' or ''.csv'')]]
* [[media:KWVP20151205.gex|GenMAPP Expression Dataset file (''.gex'')]]
* [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file of data imported into GenMAPP (''.EX.txt'')]]
* Raw MAPPFinder results files (''-GO.txt'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|Increase]]
** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|Decrease]]
* [[media:KWVP20151205.gmf|''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.xlsx|Increase]]
** [[media:Vpkwmappfinder20151205-Criterion1-GO-decreased.xlsx|Decrease]]
*[[media:Oxphosmappkwvp20151212.mapp|Sample MAPP file of a relevant biological pathway for your species (''.mapp'')]]
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* [[Media:Genialomics-BioDBFinalPresentation.pdf|Final PowerPoint presentation]]

File:GÉNialOMICS Gene Database Testing Report (Build 4 Export) - LMU BioDB 2015.pdf

2015-12-15T06:32:12Z

Blitvak: testing report for build 4, printed as pdf

testing report for build 4, printed as pdf

File:ReadMe Bc-Std GEN Build4 20151214 final.pdf

2015-12-15T06:24:02Z

Blitvak: final readme for deliverables, in pdf format

final readme for deliverables, in pdf format

File:ReadMe Bc-Std GEN Build4 20151214.zip

2015-12-15T06:18:31Z

Blitvak: final readme for deliverables section (w/ schema) week 15, bl

final readme for deliverables section (w/ schema) week 15, bl

Blitvak Week 15

2015-12-15T06:10:12Z

Blitvak: added work done 12/14

==Work conducted on 12/8==
*The last gene database [[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|testing report]] was reviewed
*It was noticed, once again, that UniProtKB reported 6994 distinct entries while 7121 gene names were found to be in the XML/final database
**This represents a discrepancy of 127
===PSQL investigation of the 127 count discrepancy===
*It was decided that this discrepancy would be investigated through PSQL queries on the initial export Postgres database (which contained all of the 7121 ORF entries of interest) <----Thank you Dondi for the commands/investigation!---->
*The initial export PSQL database, B.cenocepacia_J2315_20151119_gmb3build5, was booted up
*The previously utilized command <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]';</code> was condensed down to <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code>; the modified command was executed to confirm the count of 7121
*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid;</code> was executed to count the UniProt protein entries that are covered by the ORF names; it was found that 6993 entries were present, which is one fewer than the 6994 reported by UniProt.
**[http://www.uniprot.org/ UniProtKB was checked in order to find this missing entry], using a search [http://www.uniprot.org/uniprot/?query=taxonomy%3A%22Burkholderia+cenocepacia+%28strain+ATCC+BAA-245+%2F+DSM+16553+%2F+LMG+16656+%2F+NCTC+13227+%2F+J2315+%2F+CF5610%29+%28Burkholderia+cepacia+%28strain+J2315%29%29+%5B216591%5D%22+NOT+gene%3Abca*+NOT+gene%3Apbca*&sort=score query] that looked for entries that lacked gene names which are represented by <code>BCA*</code> and <code>pBCA*</code>. It was found that one UniProt entry lacked these usual gene name IDs
**This entry was described as being a '''Proteolysis tag peptide encoded by tmRNA Burkh_cenoc_J2315'''; it appears to be encoded by a transfer-messenger RNA gene, and given that, it lacks a proper gene ID. This peptide will be ignored since it is associated with a functional RNA gene.
*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1;</code> was executed in order to see if any UniProt protein entries were represented by multiple gene names; it was found that numerous entries had a count greater than 1 for corresponding gene names.
*The following command was executed in order to find the total count of gene names belonging to the protein entries that have more than one gene name:
<code>select sum(dupe_count) from (select genetype_name_hjid, count(value) as dupe_count</code>
<code>from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'</code>
<code>group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>
*This led to a sum of 205, which counts every gene name of those entries, including the first gene name of each entry (not just the "extras")
*The following command was executed in order to count the protein entries that have more than one gene name; each such entry contributes exactly one first (non-"extra") gene name:
<code>select count(genetype_name_hjid) from (select genetype_name_hjid, count(value) as dupe_count</code>
<code>from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'</code>
<code>group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>
*This command led to a count of 77 records
**Given that 6993 entries are represented in the PSQL database, there is a difference of 128 with respect to the 7121 gene name count (127 relative to the 6994 UniProt entries, since one entry has no matching gene name). The entries with more than one gene name account for 205 gene names in total; subtracting one first (non-"extra") gene name for each of these 77 entries leaves 205 - 77 = 128 extra gene names, which matches the difference. The gap between the number of UniProt entries and the number of gene name IDs therefore arises because some proteins are covered by several different gene name IDs.
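The counting argument above can be checked on a toy table. The following is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for the Postgres export. The six rows are made-up values; only the table and column names (<code>genenametype</code>, <code>genetype_name_hjid</code>, <code>type</code>, <code>value</code>) come from the queries above, and the regular-expression filter is left out because SQLite has no built-in <code>~</code> operator.

```python
import sqlite3

# In-memory stand-in for the exported database (assumption: toy data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table genenametype (genetype_name_hjid integer, type text, value text)"
)
rows = [
    (1, "ORF", "BCAL0001"),                          # entry with one gene name
    (2, "ORF", "BCAL0002"), (2, "ORF", "BCAL0003"),  # entry with two gene names
    (3, "ORF", "BCAM0001"), (3, "ORF", "BCAM0002"), (3, "ORF", "BCAM0003"),
]
conn.executemany("insert into genenametype values (?, ?, ?)", rows)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Subquery shared by the two "dupe" queries: entries with more than one name.
dupes = (
    "(select genetype_name_hjid, count(value) as dupe_count "
    "from genenametype where type = 'ORF' "
    "group by genetype_name_hjid having count(value) > 1) as dupe_tally"
)

total_names = scalar("select count(*) from genenametype where type = 'ORF'")
entries = scalar(
    "select count(distinct genetype_name_hjid) from genenametype where type = 'ORF'"
)
names_in_multi = scalar(f"select sum(dupe_count) from {dupes}")
multi_entries = scalar(f"select count(genetype_name_hjid) from {dupes}")

# Extra names counted two ways: overall, and within multi-name entries only.
extras_overall = total_names - entries
extras_from_dupes = names_in_multi - multi_entries
print(extras_overall, extras_from_dupes)

# The counts reported above satisfy the same identity.
assert 7121 - 6993 == 205 - 77 == 128
```

On the toy data both differences come out equal, which is the same relationship that ties 7121, 6993, 205, and 77 together in the real database.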
==Work conducted on 12/10==
*The presentation for the project was worked on
*I checked in with Anu, Kevin, and Veronica regarding the future deliverables
*I helped in making the readme file deliverable
==Work conducted on 12/11==
*The final [[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|testing report]] was reviewed again, cleaned up, and slightly modified with the findings on 12/8 in mind.
*Further work was done on the presentation slides
*It was settled upon that the final commands that will be utilized for the presentation are:
**<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>, and <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code>.
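To see how the ID pattern behaves, here is a rough Python illustration. It is not a reimplementation of XMLPipeDB Match, and the sample string is made up for the example; only the pattern is copied from the commands above. The sketch assumes that Python's <code>re</code> module treats these basic constructs the same way Match and Postgres do.

```python
import re

# Pattern copied from the commands above. Note that the comma inside
# [A-Z,a-z] is a literal member of the character class, and that the
# pattern is not anchored, so it can match inside longer text.
ID_PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")


def unique_ids(text):
    """Return the set of distinct substrings that match the ID pattern."""
    return set(ID_PATTERN.findall(text))


# Hypothetical text, not taken from the UniProt XML file.
sample = "BCAL0001 BCAM0001 BCAS0001 pBCA001 BCAL0001 XBCA999,"
ids = unique_ids(sample)
print(sorted(ids))
```

The repeated <code>BCAL0001</code> is counted once, while the fragment <code>BCA999,</code> (trailing comma included) is picked up from inside unrelated text. Accidental matches of this kind are one way a raw pattern count can run slightly higher than the number of real gene IDs.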
==Work conducted on 12/14==
*Met with the group and reviewed the presentation slides (made some minor alterations) and practiced the presentation
*Checked the final readme and finalized it
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

Blitvak Week 15

2015-12-14T06:36:51Z

Blitvak: added work done so far, added template


GENialOMICS

2015-12-14T05:32:01Z

Blitvak: added goals/progress, deliverables link

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* We understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* We understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to any additional versions (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
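The naming rules above can be sketched as a small Python helper. This is only an illustration: the function names are hypothetical, and keeping the file extension at the end of the name (after any version suffix) is an assumption, since the rules above do not say where the extension goes.

```python
import datetime
import os


def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd for a file used in a weekly project.

    version 0 is the first file of the day; later versions get "(1)", "(2)", ...
    Assumption: any file extension is kept at the very end of the new name.
    """
    stem, ext = os.path.splitext(original)
    name = f"{stem}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name + ext


def originals_archive(initials, week):
    """Name of the zip holding all original, unmodified files for the week."""
    return f"ORIG_GEN_{initials}{week}.zip"


# Example: three versions of the same file from Week 11 (date is illustrative).
day = datetime.date(2015, 11, 12)
for version in range(3):
    print(project_filename("XXXX", "BL", 11, day, version))
print(originals_archive("BL", 11))
```

A helper like this would mainly guard against small slips, such as a transposed date or a missing version suffix, when several files are renamed by hand on the same day.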
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
* [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]]
* [[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized TallyEngine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (which explained why they weren't captured by GenMAPP Builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is to do a little more planning for how we will attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has failed outright, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al's paper. However, they had the same directions and generally saw the same relative trends (ie the relatively higher fold changes in the paper were among the higher in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that results in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we should perhaps consider raising our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al's paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason had to do with the fact that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that happened to work fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Week 15=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Fix any problems with the build
* Finish presentation
* Start paper
|
* Work with Anu to fix any issues with the final gene database
* Finish presentation
* Prepare for presentation
* Work on final paper
* Conduct final checks on last export (check testing report, make any adjustments)
|
''Work with Kevin Wyllie''
|
''Work with Veronica Pacheco''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|-
!scope="row"|'''Progress'''
|
* Created a new build to fix problem in TallyEngine
|
* Checked final export: everything appears in order
* Checked new build of GenMAPP builder
* Realized why the UniProt count of 6994 is so different from the one reported by our database (difference between protein entry/gene name)
|
|
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 15]]
|
*[[Blitvak Week 15]]
|
*[[Vpachec3 Week 15]]
|
*[[Kevin Wyllie Week 15]]
|-
!scope="row"|'''Files Used/Created'''
|
[[Media:ReadMe_Bc-Std_GEN_Build4_20151204.doc.zip | readMe_Bc-Std_GEN_Build4_20151204.doc.zip]] 
[[Media:gmbuilder-genialomics-20151210-build-5.zip | gmbuilder-genialomics-20151210-build-5.zip]] 
[[Media:Genialomics-DatabaseSchema-20151211.pdf | Genialomics-DatabaseSchema-20151211.pdf]]
|
|
|
* Generated sanity check table.
* Upload deliverables.
|}

==Other Progress==

=Deliverables=
*See [[GENialOMICS Deliverables|GENialOMICS Deliverables]]
=== Group Deliverables ===
* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first accessed this article through Web of Science, which LMU pays for; however, access through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-12T04:30:41Z

Blitvak: fixed an earlier typing error; replaced used commands with more condensed versions

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: The file was exported without any major issues; however, the export took significantly longer than the previous exports. It is likely that the export took so long because the workstation had, for some period of time, entered "sleep" mode (the export was delayed until the computer was taken out of "sleep").

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs (which were identified in the [[Blitvak Week 14|Week 14 assignment]]). Barring those 5 IDs, the results by XMLPipeDB Match line up with what TallyEngine reports (since the Match query grabs ORF data).
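As an illustration of why a pattern like this can pick up a few accidental matches, the pattern from the Match command can be tried in Python's <code>re</code> module. This is only a sketch: the sample text is hypothetical (it is not content from the UniProt XML file), and XMLPipeDB Match may tokenize or count matches differently. It also shows that the comma inside <code>[A-Z,a-z]</code> is treated as a literal character.

```python
import re

# Pattern copied from the Match command. Inside a character class the comma
# is a literal, so [A-Z,a-z] also accepts "," as the optional last character.
PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")


def unique_matches(text):
    """Return the set of distinct substrings matched anywhere in the text."""
    return set(PATTERN.findall(text))


# Hypothetical snippet standing in for the UniProt XML; not real file content.
sample = (
    '<name type="ORF">BCAL0001</name>'
    '<name type="ORF">BCAM0005</name>'
    '<name type="ORF">pBCA001</name>'
    '<name type="ORF">BCAL0001</name>'
    '<text>unrelated ABCA123 string</text>'
)

found = unique_matches(sample)
print(len(found), sorted(found))
```

In this made-up sample, the substring <code>BCA123</code> inside unrelated text is counted alongside the three real-looking IDs, which is the same kind of effect as the 5 discrepant matches described above.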

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are now the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.
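The logic of the two queries can be sketched in Python for readers without a PostgreSQL instance at hand. The PostgreSQL <code>~</code> operator is a case-sensitive, unanchored regular expression match, which <code>re.search</code> approximates for simple patterns like these. The rows below are hypothetical stand-ins for the <code>genenametype</code> table, and <code>count_matching</code> is an illustrative helper, not part of XMLPipeDB or GenMAPP Builder.

```python
import re

# Hypothetical (type, value) rows standing in for the genenametype table.
ROWS = [
    ("ORF", "BCAL0001"),
    ("ORF", "BCAM0005"),
    ("ORF", "pBCA055"),
    ("ordered locus", "BceJ2315_0001"),
    ("primary", "dnaA"),
]

ORF_PATTERN = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?"
ORDERED_LOCUS_PATTERN = r"BceJ2315_[0-9][0-9][0-9][0-9]"


def count_matching(rows, name_type, pattern):
    """Mimic: select count(*) ... where type = name_type and value ~ pattern."""
    regex = re.compile(pattern)
    return sum(
        1 for row_type, value in rows
        if row_type == name_type and regex.search(value)
    )


orf_count = count_matching(ROWS, "ORF", ORF_PATTERN)
ordered_locus_count = count_matching(ROWS, "ordered locus", ORDERED_LOCUS_PATTERN)
print(orf_count, ordered_locus_count)
```

Filtering on the <code>type</code> column is what keeps the "ORF" and "ordered locus" counts separate, which is the distinction the Build 4 TallyEngine results now report.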

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data (and any difference in the functionality of GenMAPP builder).
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: It was noticed that the OriginalRowCounts table in this export is identical to the one that came from the Build 3 export. This seems to suggest that the only fundamental difference between the two builds of GenMAPP builder lies with TallyEngine (this makes sense, considering that build 4 focused upon fixing problems with TallyEngine and improper code).
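The comparison above was done by eye from the two screenshots. If the OriginalRowCounts tables were exported as table-name/row-count pairs, the same check could be sketched as follows. The row counts here are placeholders chosen for illustration, not values from either export, and <code>diff_row_counts</code> is a hypothetical helper.

```python
def diff_row_counts(benchmark, candidate):
    """Return {table: (benchmark_count, candidate_count)} for tables that differ.

    A table missing from one side is reported with None for that side.
    """
    differences = {}
    for table in sorted(set(benchmark) | set(candidate)):
        old = benchmark.get(table)
        new = candidate.get(table)
        if old != new:
            differences[table] = (old, new)
    return differences


# Placeholder counts for illustration only; not taken from the real exports.
build3 = {"OrderedLocusNames": 10, "UniProt": 20, "RefSeq": 30}
build4 = {"OrderedLocusNames": 10, "UniProt": 20, "RefSeq": 30}

print(diff_row_counts(build3, build4))
```

An empty result corresponds to the situation reported here, where the Build 3 and Build 4 tables are identical.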

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
*In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table appears to not have any problems. The ordered locus names table, like in Build 3, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build4_20151204.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset''; the file selected was ''GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened in Excel and the exceptions were found to be identical to those seen with the Build 3 export.
====Exceptions Analysis====
*'''Note''': This analysis is sourced from the [[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export]]
*The EX.txt file was opened in Excel, and the error code for all of the exceptions was found to be: ''Gene not found in OrderedLocusNames or any related system.'' The Gene IDs were sorted by error and the problematic IDs were analyzed. Using the find function, it was found that 101 of the exceptions were due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions were found (via UniProt KB searches) to represent genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
**BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
**BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
**BCAM0787: No results in UniProt KB. No results in MOD.
**BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
**BCASr0743a: No results in UniProt KB. No results in MOD.
*Note: The exceptions file contained error-inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names with unusual formatting (BCAL0563_J_0 and BCAL0563_J_1, for example) were found to represent genes that are covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching the "fixed" gene names).
*[[Media:Exceptions_analysis_from_GenMAPP_builder_GEN_BL15_20151211.xlsx|Excel Workbook utilized in visualizing the GenMAPP exceptions]]
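The sorting of exceptions was done by hand in Excel. A rough sketch of the same split in Python is shown below, using only the example IDs named in this section. It assumes that the unusual names always end in a suffix of the form <code>_J_</code> followed by digits, which is inferred from the two examples above and may not cover all 101 reformatted names; <code>normalize</code> and <code>split_exceptions</code> are hypothetical helpers.

```python
import re

# Assumed suffix form, inferred from BCAL0563_J_0 and BCAL0563_J_1 only.
SUFFIX = re.compile(r"_J_[0-9]+$")


def normalize(gene_id):
    """Strip the assumed _J_<digits> suffix, leaving the usual gene name."""
    return SUFFIX.sub("", gene_id)


def split_exceptions(gene_ids):
    """Separate IDs with the unusual suffix from IDs in the usual format."""
    reformatted = {}
    usual_format = []
    for gene_id in gene_ids:
        fixed = normalize(gene_id)
        if fixed != gene_id:
            reformatted[gene_id] = fixed
        else:
            usual_format.append(gene_id)
    return reformatted, usual_format


# Example IDs taken from the exceptions discussed in this section.
exceptions = [
    "BCAL2591", "BCALr0080", "BCAM0787", "BCAM1951", "BCASr0743a",
    "BCAL0563_J_0", "BCAL0563_J_1",
]

reformatted, usual_format = split_exceptions(exceptions)
print(reformatted)
print(usual_format)
```

The "fixed" names in <code>reformatted</code> are the ones that could then be searched in UniProt KB or the MOD, while the IDs in <code>usual_format</code> still need to be looked up individually.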
===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6994 entries corresponding to J2315 proteins were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. The count of 7121 genes represented by the exported database appears to make sense in light of the data in the MOD and in UniProt. Since UniProt is protein-centric, the count of 6994 corresponds only to proteins; it is likely that some proteins have several related gene names (which would explain why more gene names were found than proteins). The MOD was found, earlier, to be manually curated, and it is possible that the difference between the MOD count of 7114 and the found count of 7121 is due to the MOD missing a few genes (that are present in other databases, like UniProt).
*Note: The IDs and counts covered by this export appear to be consistent with outside resources.
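For reference, the size of each gap discussed above can be computed directly from the three reported counts. The dictionary keys are descriptive labels chosen for this sketch.

```python
# Counts as reported in this section.
counts = {
    "gdb_gene_ids": 7121,
    "uniprot_protein_entries": 6994,
    "mod_coding_sequences": 7114,
}

reference = counts["gdb_gene_ids"]

# How many more gene IDs the exported database holds than each outside resource.
differences = {
    name: reference - value
    for name, value in counts.items()
    if name != "gdb_gene_ids"
}

print(differences)
```

The exported database therefore holds 127 more gene IDs than UniProt KB has protein entries, and 7 more than the MOD has coding sequences.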
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-12T04:27:06Z

Blitvak: added more detail to the exceptions analysis and to the outside resource section

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: The file was exported without any major issues; however, the export took significantly longer than the previous exports. It is likely that the export took so long because the workstation had, for some period of time, entered "sleep" mode (the export was delayed until the computer was taken out of "sleep").

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs (which were identified in the [[Blitvak Week 14|Week 14 assignment]]). Barring those 5 IDs, the results by XMLPipeDB Match line up with what TallyEngine reports (since the Match query grabs ORF data).

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are now the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data (and any difference in the functionality of GenMAPP builder).
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: It was noticed that the OriginalRowCounts table in this export is identical to the one that came from the Build 3 export. This seems to suggest that the only fundamental difference between the two builds of GenMAPP builder lies with TallyEngine (this makes sense, considering that build 4 focused upon fixing problems with TallyEngine and improper code).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
*In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table appears to not have any problems. The ordered locus names table, like in Build 3, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build4_20151204.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset'' > ''GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched for in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened through Excel and it was found out that the exceptions were identical to what was found with the Build 3 export.
====Exceptions Analysis====
*'''Note''': This analysis is sourced from the [[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export]]
*The EX.txt file was opened through Excel and it was found out that the error code for all of the exceptions was: ''Gene not found in OrderedLocusNames or any related system.'' The Gene IDs were sorted by error and the problematic IDs were analyzed. It was found, through the find function, that 101 of the exceptions were due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions, it was found (via UniProt KB searches), represented genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
**BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
**BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
**BCAM0787: No results in UniProt KB. No results in MOD.
**BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
**BCASr0743a: No results in UniProt KB. No results in MOD.
*Note: The exceptions file contained error-inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names that contained unusual formatting (BCAL0563_J_0 and BCAL0563_J_1, for example) were found to represent genes that were covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching the "fixed" gene names).
*[[Media:Exceptions_analysis_from_GenMAPP_builder_GEN_BL15_20151211.xlsx|Excel Workbook utilized in visualizing the GenMAPP exceptions]]
===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to J2315 proteins were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. The count of 7121 genes that is represented by the exported database appears to make sense, in light of the data represented by the MOD and by UniProt. Since UniProt is protein-centric, the count of 6994 corresponds to only protein; it is likely that some proteins have several related gene names (which explains the reason why more gene names were found than proteins). The MOD was found, earlier, to be manually curated and it is possible that the difference between the MOD count of 7114 and the found count of 7121 is due to the MOD missing a few genes (that are present in other databases, like UniProt).
*Note: The IDs and counts covered by this export appear to be consistent with outside resources.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 3 Export)

2015-12-12T04:17:45Z

Blitvak: fleshed out exceptions analysis section

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-3.zip|GenMAPP Builder Custom, Build 3]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.99 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.77 minutes
* Time taken to process: 4.06 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.05 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|Bc-Std_GEN_Build3_20151203.gdb]]
* Time taken to export: 4 hours 37 minutes
** Start time: 7:24 pm
** End time: 12:01 am
** Note: The file was exported without any major issues; however, the export took even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took a little over 2 hours longer than the previous export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build3tallyengine results GEN BL14 20151203.png]]
**Note: These results are identical to what was found in the initial export and in the export involving the second build of the modified GenMAPP Builder (see [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 testing report]]). Since GenMAPP Builder was modified, for Build 3, so that the gene names would be collected from the ORF data rather than the ordered locus data, it appears that there are errors in the program that are preventing it from properly collecting and taking into account the "ORF" data that resides in the XML file.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match
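Note: The counting of unique pattern matches can be illustrated with a minimal Python sketch. This is not how XMLPipeDB match is implemented; it only shows the same kind of count on a made-up sample string. The function name <code>unique_matches</code>, the sample text, and the "tight" pattern are assumptions added for illustration. In standard regular expression syntax, the commas and the space inside the square brackets are literal characters, so <code>[L,S,M]</code> also accepts a comma and <code>[A-Z, a-z]</code> also accepts a space.

```python
import re

# Pattern exactly as written in the command above. Inside [...] the commas
# and the space are literal characters, so they can match too.
PATTERN_AS_WRITTEN = r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?"

# Tighter variant without the stray commas/space (an assumption about intent).
PATTERN_TIGHT = r"p?BCA[LSM]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?"


def unique_matches(pattern, text):
    """Return the set of distinct substrings of text matched by pattern."""
    return set(re.findall(pattern, text))


# Made-up sample text; it is NOT taken from the UniProt XML file.
sample = (
    '<name type="ORF">BCAL0001</name> '
    '<name type="ORF">BCAM0005</name> '
    "BCAL0001 BCA,123"
)

as_written = unique_matches(PATTERN_AS_WRITTEN, sample)
tight = unique_matches(PATTERN_TIGHT, sample)

print(len(as_written), sorted(as_written))
print(len(tight), sorted(tight))
```

In this sample the pattern as written also picks up <code>BCA,123</code> and a copy of BCAL0001 with a trailing space, while the tighter pattern reports only the two gene IDs.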

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because these TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as being "ordered locus". The match command was found to reflect the data that is found under the "ORF" tag; the ORF and ordered locus counts are both very different, and this is reflected in the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Once again, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data. TallyEngine was modified to focus upon the "ORF" data, however, it appears that there are issues that are preventing it from doing so.

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data.
*Benchmark .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|compressed Bc-Std_Build2_GEN_BL14_20151201.gdb]]
*'''OriginalRowCounts for the Build 2 export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
Note: The OriginalRowCounts table in this export is mostly identical to the one found through the Build 2 export. However, there are differences in the OrderedLocusNames table between the two exports. The recent export, the Build 3 export, contained 7121 rows in the OrderedLocusNames table (which indicates 7121 entries, the same as the number of ORF gene names in the XML), while the last export, the Build 2 export, contained 337 rows in the OrderedLocusNames table. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP Builder (Build 3) is now focusing on the ORF data; it appears, however, that it is now labeling the "ORF" data as OrderedLocusNames instead of the "ordered locus" data. The observation in the OriginalRowCounts table does not completely mesh with what was found earlier in the PSQL database. In the PSQL database, it was found that the OrderedLocusName data was still the "ordered locus" gene names that reside in the XML (and the "ORF" data are the 7121 gene names of interest). In conclusion, it appears that there are still some problems with TallyEngine and GenMAPP Builder (such as TallyEngine not reporting the ORF data).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, as before, only gene names of the type "ordered locus" are represented (there are no gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table now only contains gene names of the form <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build3_20151203.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset'' > ''GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this new error count is significantly smaller than the 7251 errors that were detected using the previously exported database. Since this export incorporates the ORF data, it appears that the majority of the genes present in the microarray dataset are covered.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched for in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened through Excel and it was found out that the error code for all of the exceptions was: ''Gene not found in OrderedLocusNames or any related system.'' The Gene IDs were sorted by error and the problematic IDs were analyzed. It was found, through the find function, that 101 of the exceptions were due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions, it was found (via UniProt KB searches), represented genes that are not present in the UniProt database. Several exceptions (BCAL2591, BCALr0080, BCAM0787, BCAM1951, BCASr0743a) were checked for their presence in UniProt KB or in the MOD:
**BCAL2591: No results in UniProt KB. Found in MOD; gene has no product.
**BCALr0080: No results in UniProt KB. Found in MOD; product: tRNA-Arg.
**BCAM0787: No results in UniProt KB. No results in MOD.
**BCAM1951: No results in UniProt KB. Found in MOD; gene has no product.
**BCASr0743a: No results in UniProt KB. No results in MOD.
*Note: The exceptions file contained error-inducing genes that either lack a known product (protein/functional RNA), lack a MOD entry, or code for functional RNA (such as tRNA). Some gene names that contained unusual formatting (BCAL0563_J_0 and BCAL0563_J_1, for example) were found to represent genes that were covered by the MOD/UniProt (these entries were found by removing the unusual underscores/letters and searching the "fixed" gene names).
*[[Media:Exceptions_analysis_from_GenMAPP_builder_GEN_BL15_20151211.xlsx|Excel Workbook utilized in visualizing the GenMAPP exceptions]]
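Note: The sorting of exception IDs described above was done by hand in Excel. A rough Python sketch of the same idea is given here, under stated assumptions: the regular expression <code>USUAL_ID</code> is a guess at the "usual" ID shape based only on the examples in this report, the suffix pattern <code>_J_</code> plus digits is based only on BCAL0563_J_0 and BCAL0563_J_1, and the helper names <code>split_exceptions</code> and <code>normalize</code> are hypothetical.

```python
import re

# Assumed shape of a "usual" gene ID, inferred from examples in this report
# (BCAL2591, BCALr0080, BCASr0743a); not an official specification.
USUAL_ID = re.compile(r"^p?BCA[LMS]r?[0-9]{4}[a-z]?$")

# Assumed shape of the unusual suffix seen in BCAL0563_J_0 / BCAL0563_J_1.
SUFFIX = re.compile(r"_J_[0-9]+$")


def normalize(gene_id):
    """Strip the assumed '_J_<number>' suffix to get a searchable ID."""
    return SUFFIX.sub("", gene_id)


def split_exceptions(ids):
    """Split IDs into (usual format, unusual format), keeping input order."""
    usual = [g for g in ids if USUAL_ID.match(g)]
    unusual = [g for g in ids if not USUAL_ID.match(g)]
    return usual, unusual


# IDs quoted in the exceptions analysis above.
ids = [
    "BCAL2591", "BCALr0080", "BCAM0787", "BCASr0743a",
    "BCAL0563_J_0", "BCAL0563_J_1",
]

usual, unusual = split_exceptions(ids)
fixed = sorted({normalize(g) for g in unusual})

print("usual format:", usual)
print("unusual format:", unusual)
print("IDs to search after fixing:", fixed)
```

The "fixed" IDs would then be searched manually in UniProt KB or the MOD, as was done for this report.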
===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein-encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the count of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which are not covered by UniProt).

*Note: The exported database now seems more in-line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusName counts (which actually represents ORF counts) seem very close to the counts expressed by the MOD and by UniProt.
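Note: The size of the gaps between these counts can be checked with a few lines of Python. This is only arithmetic on the numbers reported above; the dictionary keys and variable names are labels chosen for illustration.

```python
# Counts as reported in this section.
counts = {
    "gdb OrderedLocusNames (ORF data)": 7121,
    "UniProt KB entries": 6994,
    "MOD coding sequences": 7114,
    "ordered locus gene names": 337,
}

gdb = counts["gdb OrderedLocusNames (ORF data)"]

# Difference between the exported database count and every other count.
differences = {
    name: gdb - n for name, n in counts.items()
    if name != "gdb OrderedLocusNames (ORF data)"
}

for name, diff in differences.items():
    print(f"{name}: differs from the gdb count by {diff}")
```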
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

File:Exceptions analysis from GenMAPP builder GEN BL15 20151211.xlsx

2015-12-12T04:15:20Z

Blitvak: file was from week14, forgot to upload it then; will be included in build 3/4 gene database export report

file was from week14, forgot to upload it then; will be included in build 3/4 gene database export report

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-12T04:03:53Z

Blitvak: fixed typos and adjusted the exceptions file analysis section

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: The file was exported without any major issues; however, the export took significantly longer than the previous exports. It is likely that the export took so long because the workstation had, for some period of time, entered "sleep" mode (the export was delayed until the computer was taken out of "sleep").

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs (which were identified in the [[Blitvak Week 14|Week 14 assignment]]). Barring those 5 IDs, the results by XMLPipeDB Match line up with what TallyEngine reports (since the Match query grabs ORF data).
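Note: The discrepant IDs were originally found with Excel. The same comparison can be sketched in Python as a set difference, assuming the Match output and the database IDs are each available as a list of strings. The function name <code>discrepant_ids</code> and the sample lists are hypothetical; the real lists are not reproduced here.

```python
def discrepant_ids(match_ids, db_ids):
    """Return IDs reported by the pattern match but absent from the database."""
    return sorted(set(match_ids) - set(db_ids))


# Hypothetical sample values for illustration only. The real lists would come
# from the XMLPipeDB match output and from the genenametype table.
match_ids = ["BCAL0001", "BCAL0002", "BCAM0005", "BCA123"]
db_ids = ["BCAL0001", "BCAL0002", "BCAM0005"]

extra = discrepant_ids(match_ids, db_ids)
print(extra)

# The reported totals differ by the number of discrepant matches.
reported_match_total = 7126
reported_orf_total = 7121
print(reported_match_total - reported_orf_total)
```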

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are now the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.
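Note: The logic of the two queries above can be mimicked in Python on a small in-memory table, as a sketch only. The PostgreSQL <code>~</code> operator is a case-sensitive regular expression match that succeeds if the pattern matches any part of the value, which corresponds to <code>re.search</code> here. The rows below are made up, and the function names are hypothetical. Note that <code>count(*)</code> counts rows, so repeated values are counted more than once.

```python
import re

# Made-up (type, value) rows standing in for the genenametype table.
rows = [
    ("ORF", "BCAL0001"),
    ("ORF", "BCAM0005"),
    ("ORF", "BCAL0001"),            # repeated value, counted again by count(*)
    ("ordered locus", "BceJ2315_0001"),
    ("ORF", "hypothetical"),        # does not match the ORF pattern
]

ORF_PATTERN = r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?"
LOCUS_PATTERN = r"BceJ2315_[0-9][0-9][0-9][0-9]"


def count_rows(rows, name_type, pattern):
    """Mimic: select count(*) ... where type = name_type and value ~ pattern."""
    rx = re.compile(pattern)
    return sum(1 for t, v in rows if t == name_type and rx.search(v))


def count_distinct(rows, name_type, pattern):
    """Same filter, but counting distinct values instead of rows."""
    rx = re.compile(pattern)
    return len({v for t, v in rows if t == name_type and rx.search(v)})


orf_rows = count_rows(rows, "ORF", ORF_PATTERN)
orf_distinct = count_distinct(rows, "ORF", ORF_PATTERN)
locus_rows = count_rows(rows, "ordered locus", LOCUS_PATTERN)

print(orf_rows, orf_distinct, locus_rows)
```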

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data (and any difference in the functionality of GenMAPP builder).
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: The OriginalRowCounts table in this export is identical to the one that came from the Build 3 export. This suggests that the only fundamental difference between the two builds of GenMAPP Builder lies with TallyEngine (this makes sense, considering that Build 4 focused on fixing problems with TallyEngine and improper code).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, as before, only gene names of the type "ordered locus" are represented (there are no gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table, as in Build 3, only contains gene names of the form <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build4_20151204.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset'' > ''GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched for in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened in Excel, and the exceptions were found to be identical to those seen with the Build 3 export.

===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*The "Browse" button was clicked to name the results file that would be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein-encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. The count of 7121 genes represented by the exported database is very close to the values reported by these external databases. It is likely that all of the genes covered by UniProt appear within the database. It is not known whether all of the coding sequences covered by the MOD appear within the database, as some of the coding sequences represent hypothetical protein-encoding genes or functional RNA.

*Note: The IDs and counts covered by this export appear to be consistent with outside resources.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

Blitvak Week 15

2015-12-10T23:54:07Z

Blitvak: added for 12/10

==12/8==
7121 - orf

6993 - uniprot

128 - diff

205 - query for duplicates

77 - number of records

128 + 77 = 205

*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid;</code>

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1;</code>

*<code>select sum(dupe_count) from (select genetype_name_hjid, count(value) as dupe_count
from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>

*<code>select genetype_name_hjid, count(value) as dupe_count
from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
group by genetype_name_hjid having count(value) > 1 order by count(value) desc;</code>

*<code>select *
from genenametype where genetype_name_hjid = 66138;</code>

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1 order by count(value) desc;</code>
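
The arithmetic in the notes above can be re-checked with a short sketch. The figures (7121, 6993, 205, 77) are the ones recorded on this page; reading the 205 as ORF values that sit on 77 records carrying more than one ORF name is an interpretation of the notes. The toy rows and the helper name <code>tally_duplicates</code> are made up for illustration and only mirror the <code>group by ... having count(value) > 1</code> queries in plain Python.

```python
from collections import Counter

def tally_duplicates(rows):
    """rows: iterable of (genetype_name_hjid, value) pairs.
    Returns (records with more than one value, total values on those records),
    mirroring `group by genetype_name_hjid having count(value) > 1`."""
    counts = Counter(hjid for hjid, _ in rows)
    dupes = {hjid: n for hjid, n in counts.items() if n > 1}
    return len(dupes), sum(dupes.values())

# Toy rows (hypothetical hjids; not data from the project database).
toy_rows = [(1, "BCAL0001"), (2, "BCAL0002"), (2, "BCAL0003"),
            (3, "BCAM0005"), (3, "BCAM0006"), (3, "BCAM0007")]
records, values = tally_duplicates(toy_rows)
print(records, values)  # 2 records carry 5 values, i.e. 3 extra names

# Figures recorded in the notes.
orf_ids, uniprot_records = 7121, 6993
dupe_values, dupe_records = 205, 77
print(orf_ids - uniprot_records)   # difference between ORF IDs and UniProt records
print(dupe_values - dupe_records)  # extra names on records with several ORF names
```

Both differences come out equal, which is the relationship the line <code>128 + 77 = 205</code> records.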

==For Presentation==

Final command for Match:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
Final command for Postgres:
*<code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?';</code>
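
A note on the bracketed sets in the pattern: inside square brackets a comma (or a space) is itself a member of the set, so <code>[L,S,M]</code> also accepts a literal comma. The sketch below uses Python's <code>re</code> module, not the XMLPipeDB Match or PostgreSQL regex engines that were actually used (those may differ in details), to compare the pattern as written with a comma-free variant. The variant and the names <code>CLEANED</code> and <code>full_match</code> are assumptions for illustration; the sample IDs are the ones tried in GeneFinder.

```python
import re

# Pattern as written in the Match command (commas and a space inside the brackets).
AS_WRITTEN = r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?"
# Hypothetical comma-free variant (an assumption, not a command from the project).
CLEANED = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?"

SAMPLE_IDS = ["BCAL0001", "BCAL0002", "BCAM0005", "BCAS0105"]

def full_match(pattern, text):
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

for gene_id in SAMPLE_IDS:
    print(gene_id, full_match(AS_WRITTEN, gene_id), full_match(CLEANED, gene_id))

# A comma in the position of the replicon letter is accepted only by the
# pattern as written, because "," is a member of [L,S,M].
print(full_match(AS_WRITTEN, "BCA,0001"), full_match(CLEANED, "BCA,0001"))
```

Both patterns accept the four sample IDs, so the commas do not cause misses; they only widen what counts as a match, which may be worth keeping in mind when a text search returns a few more hits than expected.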

Blitvak Week 15

2015-12-08T23:30:16Z

Blitvak: added info regarding the duplicate counts

==12/8==
7121 - orf

6993 - uniprot

128 - diff

205 - query for duplicates

77 - number of records

128 + 77 = 205

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid;</code>

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1;</code>

*<code>select sum(dupe_count) from (select genetype_name_hjid, count(value) as dupe_count
from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
group by genetype_name_hjid having count(value) > 1 order by count(value) desc) as dupe_tally;</code>

*<code>select genetype_name_hjid, count(value) as dupe_count
from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?'
group by genetype_name_hjid having count(value) > 1 order by count(value) desc;</code>

*<code>select *
from genenametype where genetype_name_hjid = 66138;</code>

*<code>select genetype_name_hjid, count(value) from genenametype where type = 'ORF' and value ~ 'p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?' group by genetype_name_hjid having count(value) > 1 order by count(value) desc;</code>

GENialOMICS

2015-12-08T07:34:40Z

Blitvak: added files for week 14

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; the MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
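
The naming rules above can be sketched as a small helper. This is only an illustration: the function names <code>project_filename</code> and <code>original_archive_name</code> are hypothetical, and placing the suffix (and any version number) before the file extension is an assumption based on filenames such as <code>uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml</code> used elsewhere in the project.

```python
import os
from datetime import date

def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, keeping the file extension.
    version=0 is the first file of the day; 1, 2, ... append "(1)", "(2)", ...
    Placing the suffix before the extension is an assumption."""
    stem, ext = os.path.splitext(original)
    name = f"{stem}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name + ext

def original_archive_name(initials, week):
    """Name of the zip holding the unmodified originals: ORIG_GEN_(Initials)(Week#)."""
    return f"ORIG_GEN_{initials}{week}.zip"

print(project_filename("uniprot-taxonomy%3A216591.xml", "BL", 12, date(2015, 11, 19)))
print(project_filename("XXXX", "BL", 11, date(2015, 11, 12), version=2))
print(original_archive_name("BL", 12))
```

Keeping the date in yyyymmdd form means that files with the same stem sort chronologically by name.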
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database exports with the second build and any other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized TallyEngine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (which explained why they weren't captured by GenMAPP Builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we will attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed to work, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that result in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double-checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason was that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that worked fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients. Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''Well I first looked at this article through web of science which LMU does pay for but looking at the article through PubMed, PubMed central and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? Yes.
* Do the authors own the copyright? No.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

File:GEN BL14 20151203.zip

2015-12-08T07:34:24Z

Blitvak: for 12/03/15, week 14 files, compressed

for 12/03/15, week 14 files, compressed

File:GEN BL14 20151204.zip

2015-12-08T07:34:08Z

Blitvak: for 12/04/15, week 14 files, compressed

for 12/04/15, week 14 files, compressed

File:GEN BL14 20151207.zip

2015-12-08T07:30:41Z

Blitvak: for 12/07/15, week 14 files, compressed

for 12/07/15, week 14 files, compressed

File:GEN BL14 20151201.zip

2015-12-08T07:26:52Z

Blitvak: part 1 of week 14 files, compressed

part 1 of week 14 files, compressed

GENialOMICS

2015-12-08T07:23:48Z

Blitvak: added links to the testing reports for week 14

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, a positive integer in parentheses (starting from 1) will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
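
The naming rules above can be sketched in Python. This is only an illustration, not a tool the team used: the helper names <code>gen_filename</code> and <code>orig_archive_name</code> are hypothetical, and placing the version counter before the file extension is an assumption, since the rule does not say where the extension goes.

```python
from datetime import date
from pathlib import Path


def gen_filename(original: str, initials: str, week: int,
                 used_on: date, version: int = 0) -> str:
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, keeping the original extension."""
    path = Path(original)
    # The first file of the day has no counter; later versions get "(1)", "(2)", ...
    counter = f"({version})" if version > 0 else ""
    # Assumption: the counter is placed before the extension.
    return f"{path.stem}_GEN_{initials}{week}_{used_on:%Y%m%d}{counter}{path.suffix}"


def orig_archive_name(initials: str, week: int) -> str:
    """Name of the zip holding all original, unmodified files for a week."""
    return f"ORIG_GEN_{initials}{week}.zip"


if __name__ == "__main__":
    print(gen_filename("uniprot-taxonomy%3A216591.xml", "BL", 12, date(2015, 11, 19)))
    print(orig_archive_name("BL", 12))
```

Generating the names from one function keeps the date format and the order of the parts identical across group members.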
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (which explained why they weren't captured by GenMAPP Builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}
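
The discrepancy check described in Brandon's progress column (regular-expression matches in the source text versus IDs in the exported database) can be sketched as follows. This is a Python stand-in, not XMLPipeDB Match or the Excel analysis that was actually used. The pattern is the general expression recorded with the Match and Postgres commands elsewhere on this wiki; the function name <code>discrepant_ids</code> and the sample strings are made up for illustration.

```python
import re

# General expression for the J2315 gene IDs, as recorded for Match/Postgres.
# Note: the comma inside [A-Z,a-z] is a literal comma in the character class.
ID_PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")


def discrepant_ids(source_text, database_ids):
    """Return strings the pattern matches in the text that are not database IDs."""
    matched = set(ID_PATTERN.findall(source_text))
    return matched - set(database_ids)


# Made-up sample data: three ID-like strings plus one accidental match
# ("BCA123" inside the unrelated token "XBCA123").
sample_text = "BCAL0001 BCAM0002 pBCA001 XBCA123"
sample_db_ids = {"BCAL0001", "BCAM0002", "pBCA001"}

if __name__ == "__main__":
    print(discrepant_ids(sample_text, sample_db_ids))
```

A set difference of this kind lists the extra matches directly, which is the same question the Excel MATCH analysis answered.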

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e. the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that result in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.
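
To make the threshold question concrete, the Benjamini-Hochberg (BH) adjustment can be sketched as below. This is a generic illustration of the standard procedure, not the group's Excel workflow; the function name <code>bh_adjust</code> and the four p-values are made up for the example.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, threshold):
    """Count genes whose BH-adjusted p-value falls below the threshold."""
    return sum(p < threshold for p in bh_adjust(pvalues))


if __name__ == "__main__":
    made_up = [0.01, 0.04, 0.03, 0.20]  # illustrative values only
    for cutoff in (0.05, 0.10):
        print(cutoff, count_significant(made_up, cutoff))
```

Running the count at several cutoffs shows how strongly the fraction of "significant" genes depends on the chosen threshold.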

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the actual values differed greatly.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double-checked, we sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason was that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that worked fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't think of anything that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but looking at the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? Yes.
* Do the authors own the copyright? No.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in the Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GENialOMICS

2015-12-08T07:21:04Z

Blitvak: added reflection

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* we understood the experimental design
* made chart for microarray experiment
*finished powerpoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, a positive integer in parentheses (starting from 1) will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}
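
The Match-versus-database comparison described in Brandon's Week 14 progress can be sketched in Python. This is only an illustration of the idea, not XMLPipeDB Match or the Excel analysis that was actually used: the pattern is the one quoted in the project's Match and PostgreSQL commands, while the function names, the sample text, and the set of known IDs are made up for the example.

```python
import re
from collections import Counter

# Pattern as written in the project's Match/PostgreSQL commands.
ID_PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Z,a-z]?")


def unique_matches(text):
    """Count every distinct substring of text that the ID pattern matches."""
    return Counter(m.group(0) for m in ID_PATTERN.finditer(text))


def accidental_matches(text, known_ids):
    """Return matched strings that are not among the IDs stored in the database."""
    return sorted(set(unique_matches(text)) - set(known_ids))


if __name__ == "__main__":
    # Hypothetical snippet, not taken from the real UniProt XML file.
    sample = (
        '<name type="ORF">BCAL0001</name>'
        '<name type="ORF">BCAM0002</name>'
        "<text>Derived from clone BCA123.</text>"
    )
    known = {"BCAL0001", "BCAM0002"}  # hypothetical database contents
    print(accidental_matches(sample, known))
```

Any string the pattern matches in free text, but that is not stored as a gene ID, shows up in the returned list; this is the kind of discrepancy the Excel MATCH analysis was used to track down.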

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning regarding how we attack the writing and presentation portions of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. As for statistical significance, we should perhaps consider raising our BH P-value threshold above 0.05, as currently only about 8% of the genes show a significant change by our criteria. But maybe this is not too low a number.
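
To see how the choice of BH threshold changes the share of genes called significant, the standard Benjamini-Hochberg adjustment can be sketched in Python. This is a minimal sketch of the textbook step-up procedure, assuming a plain list of unadjusted p-values; it is not necessarily the exact formula used in the group's Excel workbook, and the function names and sample values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that an adjusted value
    # never exceeds the adjusted value of the next larger p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_significant(p_values, threshold=0.05):
    """Fraction of tests whose adjusted p-value falls below the threshold."""
    adjusted = benjamini_hochberg(p_values)
    return sum(p < threshold for p in adjusted) / len(adjusted)


if __name__ == "__main__":
    # Made-up p-values, not taken from the microarray dataset.
    example = [0.01, 0.04, 0.03, 0.20]
    print(benjamini_hochberg(example))
    print(fraction_significant(example, threshold=0.05))
    print(fraction_significant(example, threshold=0.10))
```

Running the same list at two thresholds shows directly how many more genes a looser cutoff would admit.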

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work, or didn't seem correct, was the fact that our fold-change values were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the actual values differed considerably.
#What will I do next to fix what didn't work?
#*Initially, we made sure our methods were correct. We retraced our steps and made sure the calculations were done correctly. After we double-checked, we sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason was that GenMAPP Builder was using the wrong type of gene name; this knowledge allowed us to create builds this week that worked fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the moment, I can't think of anything that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is available both in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first accessed this article through Web of Science, which LMU does pay for, but accessing the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GENialOMICS

2015-12-08T06:55:12Z

Blitvak: added progress for week 14

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to any additional versions (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
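
The naming rules above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names are hypothetical, the sample filename is made up, and placing the version number before the file extension is an assumption (the rules above do not say where the extension goes).

```python
from datetime import date
from pathlib import PurePath


def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, keeping the file extension.

    version 0 is the first copy used that day; 1, 2, ... append "(1)", "(2)", ...
    Putting the "(n)" before the extension is an assumption.
    """
    path = PurePath(original)
    suffix = f"({version})" if version > 0 else ""
    return f"{path.stem}_GEN_{initials}{week}_{day:%Y%m%d}{suffix}{path.suffix}"


def originals_archive_name(initials, week):
    """Name of the zip that bundles the unmodified original files."""
    return f"ORIG_GEN_{initials}{week}.zip"


if __name__ == "__main__":
    day = date(2015, 11, 19)
    print(project_filename("export.gdb", "BL", 12, day))
    print(project_filename("export.gdb", "BL", 12, day, version=1))
    print(originals_archive_name("BL", 12))
```

Generating names from one function keeps the date format and the version suffix consistent across group members.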
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning in regard to how we attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria, which result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we should potentially consider raising our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to show a significant change. But maybe this is not too low of a number.
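For reference, the Benjamini-Hochberg (BH) adjustment mentioned above can be sketched in a few lines of Python. This is only a minimal illustration of the standard procedure, not the Excel protocol the group actually used; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
# Genes called significant at a BH threshold of 0.05.
significant = [p <= 0.05 for p in example_adjusted]
print(example_adjusted, significant)
```

Raising the threshold above 0.05 only changes the final comparison; the adjusted values themselves do not depend on the cutoff.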

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but looking at the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Blitvak Week 14

2015-12-08T06:49:22Z

Blitvak: cleaned up the page

==Goals for Week 14==
*Consult with Anu to make modifications to TallyEngine/GenMAPP (share initial export results)
*Use Excel to track discrepant IDs (reference: [[Using Microsoft Excel to Compare ID Lists|Using Microsoft Excel to Compare ID Lists]])
*Conduct gene database exports for any modified versions of GenMAPP builder that are created
**Analyze any conducted exports and perform QA (quality assurance) work
==Initial Export Analysis==
===Overview of Week 12 findings===
*Using XMLPipeDB Match, 7127 unique matches were found that correlated with the OrderedLocusNames IDs outlined at the end of the [[Blitvak Week 12|week 12 assignment]]
*TallyEngine reported that 337 OrderedLocusNames were present in the XML and within the PSQL database
*Using <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code>, it was verified that 337 OrderedLocusNames entries were present in B.cenocepacia_J2315_20151119_gmb3build5.
*By looking at the data present in the genenametype table, it was found that the OrderedLocusName data was in the format of <code>BceJ2315_#####</code>
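One detail of the pattern used above is worth a quick check: inside a bracket expression, commas and spaces are literal characters, so <code>[A-Z, a-z]</code> also accepts a comma or a space. The Python sketch below illustrates this, assuming Python's <code>re</code> module treats these simple bracket expressions the same way as the PostgreSQL and Java engines used on this page. The <code>TIGHTER</code> variant and the test strings are illustrative and were not part of the original analysis.

```python
import re

# Pattern as written in the queries above (the comma and space sit inside the brackets).
AS_WRITTEN = re.compile(r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?")
# Hypothetical tightened variant: the same pattern without the literal comma and space.
TIGHTER = re.compile(r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Za-z]?")

# Illustrative strings; the trailing comma is not part of any gene ID.
plain_id = "BCAL0001"
id_with_comma = "BCAL0001,"

as_written_accepts_comma = AS_WRITTEN.fullmatch(id_with_comma) is not None
tighter_accepts_comma = TIGHTER.fullmatch(id_with_comma) is not None
print(as_written_accepts_comma, tighter_accepts_comma)
```

Both patterns accept well-formed IDs such as <code>BCAL0001</code>; they differ only on strings that end in a stray comma or space.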
===Steps taken for further analysis, conducted on 12/1===
*The UniProt XML file was opened, via [http://www.firstobject.com/dn_editor.htm first object XML editor], in order to investigate and verify the location/nature of the OrderedLocusName data.
*A data entry was selected and the related data was looked into:
[[File:XML_exploration_GEN_BL14_20151201.png]]
*In this entry, and in numerous others, it was noticed that only the gene name in the format of <code>BceJ2315_#####</code> was tagged as being of the "ordered locus" type. The format that was being focused upon in previous work, that of <code>p?BCA[M,S,L]###?[A,a]#[A-Z]?</code>, was labeled as being of the type "ORF". It was noticed that all entries that contained an "ordered locus" gene name also contained an "ORF" name for the same gene; most entries, additionally, lacked an "ordered locus" name and only contained an "ORF" name.
*GenMAPP builder, by default, is made to pick up and utilize the ordered locus data within the XML; it was realized that, with respect to the initial export, it was functioning properly. Since the XML data only contained 337 OrderedLocus names, only 337 made it to the database. Since 7127 matches were found, using XMLPipeDB Match, that correlated to an "ORF" name, it is assumed that most of the gene data is ignored by focusing on OrderedLocus names.
*[http://www.uniprot.org/uniprot/ UniProt KB] was referenced in order to further verify that all <code>BceJ2315_#####</code> gene names were coupled with one that was considered an "ORF" name
*A search query was conducted that consisted of <code>bcej2315 NOT gene:bca*</code>; this query, it was hoped, would show the number of gene entries that contained just an OrderedLocusName ID.
[[File:UniProt_Results_2_12.1_GEN_BL14_20151201.png]]
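The manual inspection described above (checking which <code>type</code> each gene name carries) can also be sketched programmatically. The Python example below tallies gene name types with the standard library's <code>xml.etree.ElementTree</code>. It assumes the usual UniProt XML layout (<code>entry</code> / <code>gene</code> / <code>name</code> elements in the <code>http://uniprot.org/uniprot</code> namespace); the embedded two-entry sample and its IDs are hypothetical stand-ins for the real file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-entry sample in the UniProt XML layout; IDs are illustrative.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <gene>
      <name type="ordered locus">BceJ2315_00010</name>
      <name type="ORF">BCAL0001</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="ORF">BCAM0002</name>
    </gene>
  </entry>
</uniprot>"""

NS = {"up": "http://uniprot.org/uniprot"}


def tally_gene_name_types(xml_text):
    """Count <gene><name> elements by the value of their type attribute."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for name in root.findall("./up:entry/up:gene/up:name", NS):
        counts[name.get("type")] += 1
    return counts


type_counts = tally_gene_name_types(SAMPLE)
print(dict(type_counts))
```

For the full proteome file, a streaming parser such as <code>ET.iterparse</code> would likely be a better fit than loading the whole document at once.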
===Discrepant ID analysis for the Initial Export, Conducted on 12/3===
*The PSQL database for the initial export was opened up and the SQL query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run in order to find the ORF counts within the database (since TallyEngine, in the build of GenMAPP builder in use at the time, did not incorporate the ORF data).
**It was found that 7121 entries were within the database that corresponded to ORF data. <code>select * from genenametype where type = 'ORF';</code> was run in order to observe the data; it was found that the data within the table corresponded to the gene name format of interest (<code>p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?</code>).
*However, it was found that there was a difference in count between what was previously reported by XMLPipeDB Match and by Postgres (7127 vs. 7121, a difference of 6 entries). It was realized that Excel should be utilized in order to track down the discrepant IDs.
====Using Excel to track down discrepant IDs====
*pgAdmin III was launched and the database for the initial export was opened.
*The SQL query <code>select * from genenametype where type = 'ORF' order by value</code> was utilized in order to put the data in ascending order (lower ID #s come first); the results of the query were then exported in a format that Excel can read (text file).
*Using the windows command line, through XMLPipeDB match, the 7127 unique XML entries that fit the criteria of ORF gene name were exported as a text file using <code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?" < uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml > MATCHIDs_GEN_BL14_20151203</code>
*Both files were opened with Excel and the proper settings were selected so that the gene name data ended up on its own column (for the Match data, a colon was selected as the divider between columns; PSQL data was comma separated).
*The column of IDs from the Match utility, and the one that was found in the PostgreSQL database, were put side by side in a new Excel document with no spaces between them; it was ensured that each column was in ascending order. The column of IDs from the Match utility was given the label of "MATCH IDs"; the one from the PSQL database was given the label of "IDs FROM postgreSQL".
*Two new columns were created to the right of the ID columns; one was given the label of "MATCH: 1 to 2", the other that of "MATCH: 2 to 1". The plan, at this point, was to utilize Excel MATCH commands in order to compare the two sets of IDs with each other; it was hoped that these commands would indicate which IDs were present in one set but not in the other.
*MATCH commands were then written in the 2 MATCH columns and applied to the entirety of each MATCH column; the basic format is <code>=MATCH(VALUE TO LOOK-UP, RANGE/COLUMN WHERE THE LOOKING-UP OF A VALUE TAKES PLACE, "MATCH TYPE" [0 in this case])</code>. The purpose of these MATCH commands is to compare the two different ID lists with each other.
*'''ALL MATCH COMMANDS:'''
**Format - '''Column Label''' : <code>MATCH Command</code> (in first "cell")
**'''MATCH:1 to 2''' : <code>=MATCH(A2, B$2:B$7122, 0)</code>
**'''MATCH:2 to 1''' : <code>=MATCH(B2, A$2:A$7122, 0)</code>
**Note: In the analysis conducted, the IDs from XMLPipeDB Match were placed in column A, and the ones from the Postgres database were placed in column B. Values of "#N/A" appear in instances where an ID in one set was not found in the other.
*The Find function was used in Excel (via control + F) and the value '''#N/A''' was searched for ("Look In:" was set to Values). 6 instances of '''#N/A''' were found, which coheres with the difference of 6 that was found between the Match utility results and those of PSQL.
**The discrepant IDs are: bca199f, bca5253f, bca636c, bcad837b, bcal0235a, and bcal0239a
*bca199f, bca5253f, bca636c, and bcad837b were found to be a part of a sequence of letters and numbers under the label of "checksum"; these appeared to have been accidentally captured by the utilized Match command.
*bcal0235a and bcal0239a follow the previously identified gene name patterns; however, they both show up as database reference IDs (database references to STRING, which is a database of known and predicted protein interactions). These data will be ignored, as they do not fall under an entry that refers to a gene name.
**At this point, it was also realized that the commands utilized for Match and PSQL could use some adjustments with respect to the desired pattern. <code>p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?</code> was modified to <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?</code>; this new pattern was tested with XMLPipeDB Match using the command <code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < filename.xml</code>. The new pattern resulted in 7126 matches, which eliminates one discrepant and incorrect match.
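The two-way Excel MATCH comparison described in this section amounts to a set difference in each direction. The Python sketch below shows the same idea under a few assumptions: the function name is hypothetical, IDs are compared case-insensitively (as Excel's MATCH does), and the short lists are illustrative rather than the real 7127 and 7121 entries.

```python
def discrepant_ids(match_ids, db_ids):
    """Return (only_in_match, only_in_db), each sorted, compared case-insensitively."""
    match_set = {value.strip().lower() for value in match_ids}
    db_set = {value.strip().lower() for value in db_ids}
    return sorted(match_set - db_set), sorted(db_set - match_set)


# Illustrative lists: the first three IDs are hypothetical; bca199f is one of
# the discrepant IDs reported above.
match_ids = ["bcal0001", "bcal0002", "bcam0003", "bca199f"]
db_ids = ["BCAL0001", "BCAL0002", "BCAM0003"]

only_in_match, only_in_db = discrepant_ids(match_ids, db_ids)
print(only_in_match)
print(only_in_db)
```

IDs that appear in <code>only_in_match</code> correspond to the rows that showed "#N/A" in the "MATCH: 1 to 2" column.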
==Export of Build 2, a custom version of GenMAPP builder: conducted on 12/1==
*Anu provided a [[Media:gmbuilder-genialomics-12012015-build-2.zip|custom build of GenMAPP builder]] that included a customized species profile for J2315 as the only change.
===Creating the database "B.cenocepacia_J2315_20151201_BUILD2_genialomics" in PostgreSQL===
*Steps taken were sourced from the [[Running GenMAPP Builder | Running GenMAPP Builder page]]
*pgAdmin III was launched and a connection to the server was made. "Databases" was right-clicked and "New Database..." was chosen. The database was given a name, B.cenocepacia_J2315_20151201_gmbuilder-genialomics-20151201, and OK was clicked.
*The new database was selected and the Query Tool was launched. Open File was clicked in the Query Tool and ''gmbuilder.sql'' in the '''gmbuilder-genialomics-12012015-build-2''' folder (within the ''sql'' folder) was selected. Upon selection of that file, a query was loaded into Query Tool and it was subsequently executed by clicking the green "Execute Query" arrow.
*This query populates the created database with all of its tables. In order to ensure that the query properly worked, it was checked that 167 tables existed in the database
===Data Import into ''gmbuilder-genialomics-12012015-build-2''===
*gmbuilder.bat in the gmbuilder-genialomics-12012015-build-2 folder was launched
*Under file -> configure database, the host was left as localhost, the port number was left as 5432, database name was set to ''gmbuilder-genialomics-12012015-build-2'', Username was set to postgres, Password was set to the password of the PostgreSQL database that was recently created. OK was clicked.
*File -> Import UniProt XML was selected
**The UniProt XML file that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
*File -> Import GO OBO-XML was selected
**The GO OBO-XML that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
*File -> Import GOA was selected
**The GOA file that was downloaded previously was chosen, open was clicked, and the import process was allowed to proceed uninterrupted.
===Exporting a GenMAPP Gene Database (.gdb file)===
*File -> Export to GenMAPP Gene Database was selected
*BL was typed into the Owner field. The species of interest was selected for export (''B. cenocepacia J2315'')
*Next was clicked, the create GenMAPP database file/location was selected, and the boxes for the exporting of Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms were left checked. The export process was initialized by clicking next; the windows were left open for the program to continue and finish with the export process (was estimated to take somewhere between 1-2 hrs). The database was given the name "Bc-Std_GEN_BL12_20151201.gdb".
===Database Testing Report for Build 2===
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Testing Report for Build 2 Export]]
==Exports of Build 3 and 4, custom versions of GenMAPP builder: conducted on 12/3==
*Gene database (.gdb) exports were conducted for builds 3 and 4 of the customized GenMAPP builder.
*Build 3 of GenMAPP Builder: program was modified so that the gene names will be picked up from ORF data rather than ordered locus data.
*Build 4 of GenMAPP Builder: TallyEngine code was cleaned up and code errors were fixed.
===Creating the postgres databases for builds 3 and 4 of customized GenMAPP builder===
*pgAdmin III was launched and a connection to the server was made. "Databases" was right-clicked and "New Database..." was chosen. The database, for Build 3, was given the name ''B.cenocepacia_J2315_20151203_BUILD3_genialomics'', and OK was clicked. The same procedure was repeated for Build 4; the build 4 database was given the name ''B.cenocepacia_J2315_20151204_BUILD4_genialomics''.
*The new databases were selected and the Query Tool was launched. Open File was clicked in the Query Tool and ''gmbuilder.sql'' in the ''sql'' folder of the '''gmbuilder folder''' of Build 3 and Build 4 of the customized GenMAPP builder was selected. Upon selection of that file, a query was loaded into Query Tool and it was subsequently executed by clicking the green "Execute Query" arrow.
*This query populates the created databases with all of their tables. In order to ensure that the query properly worked, it was checked that 167 tables existed in the databases.
===Data Import into ''B.cenocepacia_J2315_20151203_BUILD3_genialomics'' and ''B.cenocepacia_J2315_20151204_BUILD4_genialomics''===
*gmbuilder.bat was launched in the respective GenMAPP builder folder for both of the exports involving the two custom builds.
*Under file -> configure database, the host was left as localhost, the port number was left as 5432, database name was set to ''B.cenocepacia_J2315_20151203_BUILD3_genialomics'', in the case of the third build of GenMAPP builder. The database name was set to ''B.cenocepacia_J2315_20151204_BUILD4_genialomics'' in the case of the fourth build.
*The username was set to postgres, Password was set to the password of the PostgreSQL server that was being utilized. OK was clicked.
*File -> Import UniProt XML was selected
**The UniProt XML file that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
*File -> Import GO OBO-XML was selected
**The GO OBO-XML that was previously extracted was chosen, open was clicked. The import process was allowed to proceed uninterrupted.
*File -> Import GOA was selected
**The GOA file that was downloaded previously was chosen, open was clicked, and the import process was allowed to proceed uninterrupted.
===Exporting a GenMAPP Gene Database (.gdb file)===
*File -> Export to GenMAPP Gene Database was selected
*BL was typed into the Owner field. The species of interest was selected for export (''B. cenocepacia J2315'')
*Next was clicked, the create GenMAPP database file/location was selected, and the boxes for the exporting of Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms were left checked. The export process was initialized by clicking next; the windows were left open for the program to continue and finish with the export process (was estimated to take somewhere between 1-2 hrs). The database for the build 3 export was given the name of "Bc-Std_GEN_Build3_20151203.gdb"; the one for the build 4 export was given the name "Bc-Std_GEN_Build4_20151204.gdb".
===Database Testing Report for Builds 3 and 4===
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Testing Report for Build 3 Export]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Testing Report for Build 4 Export]]

----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-08T06:48:26Z

Blitvak: added genmapp analysis (wrapped it up)

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: File was exported without any major issues, however, the export appeared to take significantly longer than the previous exports. It is likely that the export took so long because the workstation had, for some period of time, entered a "sleep" mode (export was delayed, as the computer had to be taken off of "sleep").

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This command was entered at the command line in order to use XMLPipeDB match to verify the ORF gene name count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs (which were identified in the [[Blitvak Week 14|Week 14 assignment]]). Barring those 5 IDs, the results by XMLPipeDB Match line up with what TallyEngine reports (since the Match query grabs ORF data).
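The reason a whole-file pattern search can overcount is that it sees every piece of text in the XML, not only gene names. The Python sketch below illustrates this with a hypothetical three-line sample: one gene name and two made-up checksum strings that happen to contain the discrepant fragments reported in the Week 14 journal. It is not the XMLPipeDB Match implementation; case-insensitive matching and lowercased output are assumptions made here because the discrepant IDs were reported in lowercase.

```python
import re

# Final pattern from this report, written without the literal commas and spaces.
PATTERN = re.compile(
    r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?", re.IGNORECASE
)


def unique_matches(text):
    """Return the set of distinct pattern matches in the text, lowercased."""
    return {m.group(0).lower() for m in PATTERN.finditer(text)}


# Hypothetical sample: the checksum values are invented for illustration.
SAMPLE = (
    '<name type="ORF">BCAL0001</name>\n'
    '<sequence checksum="7BCA199F01234567">MSTN</sequence>\n'
    '<sequence checksum="00BCAD837B123456">MKLV</sequence>\n'
)

found = unique_matches(SAMPLE)
print(sorted(found))
```

In this sample the fragment <code>bca199f</code> is picked up from a checksum, while <code>bcad837b</code> is not, because "D" is not one of the allowed letters after "BCA" in the narrowed pattern.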

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are now the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.
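The counting queries above rely on the PostgreSQL <code>~</code> operator, which succeeds when the pattern matches anywhere inside the value. The Python sketch below mimics that behavior with <code>re.search</code> on a small in-memory list of rows, and contrasts it with an anchored pattern. The rows, the function name, and the anchored variant are hypothetical and only illustrate the difference; they are not taken from the actual database.

```python
import re

# Hypothetical (type, value) rows standing in for the genenametype table.
ROWS = [
    ("ORF", "BCAL0001"),
    ("ORF", "pBCA001"),
    ("ordered locus", "BceJ2315_00010"),
    ("ORF", "notBCAM0002x-extra"),  # invented value that merely contains an ID
]

PATTERN = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?"


def count_matching(rows, name_type, pattern):
    """Count rows of the given type whose value contains a match for the pattern."""
    regex = re.compile(pattern)
    return sum(1 for row_type, value in rows
               if row_type == name_type and regex.search(value))


# Unanchored: behaves like value ~ 'pattern'.
unanchored_count = count_matching(ROWS, "ORF", PATTERN)
# Anchored: the whole value must be an ID, like value ~ '^pattern$'.
anchored_count = count_matching(ROWS, "ORF", "^" + PATTERN + "$")
print(unanchored_count, anchored_count)
```

If the two counts ever differ on real data, the difference points to values that contain an ID-like fragment without being an ID themselves.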

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_Build4_20151204.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data (and any difference in the functionality of GenMAPP builder).
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: It was noticed that the OriginalRowCounts table in this export is identical to the one that came from the Build 3 export. This seems to suggest that the only fundamental difference between the two builds of GenMAPP builder lies with TallyEngine (this makes sense, considering that Build 4 focused upon fixing problems with TallyEngine and improper code).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
*In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table appears to not have any problems. The ordered locus names table, like in Build 3, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: Visually, no changes seem apparent between the Build 3 and Build 4 export.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build4_20151204.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this error count is identical to what was seen with the build 3 export.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened through Excel and it was found out that the exceptions were identical to what was found with the Build 3 export.

===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. The count of 7121 genes that is represented by the exported database appears to be very similar to the values reported by external sources/databases. It is likely that all of the genes covered by UniProt appear within the database. It is not known whether all of the coding sequences covered by the MOD appear within the database as some of the coding sequences represent hypothetical protein encoding genes or functional RNA.

*Note: The IDs and counts covered by this export appear to be consistent with outside resources.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 3 Export)

2015-12-08T06:41:55Z

Blitvak: added gennmapp work

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-3.zip|GenMAPP Builder Custom, Build 3]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.99 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.77 minutes
* Time taken to process: 4.06 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.05 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|Bc-Std_GEN_Build3_20151203.gdb]]
* Time taken to export: 4 hours 37 minutes
** Start time: 7:24 pm
** End time: 12:01 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took a little over 2 hours longer than the previous export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.
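Because the export ran past midnight, the elapsed time is easy to miscalculate by hand. The following is a small Python sketch of the arithmetic, using the start and end times listed above; the helper name <code>elapsed</code> is hypothetical, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Return end - start for two clock times, rolling past midnight if needed."""
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # The end time falls on the following day.
        t1 += timedelta(days=1)
    return t1 - t0


export_time = elapsed("7:24 PM", "12:01 AM")
hours, remainder = divmod(int(export_time.total_seconds()), 3600)
print(f"{hours} hours {remainder // 60} minutes")
```

This reproduces the reported export time of 4 hours 37 minutes.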

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build3tallyengine results GEN BL14 20151203.png]]
**Note: These results are identical to what was found in the initial export and in the export involving the second build of a modified GenMAPP builder (see [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 testing report]]). Since GenMAPP builder was modified, for Build 3, so that the gene names would be collected by the program from the ORF data rather than the ordered locus data, it appears that there exist some errors in the program that are preventing it from properly collecting and taking into account the "ORF" data that resides in the XML file.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because these TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as being "ordered locus". The match command was found to reflect the data that is found under the "ORF" tag; the ORF and ordered locus counts are both very different, and this is reflected in the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.
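To illustrate why a pattern search over the whole XML file can report a few more unique hits than a query restricted to "ORF" entries, here is a rough Python analogue of counting unique pattern matches. It is only a sketch: it does not reproduce XMLPipeDB Match itself, the XML fragment is hypothetical, and <code>unique_matches</code> is an illustrative helper.

```python
import re


def unique_matches(pattern, text):
    """Return the set of distinct substrings that match the pattern."""
    return {m.group(0) for m in re.finditer(pattern, text)}


# Pattern as entered on the command line above.
PATTERN = r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?"

# Hypothetical XML fragment. The string in the <text> element is an
# accidental hit in free text, unrelated to any gene name entry.
sample_xml = """
<name type="ORF">BCAL0001</name>
<name type="ORF">BCAL0001</name>
<name type="ORF">BCAM0005</name>
<text>(clone BCA123)</text>
"""

found = unique_matches(PATTERN, sample_xml)
print(len(found), sorted(found))
```

In this toy input the repeated ID is counted once, while the free-text hit still adds one to the unique total.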

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Once again, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data. TallyEngine was modified to focus upon the "ORF" data; however, it appears that there are issues that are preventing it from doing so.

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data.
*Benchmark .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|compressed Bc-Std_Build2_GEN_BL14_20151201.gdb]]
*'''OriginalRowCounts for the Build 2 export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
Note: It was noticed that the OriginalRowCounts table in this export is mostly identical to the one found through the Build 2 export. However, there were differences in the OrderedLocusNames table between the two exports. The recent export, the Build 3 export, contained 7121 rows in the OrderedLocusNames table (which indicates 7121 entries, the same as the number of ORF gene names in the XML), while the last export, the Build 2 export, contained 337 rows in the OrderedLocusNames table. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP builder (Build 3) is now focusing on the ORF data; it appears, however, that it is now labeling the "ORF" data as being OrderedLocusNames instead of the "ordered locus" data. The observation in the OriginalRowCounts table does not completely mesh with what was found earlier in the PSQL database. In the PSQL database, it was found that the OrderedLocusName data was still the "ordered locus" gene names that reside in the XML (and the "ORF" data are the 7121 gene names of interest). In conclusion, it appears that there are still some issues with TallyEngine and GenMAPP builder (such as TallyEngine not reporting the ORF data).
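The table-by-table comparison described above can also be done programmatically once the two OriginalRowCounts tables are copied out of the viewer. A minimal Python sketch, assuming each table is available as a name-to-count mapping; the OrderedLocusNames counts are the ones reported above, while the RefSeq count and the helper name <code>diff_row_counts</code> are hypothetical placeholders.

```python
def diff_row_counts(old, new):
    """Return {table: (old_count, new_count)} for tables whose counts differ."""
    tables = sorted(set(old) | set(new))
    return {t: (old.get(t), new.get(t)) for t in tables if old.get(t) != new.get(t)}


# OrderedLocusNames counts come from the note above; the RefSeq count is a
# made-up placeholder standing in for the tables that did not change.
build2 = {"OrderedLocusNames": 337, "RefSeq": 1000}
build3 = {"OrderedLocusNames": 7121, "RefSeq": 1000}

changes = diff_row_counts(build2, build3)
print(changes)
```

Only the tables whose counts differ are returned, so an empty result would indicate identical row counts between two exports.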

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
*In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table appears to not have any problems. The ordered locus names table, now, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_GEN_Build3_20151203.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt''

Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 284 errors in the loaded raw data; this new error count is significantly smaller than the 7251 errors that were detected using the previously exported database. Since this export incorporates the ORF data, it appears that the majority of the genes present in the microarray dataset are covered.

===Putting a gene on the MAPP using the GeneFinder window===
*A test expression data-set was created in order to observe the behavior of GenMAPP with the exported database
*GeneFinder was loaded by placing a blank ''Gene'' element on the drafting board of GenMAPP and right-clicking it.
*The genes BCAL0001, BCAL0002, BCAM0005, and BCAS0105 were searched in the Gene ID box, with the Gene ID System set to OrderedLocusNames
**All genes were successfully found and reference pages with links successfully appeared
Note: All cross-referenced IDs were present for all of these sample Gene IDs. No crashing or issues at this step.

===Creating an Expression Dataset in the Expression Dataset Manager===
*The IDs in the microarray dataset were imported into GenMAPP using the new database; there existed 284 exceptions.
*The EX.txt file was opened through Excel and it was found out that the error code for all of the exceptions was: ''Gene not found in OrderedLocusNames or any related system.'' The Gene IDs were sorted by error and the problematic IDs were analyzed. It was found, through the find function, that 101 of the exceptions were due to alterations in the usual formatting of the gene name (these gene names contained underscores, Js, and numbers). The rest of the exceptions, it was found (via UniProt KB searches), represented genes that are not present in the UniProt database; the majority of these gene exceptions, additionally, were not present in the Burkholderia model organism database. The ones that were present were all found to be non-coding DNA or DNA with no known associated protein.
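The first step of the exception analysis, separating IDs with altered formatting from IDs in the usual form, could be sketched in Python as follows. This is an illustration only: the anchored pattern is an assumed description of the "usual" gene name form, and the sample IDs are hypothetical rather than taken from the EX.txt file.

```python
import re

# Assumed description of a normally formatted gene name; fullmatch() requires
# the entire ID to fit, so suffixes such as "_J_0" are rejected.
EXPECTED_FORM = re.compile(r"p?BCA[LMS]?[0-9]{3}[Aa]?[0-9]?[A-Za-z]?")


def split_by_format(ids):
    """Partition IDs into (well_formed, altered) lists."""
    well_formed, altered = [], []
    for gene_id in ids:
        if EXPECTED_FORM.fullmatch(gene_id):
            well_formed.append(gene_id)
        else:
            altered.append(gene_id)
    return well_formed, altered


# Hypothetical exception IDs for illustration.
exception_ids = ["BCAL0001", "BCAM0005_J_0", "BCAS0105", "BCAL0002_1"]

well_formed, altered = split_by_format(exception_ids)
print(len(well_formed), len(altered))
```

IDs in the well-formed group would then be the ones worth looking up in UniProt KB and the MOD.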

===Running MAPPFinder===
*Protocol sourced from [[Blitvak Week 8 | the week 8 assignment]]
*The MAPPFinder program was launched within GenMAPP (Tools > MAPPFinder)
*"Calculate New Results" was clicked in the window that appeared by launching MAPPFinder
*For "Find File", the Expression Dataset file (with a .gex extension) was selected, and OK was clicked
*The Test criteria was selected
*The boxes corresponding to "Gene Ontology" were checked
*"Browse" button was clicked to add a name to the file that will be created
*"Run MAPPFinder" was clicked and the program was allowed to complete its analysis

Note: MAPPFinder successfully loaded and provided an output with this gene database.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the one of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which are not covered by UniProt).

*Note: The exported database now seems more in line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusNames counts (which actually represent ORF counts) seem very close to the counts expressed by the MOD and by UniProt.
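To put "very close" into numbers, the gaps between the exported count and the two outside resources can be computed directly. A short Python sketch using only the counts reported above; the helper name <code>gap</code> is hypothetical.

```python
# Counts reported above.
gdb_count = 7121      # OrderedLocusNames rows in the exported .gdb
uniprot_count = 6994  # protein-coding entries found in UniProt KB
mod_count = 7114      # coding sequences listed in the MOD


def gap(reference, observed):
    """Return (absolute difference, percentage of the reference count)."""
    delta = observed - reference
    return delta, round(100 * delta / reference, 2)


vs_uniprot = gap(uniprot_count, gdb_count)
vs_mod = gap(mod_count, gdb_count)
print(vs_uniprot, vs_mod)
```

The exported database holds 127 more IDs than UniProt KB and 7 more than the MOD, both within about 2% of the reference counts.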
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 2 Export)

2015-12-08T06:16:58Z

Blitvak: added genmapp analysis

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12012015-build-2.zip|GenMAPP Builder Custom, Build 2]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151201_BUILD2_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.72 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.25 minutes
* Time taken to process: 3.91 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|Bc-Std_Build2_GEN_BL14_20151201.gdb]]
* Time taken to export: 4 hours 22 minutes
** Start time: 10:27 pm
** End time: 2:49 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took almost 2 more hours than the previous export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151201_BUILD2_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Tallyengine results GEN BL14 20151201.png]]
**Note: These results are identical to what was found through the initial, dry, export of the gene database. This isn't all too surprising since the only difference in build 2 is that there exists a customized species profile for J2315. The data that GenMAPP builder collects from the XML is still the same since no major coding modifications were done on the program. As outlined in the [[Blitvak Week 14|Week 14 assignment]], future versions of GenMAPP builder should collect the data related to the "ORF" gene name data, rather than the "ordered locus" gene name data that is collected by default. It is apparent, through analysis of the XML file and through Match commands, that the XML only contains "ORF" names for most of its genes (see [[Blitvak Week 14|Week 14 assignment]], and the [[GÉNialOMICS Gene Database Testing Report (Initial Export)|initial export testing report]]).

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory)
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because these TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as being "ordered locus". The match command was found to reflect the data that is found under the "ORF" tag; the ORF and ordered locus counts are both very different, and this is reflected in the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data.

==OriginalRowCounts Comparison==

*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 2 Export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
*It was decided that a good reference or "benchmark" would be the database that was created for the initial, dry, export of the gene data related to J2315; comparing the two should bring to light any differences that are the result of the export.
*Benchmark .gdb file: [[Media:Bc-Std GEN BL12 20151119.zip |compressed Bc-Std_GEN_BL12_20151119.gdb]]
*'''OriginalRowCounts for Initial Export of J2315'''
**[[File:OriginalRowCounts(initial_export)_GEN_BL12_20151123.png]]

Note: It was noticed that the OriginalRowCounts table in this export is identical to the one found in the [[GÉNialOMICS Gene Database Testing Report (Initial Export)|initial export]]; this is not surprising since the utilized GenMAPP builder is mostly identical to one that was utilized in the initial export.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
*In the UniProt table, it appears that the ID and EntryName columns involve the correct ID form for J2315. The GeneName column in UniProt, however, appears to be missing most of its entries. No gene names in the basic form of '''BCA[S,L,M]####''' and '''BceJ2315_#####''' can be found. Very few gene names are present, and those present are in the form of either four letters with the final letter being capital or in the form of three uncapitalized letters.
*RefSeq table appears to be in order
*OrderedLocusNames table, as suggested by earlier analysis, only contains 337 rows and IDs in the format of '''BceJ2315_#####'''.

Note: It was noticed that the UniProt table only contained gene names that are either of "ordered locus" type or in the format of four letters.

==.gdb Use in GenMAPP==
*Some of the protocol from [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] was used as a reference for this portion of the assignment
*''Bc-Std_Build2_GEN_BL14_20151201.gdb'' was placed within the Gene Databases folder of the GenMAPP directory (the folder is within the GenMAPP 2 Data folder)
*GenMAPP (Version 2.1) was launched
*The new gene database was loaded by going into ''Data > Choose Gene Database''
*The tab-delimited, GenMAPP-formatted [[Media:For_genMAPP_KWVP20151205.txt|data]] sourced from the microarray paper was loaded into GenMAPP through ''Data > Expression Dataset Manager > Expression Datasets > New Dataset > GenMAPP formatted microarray data_GEN_B14_20151207.txt''

*Note: There were no glaring issues with loading the files into GenMAPP (no crashes). However, this gene database led to the detection of 7251 errors in the loaded data. It is suspected that this gene database does not cover the majority of the genes within the microarray data (which is expected, since the microarray data represents ORF gene names)

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*Only 337 OrderedLocusNames IDs were found in the exported database; 7384 annotated genes, however, are present in the MOD
**[[File:J2315MODGENES_GEN_BL12_20151123.png]]

Note: Many more OrderedLocusNames IDs should be present in the exported database than were found. Data on the MOD and the executed Match queries help confirm this. The current number of OrderedLocusNames (337) is very far from the numbers that were seen in the MOD (7384 annotated genes, with 7114 involved with the coding of protein). It is now known that the count is so low because GenMAPP builder, at the moment, is programmed to only pick up "ordered locus" data from the XML; most of the gene names reside as "ORF" data, which explains why most of the data is not present in the export.

----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-08T05:57:55Z

Blitvak: wrapped up most the report

GÉNialOMICS Gene Database Testing Report (Build 4 Export)

2015-12-08T05:46:51Z

Blitvak: version one of build 4 export

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-4.zip|GenMAPP Builder Custom, Build 4]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151204_BUILD4_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.46 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.05 minutes
* Time taken to process: 3.75 minutes
** Note: Time taken appears to be slightly shorter than previous exports.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std GEN Build4 20151204.zip|Bc-Std GEN Build4 20151204.gdb]]
* Time taken to export: 11 hours 6 minutes
** Start time: 7:51 am
** End time: 6:57 pm
** Note: The file was exported without any major issues; however, the export took significantly longer than the previous exports. It is likely that the export took so long because the workstation had, for some period of time, entered "sleep" mode (the export was delayed until the computer was taken out of "sleep").
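The reported export duration can be cross-checked from the start and end times. The following is a minimal sketch, not part of the original workflow; the helper name <code>elapsed</code> is hypothetical, and it assumes both times fall within a single 24-hour window.

```python
from datetime import datetime, timedelta

def elapsed(start, end):
    """Return the time between two 12-hour clock readings such as '7:51 am'.

    If the end reading is not later than the start, the export is assumed
    to have run past midnight into the next day.
    """
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start.upper(), fmt)
    t1 = datetime.strptime(end.upper(), fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)
    return t1 - t0

# Build 4 export: 7:51 am to 6:57 pm
print(elapsed("7:51 am", "6:57 pm"))
# Build 3 export crossed midnight: 7:24 pm to 12:01 am
print(elapsed("7:24 pm", "12:01 am"))
```

The two calls correspond to the 11 hours 6 minutes reported here and the 4 hours 37 minutes reported for the Build 3 export.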

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151204_BUILD4_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build4tallyengine_results_GEN_BL14_20151204.png]]
**Note: These results differ significantly from what was found in previous exports. The 337 Ordered Locus gene names are now distinct from the 7121 ORF gene names (and are represented, as such, by TallyEngine). All of the counts related to external references (like UniProt) remain the same. The major and crucial change is the inclusion and representation of the ORF data.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results vary slightly from what was found by TallyEngine due to the presence of 5 discrepant IDs (which were identified in the [[Blitvak Week 14|Week 14 assignment]]). Barring those 5 IDs, the results by XMLPipeDB Match line up with what TallyEngine reports (since the Match query grabs ORF data).
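The difference between a raw-text match count and an ORF-only count can be illustrated with a small sketch. This is not the XMLPipeDB Match tool itself: the XML fragment below is a made-up sample, the pattern is a tidied variant of the one used above (commas and spaces removed from the character classes), and the namespace URI is assumed to be the one used by UniProt XML files.

```python
import re
import xml.etree.ElementTree as ET

# Tidied variant of the Match pattern (an assumption, not the exact string that was run).
PATTERN = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?"

# Hypothetical UniProt-like fragment; the real file is far larger.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <gene>
      <name type="ordered locus">BceJ2315_0001</name>
      <name type="ORF">BCAL0001</name>
    </gene>
    <comment><text>Similar to BCAM0002 of another strain</text></comment>
  </entry>
  <entry>
    <gene>
      <name type="ORF">pBCA001</name>
    </gene>
  </entry>
</uniprot>"""

NS = {"u": "http://uniprot.org/uniprot"}

# Raw-text approach: every unique match anywhere in the file.
raw_matches = set(re.findall(PATTERN, SAMPLE))

# XML-aware approach: only the text of gene/name elements typed "ORF".
root = ET.fromstring(SAMPLE)
orf_names = {
    el.text
    for el in root.findall(".//u:gene/u:name[@type='ORF']", NS)
    if el.text and re.fullmatch(PATTERN, el.text)
}

# IDs that the raw-text count picks up outside of ORF gene names.
discrepant = raw_matches - orf_names
print(len(raw_matches), len(orf_names), sorted(discrepant))
```

In this sample the raw-text count is one higher than the ORF count because an ID-like string appears in free text, which is the same kind of discrepancy as the 5 extra matches described above.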

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "Ordered Locus" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts
**7121 counts were found which is identical to what was found through XMLPipeDB match (ignoring the discrepant IDs) and to what was reported by TallyEngine (for the ORF data).
*Are your results the same as reported by the TallyEngine? Why or why not?
**The results are now the same as what was reported by TallyEngine; this is due to the fact that the most recent build incorporated code fixes that allowed GenMAPP builder, and TallyEngine, to properly include the ORF data in their analysis/work.
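The logic of the two counting queries can be tried out without a PostgreSQL server. The sketch below is only an approximation: it uses an in-memory SQLite database as a stand-in, registers a user-defined <code>REGEXP</code> function to mimic the unanchored, case-sensitive behaviour of the PostgreSQL <code>~</code> operator, and fills the <code>genenametype</code> table with a handful of hypothetical rows.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)

conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
# Hypothetical rows; the real table holds every gene/name entry from the XML.
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "BceJ2315_0001"),
        ("ordered locus", "BceJ2315_0002"),
        ("ORF", "BCAL0001"),
        ("ORF", "BCAM1234A"),
        ("ORF", "pBCA055"),
        ("primary", "dnaA"),
    ],
)

def count(name_type, pattern):
    """Count rows of one gene-name type whose value matches the pattern."""
    (n,) = conn.execute(
        "SELECT count(*) FROM genenametype WHERE type = ? AND value REGEXP ?",
        (name_type, pattern),
    ).fetchone()
    return n

ordered_locus_count = count("ordered locus", r"BceJ2315_[0-9][0-9][0-9][0-9]")
orf_count = count("ORF", r"p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?")
print(ordered_locus_count, orf_count)
```

Filtering on <code>type</code> as well as on the pattern is what keeps the "ordered locus" and "ORF" counts separate, which is the distinction the TallyEngine results now reflect.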

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 4 export of J2315'''
**[[File:build4OriginalRowCounts_GEN_BL14_20151204.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 3 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data (and any difference in the functionality of GenMAPP builder).
*Benchmark .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|compressed Bc-Std_GEN_Build3_20151203.gdb]]
*'''OriginalRowCounts for the Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
Note: It was noticed that the OriginalRowCounts table in this export is mostly identical to the one found through the Build 2 export. However, there were differences in the OrderedLocusNames table between the two exports. The recent export, the Build 3 export, contained 7121 rows in the OrderedLocusNames table (which indicates 7121 entries, the same as the number of ORF gene names in the XML), while the last export, the Build 2 export, contained 337 rows in the OrderedLocusNames table. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP builder (Build 3) is now focusing on the ORF data; it appears, however, that it is now labeling the "ORF" data, instead of the "ordered locus" data, as OrderedLocusNames. The observation in the OriginalRowCounts table does not completely mesh with what was found earlier in the PSQL database, where the OrderedLocusName data was still the "ordered locus" gene names that reside in the XML (and the "ORF" data are the 7121 gene names of interest). In conclusion, it seems that there are problems with TallyEngine and GenMAPP builder (such as TallyEngine not reporting the ORF data).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The ordered locus names table, now, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==

Note: To do.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein-encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the one of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which are not covered by UniProt).

*Note: The exported database now seems more in line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusName counts (which actually represent ORF counts) seem very close to the counts expressed by the MOD and by UniProt.
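How close the counts are can be made explicit with a short calculation. This is a minimal sketch using only the counts quoted in this section; the dictionary labels are informal descriptions, not names taken from any tool.

```python
# Counts quoted in this section.
gdb_count = 7121  # OrderedLocusNames rows in the exported gdb (ORF data)
references = {
    "MOD coding sequences": 7114,
    "UniProt KB protein-encoding entries": 6994,
    "earlier 'ordered locus' count": 337,
}

# Absolute difference between the gdb count and each reference count.
differences = {label: gdb_count - n for label, n in references.items()}

for label, diff in differences.items():
    print(f"{label}: differs from the gdb count by {diff}")
```

The gdb count differs from the MOD by 7 and from UniProt KB by 127, compared with a gap of 6784 for the earlier "ordered locus" count.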
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 3 Export)

2015-12-08T05:45:57Z

Blitvak: fixed incorrect link, for .gdb file

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-3.zip|GenMAPP Builder Custom, Build 3]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.99 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.77 minutes
* Time taken to process: 4.06 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.05 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std_GEN_Build3_20151203.zip|Bc-Std_GEN_Build3_20151203.gdb]]
* Time taken to export: 4 hours 37 minutes
** Start time: 7:24 pm
** End time: 12:01 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took a little over 2 hours longer than the previous export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build3tallyengine results GEN BL14 20151203.png]]
**Note: These results are identical to what was found in the initial export and in the export involving the second build of the modified GenMAPP builder (see [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 testing report]]). Since GenMAPP builder was modified, for Build 3, so that gene names are collected from the ORF data rather than the ordered locus data, it appears that some errors in the program are preventing it from properly collecting and taking into account the "ORF" data that resides in the XML file.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because the TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as "ordered locus". The match command was found to reflect the data under the "ORF" tag; the ORF and ordered locus counts differ greatly (7126 matches versus 337), and this accounts for the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.
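One detail of the pattern used above is worth noting: inside a character class, commas and spaces are literal characters, so <code>[L,S,M]</code> also accepts a comma and <code>[A-Z, a-z]</code> also accepts a comma or a space. The sketch below illustrates this with Python's <code>re</code> module, on the assumption that XMLPipeDB Match follows the same standard character-class semantics; the test string is made up, and <code>TIDY</code> is a hypothetical cleaned-up variant rather than a command that was run.

```python
import re

# Pattern as typed in the Match command above.
AS_RUN = r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?"
# Hypothetical tidied variant without the stray commas and space.
TIDY = r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?"

# Made-up text: one well-formed ID followed by a space, and one string with a comma.
text = "BCAL0001 and BCA,123"

as_run_matches = re.findall(AS_RUN, text)
tidy_matches = re.findall(TIDY, text)

print(as_run_matches)
print(tidy_matches)
```

With the pattern as typed, the first match absorbs the trailing space and the comma-containing string is also accepted; the tidied variant returns only the well-formed ID. Loose matching of this kind is one way a raw-text count can end up slightly higher than the number of real gene IDs.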

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Once again, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine, since both focus on the same set of data. TallyEngine was modified to focus on the "ORF" data; however, it appears that there are issues preventing it from doing so.

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data.
*Benchmark .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|compressed Bc-Std_Build2_GEN_BL14_20151201.gdb]]
*'''OriginalRowCounts for the Build 2 export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
Note: It was noticed that the OriginalRowCounts table in this export is mostly identical to the one found through the Build 2 export. However, there were differences in the OrderedLocusNames table between the two exports. The recent export, the Build 3 export, contained 7121 rows in the OrderedLocusNames table (which indicates 7121 entries, the same as the number of ORF gene names in the XML), while the last export, the Build 2 export, contained 337 rows in the OrderedLocusNames table. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP builder (Build 3) is now focusing on the ORF data; it appears, however, that it is now labeling the "ORF" data, instead of the "ordered locus" data, as OrderedLocusNames. The observation in the OriginalRowCounts table does not completely mesh with what was found earlier in the PSQL database, where the OrderedLocusName data was still the "ordered locus" gene names that reside in the XML (and the "ORF" data are the 7121 gene names of interest). In conclusion, it seems that there are problems with TallyEngine and GenMAPP builder (such as TallyEngine not reporting the ORF data).
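The comparison of two OriginalRowCounts tables can also be done programmatically once the counts are copied out of the viewer. The following is a minimal sketch: only the OrderedLocusNames values (337 and 7121) come from this report, while the UniProt and RefSeq counts are hypothetical placeholders and the helper name <code>diff_row_counts</code> is made up for illustration.

```python
def diff_row_counts(old, new):
    """Return {table: (old_count, new_count)} for tables whose counts differ.

    A table missing from one export is reported with None on that side.
    """
    changes = {}
    for table in sorted(set(old) | set(new)):
        before, after = old.get(table), new.get(table)
        if before != after:
            changes[table] = (before, after)
    return changes

# OrderedLocusNames counts are from the report; the other values are placeholders.
build2 = {"OrderedLocusNames": 337, "UniProt": 1000, "RefSeq": 1000}
build3 = {"OrderedLocusNames": 7121, "UniProt": 1000, "RefSeq": 1000}

changed = diff_row_counts(build2, build3)
print(changed)
```

Only tables whose row counts changed are returned, which makes a single shifted table such as OrderedLocusNames stand out immediately.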

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The ordered locus names table, now, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==

Note: To do.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6,994 entries corresponding to protein-encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the one of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which are not covered by UniProt).

*Note: The exported database now seems more in line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusName counts (which actually represent ORF counts) seem very close to the counts expressed by the MOD and by UniProt.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

File:Bc-Std GEN Build3 20151203.zip

2015-12-08T05:45:12Z

Blitvak: exported gdb from build 3 of the modified genmapp builder

GÉNialOMICS Gene Database Testing Report (Build 3 Export)

2015-12-08T05:42:54Z

Blitvak: fixed match section

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-3.zip|GenMAPP Builder Custom, Build 3]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.99 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.77 minutes
* Time taken to process: 4.06 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.05 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|Bc-Std_Build2_GEN_BL14_20151201.gdb]]
* Time taken to export: 4 hours 37 minutes
** Start time: 7:24 pm
** End time: 12:01 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took a little over 2 hours longer than the previous export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build3tallyengine results GEN BL14 20151203.png]]
**Note: These results are identical to what was found in the initial export and in the export involving the second build of the modified GenMAPP builder (see [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 testing report]]). Since GenMAPP builder was modified, for Build 3, so that gene names are collected from the ORF data rather than the ordered locus data, it appears that some errors in the program are preventing it from properly collecting and taking into account the "ORF" data that resides in the XML file.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because the TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as "ordered locus". The match command was found to reflect the data under the "ORF" tag; the ORF and ordered locus counts differ greatly (7126 matches versus 337), and this accounts for the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Once again, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**7121 counts were found which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine, since both focus on the same set of data. TallyEngine was modified to focus on the "ORF" data; however, it appears that there are issues preventing it from doing so.

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data.
*Benchmark .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|compressed Bc-Std_Build2_GEN_BL14_20151201.gdb]]
*'''OriginalRowCounts for the Build 2 export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
Note: It was noticed that the OriginalRowCounts table in this export is mostly identical to the one found through the Build 2 export. However, there were differences in the OrderedLocusNames table between the two exports. The recent export, the Build 3 export, contained 7121 rows in the OrderedLocusNames table (which indicates 7121 entries, the same as the number of ORF gene names in the XML), while the last export, the Build 2 export, contained 337 rows in the OrderedLocusNames table. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP builder (Build 3) is now focusing on the ORF data; it appears, however, that it is now labeling the "ORF" data, instead of the "ordered locus" data, as OrderedLocusNames. The observation in the OriginalRowCounts table does not completely mesh with what was found earlier in the PSQL database, where the OrderedLocusName data was still the "ordered locus" gene names that reside in the XML (and the "ORF" data are the 7121 gene names of interest). In conclusion, it seems that there are problems with TallyEngine and GenMAPP builder (such as TallyEngine not reporting the ORF data).

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, like before, it is apparent that only gene names of the type "ordered locus" are represented (no signs of gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The ordered locus names table, now, only reflects gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==

Note: To do.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6994 entries corresponding to protein encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the count of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which is not covered by UniProt).

*Note: The exported database now seems more in line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusNames counts (which actually represent ORF counts) are very close to the counts reported by the MOD and by UniProt.
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

File:Build4OriginalRowCounts GEN BL14 20151204.png

2015-12-08T05:39:18Z

Blitvak: GENIALOMICS WEEK 14: from build 4, original row counts found using mdb viewer


File:Build4tallyengine results GEN BL14 20151204.png

2015-12-08T05:27:01Z

Blitvak: build 4 tallyengine results, export Q/A, week 14, conducted 12/4


File:Bc-Std GEN Build4 20151204.zip

2015-12-08T05:19:36Z

Blitvak: gdb exported using the build 4 of the modified genmapp builder


GÉNialOMICS Gene Database Testing Report (Build 3 Export)

2015-12-08T05:13:08Z

Blitvak: wrapped up the core of the report for build 3

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12032015-build-3.zip|GenMAPP Builder Custom, Build 3]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151203_BUILD3_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.99 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.77 minutes
* Time taken to process: 4.06 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.05 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|Bc-Std_Build2_GEN_BL14_20151201.gdb]]
* Time taken to export: 4 hours 37 minutes
** Start time: 7:24 pm
** End time: 12:01 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took a little over 2 hours longer than the initial export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.
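The export duration above spans midnight, so it is easy to miscount by hand. A minimal sketch of the arithmetic, assuming the start and end times are written in 12-hour form and that the export crossed midnight at most once (the helper name <code>elapsed</code> is hypothetical):

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Return the time between two 12-hour clock readings.

    Assumes the end time falls at most one midnight after the start time.
    """
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # The end reading is on the following day.
        t1 += timedelta(days=1)
    return t1 - t0


export_time = elapsed("7:24 PM", "12:01 AM")
hours, remainder = divmod(int(export_time.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} hours {minutes} minutes")  # prints "4 hours 37 minutes"
```

The result agrees with the 4 hours 37 minutes recorded for this export.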

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151203_BUILD3_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Build3tallyengine results GEN BL14 20151203.png]]
**Note: These results are identical to what was found in the initial export and in the export involving the second build of the modified GenMAPP Builder (see [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 testing report]]). GenMAPP Builder was modified, for Build 3, so that gene names would be collected from the ORF data rather than the ordered locus data; since the tallies did not change, it appears that there exist some errors in the program that are preventing it from properly collecting and taking into account the "ORF" data that resides in the XML file.

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory). The results were identical to what was found in [[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|the build 2 export]].
*[[File:Tallyengine results GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because the TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as being "ordered locus". The match command was found to reflect the data that is found under the "ORF" tag; the ORF and ordered locus counts are very different from each other, and this is reflected in the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.
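The idea of counting unique pattern matches over the raw XML text, regardless of which tag they sit in, can be sketched in Python. This is only an illustration of the counting approach, not how XMLPipeDB match is implemented; the sample text and IDs are hypothetical, and the pattern is written without commas or spaces inside the character classes.

```python
import re

# Gene ID pattern; character classes are written without commas or spaces.
PATTERN = re.compile(r"p?BCA[LMS]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?")


def unique_matches(text):
    """Return the set of distinct substrings of text that match PATTERN."""
    return set(PATTERN.findall(text))


# Hypothetical stand-in for a fragment of the UniProt XML file.
sample = """
<gene><name type="ORF">BCAL0001</name></gene>
<gene><name type="ORF">BCAM0002</name></gene>
<gene><name type="ORF">pBCA001</name></gene>
<comment>see also BCAL0001</comment>
<gene><name type="ordered locus">BceJ2315_0001</name></gene>
"""

found = unique_matches(sample)
print(len(found), sorted(found))
```

Because the whole text is scanned, an ID repeated outside a gene name element is still counted only once, while any unrelated text that happens to fit the pattern would be counted as an extra match.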

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Once again, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**A count of 7121 was found, which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was once again assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data. TallyEngine was modified to focus upon the "ORF" data; however, it appears that there are issues that are preventing it from doing so.
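The split between "ordered locus" and "ORF" gene names can also be checked directly against the XML by tallying the <code>type</code> attribute of each ''gene/name'' element. The following is a minimal sketch under stated assumptions: the namespace URI is the one assumed for UniProt XML, the sample document is a hypothetical stand-in for the real file, and <code>count_gene_name_types</code> is an illustrative helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed UniProt XML namespace.
NS = {"up": "http://uniprot.org/uniprot"}

# Hypothetical two-entry stand-in for the real UniProt XML export.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <gene>
      <name type="ordered locus">BceJ2315_0001</name>
      <name type="ORF">BCAL0001</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="primary">dnaA</name>
      <name type="ORF">BCAL0002</name>
    </gene>
  </entry>
</uniprot>"""


def count_gene_name_types(xml_text):
    """Tally gene/name elements by their 'type' attribute."""
    root = ET.fromstring(xml_text)
    names = root.findall(".//up:gene/up:name", NS)
    return Counter(name.get("type") for name in names)


counts = count_gene_name_types(SAMPLE_XML)
print(counts["ORF"], counts["ordered locus"])  # prints "2 1"
```

For the full file, <code>ET.parse</code> on the downloaded XML would replace <code>ET.fromstring</code>; the tallies per type could then be compared with the two SQL counts above.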

==OriginalRowCounts Comparison==
*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 3 export of J2315'''
**[[File:Build3OriginalRowCounts GEN BL14 20151203.png]]
*It was decided that a good reference or "benchmark" would be the database that was created using Build 2 of the customized GenMAPP builder; comparing the two should allow me to see if there was any difference in the imported data.
*Benchmark .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|compressed Bc-Std_Build2_GEN_BL14_20151201.gdb]]
*'''OriginalRowCounts for the Build 2 export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
Note: The OriginalRowCounts table in this export is mostly identical to the one from the Build 2 export; the difference lies in the OrderedLocusNames row. The Build 3 export contains 7121 rows in the OrderedLocusNames table (the same as the number of ORF gene names in the XML), while the Build 2 export contained 337 rows. The fact that the Build 3 export now shows 7121 entries in that table indicates that this modified GenMAPP Builder (Build 3) is now focusing on the ORF data; it appears, however, that it is labeling the "ORF" data, rather than the "ordered locus" data, as OrderedLocusNames. This observation does not completely mesh with what was found earlier in the PSQL database, where the "ordered locus" type still held the "ordered locus" gene names from the XML (and the "ORF" type held the 7121 gene names of interest). In conclusion, it seems that there are still issues with TallyEngine and GenMAPP Builder (such as TallyEngine not reporting the ORF data).
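A comparison of two OriginalRowCounts tables can be sketched as a small dictionary diff. Only the OrderedLocusNames values reported in this section are filled in; the remaining rows would have to be transcribed from the two screenshots, and the helper name <code>diff_row_counts</code> is hypothetical.

```python
def diff_row_counts(benchmark, current):
    """Return {table: (benchmark_rows, current_rows)} for tables whose counts differ.

    A table missing from one side is reported with None for that side.
    """
    tables = sorted(set(benchmark) | set(current))
    return {
        table: (benchmark.get(table), current.get(table))
        for table in tables
        if benchmark.get(table) != current.get(table)
    }


# Values taken from this report; all other tables are omitted here.
build2 = {"OrderedLocusNames": 337}
build3 = {"OrderedLocusNames": 7121}

print(diff_row_counts(build2, build3))  # prints {'OrderedLocusNames': (337, 7121)}
```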

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.
* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, as before, only gene names of the type "ordered locus" are represented (there are no signs of gene names that begin with something like "BCA"). The RefSeq table does not appear to have any problems. The OrderedLocusNames table now only contains gene names in the form of <code>p?BCA[L,M,S]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]</code>; it appears that the "ORF" data replaced the "ordered locus" gene names in this table (these IDs appear to be in the correct and common form).

Note: The modifications to GenMAPP builder appear to have changed some of the data within the tables of the gene database (ORF gene names replacing "ordered locus" gene names, with respect to the OrderedLocusNames table).

==.gdb Use in GenMAPP==

Note: To do.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB], [http://www.uniprot.org/uniprot/ UniProt KB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*7121 OrderedLocusNames were found within the exported gdb file. 6994 entries corresponding to protein encoding genes were found in [http://www.uniprot.org/uniprot/ UniProt KB], and 7114 coding sequences were found in the [http://beta.burkholderia.com/strain/show/146 MOD]. It is apparent that the count of 7121 (ORF data) is much closer to what is present in outside resources than the count of 337 ("ordered locus" data). The differences in count between UniProt and the gdb and MOD could be the result of the fact that UniProt only covers genes that code for protein (some of the coding sequences present in the MOD, or within the gdb, could be responsible for functional RNA, which is not covered by UniProt).

*Note: The exported database now seems more in line with what is to be expected of the genome of ''B. cenocepacia''; the current OrderedLocusNames counts (which actually represent ORF counts) are very close to the counts reported by the MOD and by UniProt.
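How close the exported count is to each outside resource can be made explicit with a few lines of arithmetic. The counts are the ones reported in this section; the variable names are illustrative only.

```python
# Counts as reported in this section.
gdb_count = 7121  # OrderedLocusNames rows in the Build 3 export
references = {
    "UniProt KB": 6994,      # protein-encoding entries
    "MOD": 7114,             # coding sequences
    "Build 2 export": 337,   # "ordered locus" names only
}

differences = {}
for label, count in references.items():
    differences[label] = gdb_count - count
    share = 100 * differences[label] / gdb_count
    print(f"{label}: differs by {differences[label]} ({share:.1f}% of the gdb count)")
```

The gap to the MOD is 7 and the gap to UniProt KB is 127, compared with 6784 for the earlier "ordered locus" count.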
----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}

GÉNialOMICS Gene Database Testing Report (Build 2 Export)

2015-12-08T04:40:16Z

Blitvak: additional fixes applied to original row counts section

==Export Information==

Version of GenMAPP Builder: [[Media:gmbuilder-genialomics-12012015-build-2.zip|GenMAPP Builder Custom, Build 2]]

Computer on which the export was run: Home Workstation

Postgres Database name: B.cenocepacia_J2315_20151201_BUILD2_genialomics

UniProt XML filename: [[Media:Uniprot-taxonomy-216591 GEN BL12 20151119.zip|uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml]]
* UniProt XML version: UniProt release 2015_11 - November 11, 2015
* UniProt XML download link: [http://www.uniprot.org/uniprot/?query=organism:216591 UniProtKB link for the complete proteome of J2315]
* Time taken to import: 3.72 minutes
** Note: No issues were found with the import of this file.

GO OBO-XML filename: [[Media:Go daily-termdb GEN BL12 20151119.zip|go_daily-termdb_GEN_BL12_20151119.obo-xml]]
* GO OBO-XML version (derived from the date modified on the file, itself): ''Date Modified: 11/19/2015 2:24 AM''
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Link from GO website]
* Time taken to import: 5.25 minutes
* Time taken to process: 3.91 minutes
** Note: No issues were found with the import of this file.

GOA filename: [[Media:31277.B cepacia GEN BL12 20151119.zip | 31277.B_cepacia_GEN_BL12_20151119.goa]]
* GOA version: ''Date Modified: 11/10/15, 1:47:00 PM'' (information sourced from FTP site)
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/31277.B_cepacia.goa FTP site file]
* Time taken to import: 0.04 Minutes
** Note: No issues were found with the import of this file.

Name of .gdb file: [[Media:Bc-Std Build2 GEN BL14 20151201.zip|Bc-Std_Build2_GEN_BL14_20151201.gdb]]
* Time taken to export: 4 hours 22 minutes
** Start time: 10:27 pm
** End time: 2:49 am
** Note: The file was exported without any major issues; however, the export appeared to take even longer than the one conducted for the [[GÉNialOMICS Gene Database Testing Report (Initial Export) | initial export]]. This export took almost 2 more hours than the initial export, and it is suspected that this difference is due to the fact that work was being done on the computer while the export was taking place.

==Using TallyEngine==
*PostgreSQL was initialized through pgAdmin III and the database B.cenocepacia_J2315_20151201_BUILD2_genialomics was left running
*GenMAPP builder was booted and ''Run XML and Database Tallies for UniProt and GO'' was selected under the ''Tallies'' menu item; the UniProt XML and GO files that were imported were chosen
*'''Results of TallyEngine:'''
*[[File:Tallyengine results GEN BL14 20151201.png]]
**Note: These results are identical to what was found through the initial, dry, export of the gene database. This isn't all too surprising since the only difference in build 2 is that there exists a customized species profile for J2315. The data that GenMAPP builder collects from the XML is still the same since no major coding modifications were done on the program. As outlined in the [[Blitvak Week 14|Week 14 assignment]], future versions of GenMAPP builder should collect the data related to the "ORF" gene name data, rather than the "ordered locus" gene name data that is collected by default. It is apparent, through analysis of the XML file and through Match commands, that the XML only contains "ORF" names for most of its genes (see [[Blitvak Week 14|Week 14 assignment]], and the [[GÉNialOMICS Gene Database Testing Report (Initial Export)|initial export testing report]]).

==Using XMLPipeDB match to Validate the XML Results from the TallyEngine==
*The Windows command line was launched (cmd.exe)
*This set of commands was inputted into the command line in order to utilize XMLPipeDB match to verify the OrderedLocusNames count:
*<code>java -jar xmlpipedb-match-1.1.1.jar "p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?" < "uniprot-taxonomy%3A216591_GEN_BL12_20151119.xml"</code>
**NOTE: Prior to executing the command, the folder that held the files and xmlpipedb-match-1.1.1.jar was entered through the Windows command line (a set of CD commands was used in order to enter the correct directory)
*[[File:XmlpipedbmatchOUTPUT GEN BL14 20151201.png]]
*7126 unique matches were found through XMLPipeDB match

Are your results the same as you got for the TallyEngine? Why or why not?
*These results are very different from what was found through TallyEngine because the TallyEngine results, as mentioned in the [[Blitvak Week 14|Week 14 assignment]], only represent the gene name data in the XML that is tagged as being "ordered locus". The match command was found to reflect the data that is found under the "ORF" tag; the ORF and ordered locus counts are very different from each other, and this is reflected in the difference between TallyEngine and XMLPipeDB match with respect to the gene name counts.
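One detail of the pattern used with Match is worth illustrating: inside a character class, a comma or space is a literal member of the class, so <code>[L,S,M]</code> also accepts a comma and <code>[A-Z, a-z]</code> also accepts a comma or a space. A minimal sketch with Python's <code>re</code> module, assuming it treats these simple classes the same way as the regex engine behind Match (the test strings are hypothetical):

```python
import re

# Pattern as typed in the command above (commas and a space inside the classes).
with_commas = re.compile(r"p?BCA[L,S,M]?[0-9][0-9][0-9][A,a]?[0-9]?[A-Z, a-z]?")
# Same pattern with the separators removed from the classes.
without_commas = re.compile(r"p?BCA[LSM]?[0-9][0-9][0-9][Aa]?[0-9]?[A-Za-z]?")

# A comma is accepted where a letter L, S, or M was intended.
print(bool(with_commas.fullmatch("BCA,123")))     # True
print(bool(without_commas.fullmatch("BCA,123")))  # False

# A trailing space is swallowed into the match by the last class.
print(repr(with_commas.search("BCAL0001 next").group()))     # 'BCAL0001 '
print(repr(without_commas.search("BCAL0001 next").group()))  # 'BCAL0001'
```

With the comma-separated form, the same ID followed by a space, a comma, or a tag could be reported as different matched strings, which is one possible way for a unique-match count to drift from the number of real IDs.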

==Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==
*''pgAdmin III'' was booted and all of the necessary connections were made
*It was realized that the ''gene/name'' tags in the XML file end up in the ''genenametype'' table (source: [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways |the wiki page regarding database quality analysis]])
*In ''pgAdmin III'', the query <code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'BceJ2315_[0-9][0-9][0-9][0-9]';</code> was issued via the SQL Query menu in order to validate the TallyEngine count for "orderedlocusnames" for the PSQL database.
**337 unique matches were found in ''pgAdmin III'' (postgres database results). This lines up with what was found in TallyEngine.
*Additionally, the query <code>select count(*) from genenametype where type = 'ORF' and value ~ 'p?BCA[A-Z]?[0-9][0-9][0-9][A-Z]?[0-9]?[A-Z, a-z]?';</code> was run via SQL in order to verify the ORF counts (compared to the results that were found using XMLPipeDB match, see [[Blitvak Week 14|the week 14 assignment]]).
**A count of 7121 was found, which lines up with what was found through XMLPipeDB match and through an analysis of the XML file
*At this point, it was assumed that the data in the genenametype table of the PSQL database is identical to what was within the same table in the initial export PSQL database.
*Are your results the same as reported by the TallyEngine? Why or why not?
**The "ordered locus" results are the same as what was reported by TallyEngine since both are focusing on the same set of data.
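The <code>~</code> operator in PostgreSQL succeeds when the pattern matches anywhere inside the value, so these queries count rows that contain an ID-like substring rather than rows that consist of exactly one ID. The difference can be sketched in Python, using <code>re.search</code> as a stand-in for the unanchored behaviour and <code>re.fullmatch</code> for an anchored (<code>^...$</code>) version; the sample values are hypothetical.

```python
import re

ORDERED_LOCUS = r"BceJ2315_[0-9][0-9][0-9][0-9]"

# Hypothetical values standing in for rows of the genenametype table.
values = ["BceJ2315_0001", "BceJ2315_00010", "xBceJ2315_0001", "dnaA"]

# Unanchored: counts any value that contains the pattern somewhere.
unanchored = sum(1 for v in values if re.search(ORDERED_LOCUS, v))
# Anchored: counts only values that match the pattern from start to end.
anchored = sum(1 for v in values if re.fullmatch(ORDERED_LOCUS, v))

print(unanchored, anchored)  # prints "3 1"
```

In SQL, the stricter variant would wrap the pattern in <code>^</code> and <code>$</code>. Whether that changes the counts for this database depends on the actual values in the table.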

==OriginalRowCounts Comparison==

*The newly created J2315 .gdb file (Bc-Std_GEN_BL12_20151201.gdb) was opened with a program that is able to explore a .mdb file (such as Microsoft Access); in this case, [http://www.alexnolan.net/software/mdb_viewer_plus.htm MDB Viewer Plus] was utilized.
*Using the program, the OriginalRowCounts table was looked at, which contained summaries regarding each of the tables within the database (and the # of rows/entries in each of the tables)
*'''OriginalRowCounts for Build 2 Export of J2315'''
**[[File:Build2ExportOriginalRowCounts GEN BL14 20151201.png]]
*It was decided that a good reference or "benchmark" would be the database that was created for the initial, dry, export of the gene data related to J2315; comparing the two should bring to light any differences that are the result of the export.
*Benchmark .gdb file: [[Media:Bc-Std GEN BL12 20151119.zip |compressed Bc-Std_GEN_BL12_20151119.gdb]]
*'''OriginalRowCounts for Initial Export of J2315'''
**[[File:OriginalRowCounts(initial_export)_GEN_BL12_20151123.png]]

Note: It was noticed that the OriginalRowCounts table in this export is identical to the one found in the [[GÉNialOMICS Gene Database Testing Report (Initial Export)|initial export]]; this is not surprising since the utilized GenMAPP builder is mostly identical to one that was utilized in the initial export.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
**Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
**In the UniProt table, it appears that the ID and EntryName columns contain the correct ID form for J2315. The GeneName column in UniProt, however, appears to be missing most of its entries. No gene names in the basic form of '''BCA[S,L,M]####''' or '''BceJ2315_#####''' can be found. Very few gene names are present, and those present are either four letters long with the final letter capitalized or three lowercase letters long.
**The RefSeq table appears to be in order.
**The OrderedLocusNames table, as suggested by earlier analysis, only contains 337 rows, with IDs in the format of '''BceJ2315_#####'''.

Note: It was noticed that the UniProt table only contained gene names that are either of "ordered locus" type or in the format of four letters.

==.gdb Use in GenMAPP==

Note: To do.

==Compare Gene Database to Outside Resource==
Outside Resource: [http://beta.burkholderia.com/ Burkholderia Genome DB]
*The strain page for J2315 was looked up: [http://beta.burkholderia.com/strain/show/146]
*Only 337 OrderedLocusNames IDs were found in the exported database; 7384 annotated genes, however, are present in the MOD
**[[File:J2315MODGENES_GEN_BL12_20151123.png]]

Note: Many more OrderedLocusNames IDs should be present in the exported database than were found. Data on the MOD and the executed Match queries help confirm this. The current number of OrderedLocusNames (337) is very far from the numbers that were seen in the MOD (7384 annotated genes, with 7114 involved with the coding of protein). It is now known that the count is so low because GenMAPP Builder is currently programmed to only pick up "ordered locus" data from the XML; most of the gene names reside as "ORF" data, which explains why most of the data is not present in the export.

----
{{Template:GÉNialOMICS}}
{{Template:blitvak}}