LMU BioDB 2015 - User contributions [en]

OTS Deliverables

2015-12-19T00:23:33Z

Troque: /* Individual Reflections */ Linking pdf

{{Template:Oregon Trail Survivors}}
==OTS Group Files and Datasets==

[[Media:GenMAPP Builder 12 14 2015 Number 2.zip | Gene Database .gdb]]

[[Media:ReadMe Sf-Std External 20151214.pdf | ReadMe]]

[[Media:ShigellaGeneDatabaseSchema.pdf | Gene Database Schema]]

[[Media:Gene Database Testing Report for Shigella flexneri 2a str 301.pdf | Gene Database Testing Report (.pdf)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.xlsx | Compiled Raw Microarray Dataset (.xlsx)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.txt | Data Used for Import into GenMAPP (.txt)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210 (1).gex | GenMAPP Expression Dataset File (.gex)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.EX.txt | Exceptions file (.EX.txt)]]

[[Media:Criterion.GOfiles.zip | Raw MAPPFinder results files (-GO.txt)]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gmf | .gmf file]]

[[Media:Filtered MAPPFinder Results.xlsx | Filtered MAPPFinder Results .xlsx]]

[[Media:MAPPFinderResults.zip | Filtered MAPPFinder Results (common GO terms highlighted) .png]]

[[Media:RPRX MAPPs.zip | .zip of .mapp files of relevant genes]]

[[Media:OTSDeliverables.docx | Final Group Report .docx]]

[[Media:FinalOTSPresentation.pptx | Final PowerPoint Presentation]]

==Individual Reflections==

[[Kzebrows Individual Reflection | Kristin Zebrowski]]

[[Eyanosch Individual Reflection | Erich Yanoschik]]

[[Jwoodlee Individual Reflection | Jake Woodlee]]

[[Media:Final Project Reflection OTS TR 20151218.pdf | Trixie Roque]]

File:Final Project Reflection OTS TR 20151218.pdf

2015-12-19T00:23:14Z

Troque: Uploading pdf form

Uploading pdf form

OTS Deliverables

2015-12-19T00:22:09Z

Troque: /* Individual Reflections */ Linking word document

{{Template:Oregon Trail Survivors}}
==OTS Group Files and Datasets==

[[Media:GenMAPP Builder 12 14 2015 Number 2.zip | Gene Database .gdb]]

[[Media:ReadMe Sf-Std External 20151214.pdf | ReadMe]]

[[Media:ShigellaGeneDatabaseSchema.pdf | Gene Database Schema]]

[[Media:Gene Database Testing Report for Shigella flexneri 2a str 301.pdf | Gene Database Testing Report (.pdf)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.xlsx | Compiled Raw Microarray Dataset (.xlsx)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.txt | Data Used for Import into GenMAPP (.txt)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210 (1).gex | GenMAPP Expression Dataset File (.gex)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.EX.txt | Exceptions file (.EX.txt)]]

[[Media:Criterion.GOfiles.zip | Raw MAPPFinder results files (-GO.txt)]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gmf | .gmf file]]

[[Media:Filtered MAPPFinder Results.xlsx | Filtered MAPPFinder Results .xlsx]]

[[Media:MAPPFinderResults.zip | Filtered MAPPFinder Results (common GO terms highlighted) .png]]

[[Media:RPRX MAPPs.zip | .zip of .mapp files of relevant genes]]

[[Media:OTSDeliverables.docx | Final Group Report .docx]]

[[Media:FinalOTSPresentation.pptx | Final PowerPoint Presentation]]

==Individual Reflections==

[[Kzebrows Individual Reflection | Kristin Zebrowski]]

[[Eyanosch Individual Reflection | Erich Yanoschik]]

[[Jwoodlee Individual Reflection | Jake Woodlee]]

[[Media:Final Project Reflection OTS TR 20151218.docx | Trixie Roque]]

File:Final Project Reflection OTS TR 20151218.docx

2015-12-19T00:21:39Z

Troque: Uploading final stuff :)

Uploading final stuff :)

Gene Database Project Deliverables

2015-12-18T23:22:15Z

Troque: /* Individual Assessment and Reflection */ Removed accidentally added entry

{{Gene Database Project Links}}

== Group Report ==

These guidelines are based on the [https://peerj.com/about/author-instructions/ Instructions for Authors] issued by the [https://peerj.com/computer-science/ PeerJ Computer Science] journal. We have made this choice so that, if a group report is considered to be of sufficient quality, we can pursue publication of this report in ''PeerJ Computer Science'' as smoothly as possible. If there are formatting or detail questions that are not covered here, visit the [https://peerj.com/about/author-instructions/ Instructions for Authors] and follow their guidance.

* The report should be written with contributions from all group members.
* Submit the report as a ''.doc'', ''.docx'', or ''.pdf'' file.

=== Style Sheet ===

Use the following guidelines when formatting your report:
* 2.54 cm (1 in) margins on all sides
* Double-spaced
* 12 point Times/Times New Roman font
* Number the pages on the lower-right corner
* Use left justification (“jagged” on the right side)

=== Cover Page ===

Include the following information in a standalone cover page:
* A descriptive title for your project
** The function of the title is to identify the main result or take-home message of the paper. It should be as specific as possible and name the organism. It can be a phrase or a sentence. What is the main result of your paper that you want to convey with the title?
* The names of the team members (with middle initials)
* The course number and title of the class
* The date of submission

=== Abstract ===

Provide an abstract of no more than 500 words.

=== Introduction ===

The introduction gives the background information necessary to understand your report. The introduction should be in the form of a logical argument that “funnels” from broad to narrow:

<gallery mode="nolines" widths=322px heights=256px>
Funnel.png
</gallery>

* States importance of the problem
** Why is this species important?
* States what is known about the problem
** Give an overview of what is known about your species' genome from your [[Week 11|journal club outline and presentation]].
** Introduce the DNA microarray experiment that was performed on your species from your [[Week 11|journal club outline and presentation]].
* States what is unknown about the problem
** You want to analyze the data with GenMAPP/MAPPFinder, but can't because there is no Gene Database for your species.
* States clues that suggest how to approach the unknown
** Introduce XMLPipeDB and GenMAPP Builder as the answer to this problem.
* States the question the paper is trying to address
** In this case you want to discover new information about the microarray data using GenMAPP.

=== Materials & Methods ===

This section will summarize the entire workflow for the project. This needs to be a '''''narrative description''''' of what your team actually did, but not a step-by-step protocol. We are following the standards of reproducible research such that someone else with the appropriate expertise could reproduce what you did given the information in your Materials and Methods section. You can consider your audience to be the fellow members of your class.
# Download the UniProt XML proteome set and GOA (GO association) files for your species.
#* Note the date of download and the version of the files.
# Download GO terms in the OBO-XML format.
#* Note the date of download and the version of the files.
# Create the GenMAPP Builder tables in PostgreSQL.
# Load files into PostgreSQL database via GenMAPP Builder.
# Export into a GenMAPP Gene Database.
# Inspect/vet/validate Gene Database.
# Prepare microarray data (organize, normalize, perform statistical analysis)
# Run GenMAPP using the Gene Database.
#* Microarray data (import using Expression Dataset Manager)
#* Run MAPPFinder analysis
#* Place genes on MAPP and draw pathway
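
The instruction to note the date of download and the version of each file can be supported by a small provenance record. The following is a minimal sketch, not part of the XMLPipeDB or GenMAPP Builder tooling: the function name <code>provenance_record</code>, the sample file name, and the version string are hypothetical, and the SHA-256 checksum is an extra field added here so that a reader can confirm they have the same file.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def provenance_record(path, version, downloaded_on):
    """Describe one downloaded source file for the Materials & Methods section."""
    data = Path(path).read_bytes()
    return {
        "file": Path(path).name,
        "version": version,  # as stated by the data provider
        "downloaded_on": downloaded_on.isoformat(),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


if __name__ == "__main__":
    # Demonstration with a throwaway file; the name and version are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "example_proteome.xml"
        sample.write_text("<uniprot></uniprot>", encoding="utf-8")
        record = provenance_record(sample, "hypothetical-release", date(2015, 12, 1))
        print(json.dumps(record, indent=2))
```

One record per source file (UniProt XML, GOA, GO OBO-XML) gives the reader everything needed to locate the same inputs.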

=== Results ===

This section will summarize the results of the project. This section will include figures, tables, and a '''''narrative description''''' of the results shown in those figures and tables. You should:
* Number each of the figures sequentially and number each of the tables sequentially in order from first mention in the text. You can either embed your figures and tables in the appropriate place in the text or put them all at the end. Do not mix both styles, however.
* Write a descriptive legend for each figure and table that briefly states what the figure/table is and gives a brief key to any labels and abbreviations.
* Gene Database Schema figure
* Gene Database Testing Report on final version of Gene Database (can be put at the end of the report as an Appendix)
* A table that summarizes how many OrderedLocusNames IDs were found
** by XMLPipeDB match in the UniProt XML file
** by TallyEngine in the UniProt XML file
** by TallyEngine in the PostgreSQL database
** in the OriginalRowCounts table in the gdb
** in your external model organism database source
* Give the command used in match to generate these results
* Give the query used in PGAdmin III to generate these results
* Include a screenshot of the TallyEngine results as a figure
* Report on quantity and identity of gene IDs that did not make it into the database
*# OrderedLocusNames IDs that were not in the XML source at all
*# OrderedLocusNames IDs that were in the XML source but did not get imported into Postgres
*# OrderedLocusNames IDs that were in Postgres but did not get exported to the GenMAPP Gene Database
* Report on what changes were made to the GenMAPP Builder code in order to accommodate the second and third type of missing gene IDs and the result of those changes
* Report results of the DNA microarray analysis
** Include a table that shows the results of your "Sanity Check", i.e., how many genes were significantly increased and decreased at different p value cut-offs in the dataset?
** Include the criteria you used for a significant increase and decrease in expression for your GenMAPP Expression Dataset
** Table of filtered MAPPFinder results (from ''.xls'' or ''.xlsx'')
*** Show a list of 15-20 non-redundant GO terms.
*** Include in your table the GO ID, the name of the GO term, the number changed/number present and the percent (e.g., 10/20 (50%)), the number present/number in GO and the percent, the regular p value and adjusted p value.
*** Write a paragraph interpreting the GO results in light of the experiment performed in the published paper.
** GenMAPP MAPP of a pathway relevant to your results
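
The "Sanity Check" table described above can be tallied with a few lines of code. This is a minimal sketch under stated assumptions: the cut-off values, the tuple layout (gene ID, log fold change, p value), and the function name <code>sanity_check</code> are illustrative choices, and the example rows are made up rather than taken from any dataset.

```python
def sanity_check(rows, cutoffs=(0.05, 0.01, 0.001, 0.0001)):
    """Count genes with p < cutoff, split by the sign of the log fold change.

    rows: iterable of (gene_id, log_fold_change, p_value) tuples.
    Returns a list of (cutoff, n_increased, n_decreased) tuples.
    """
    rows = list(rows)
    table = []
    for cutoff in cutoffs:
        significant = [r for r in rows if r[2] < cutoff]
        increased = sum(1 for r in significant if r[1] > 0)
        decreased = sum(1 for r in significant if r[1] < 0)
        table.append((cutoff, increased, decreased))
    return table


# Made-up values for illustration only; not data from any experiment.
example = [
    ("gene_a", 1.8, 0.0004),
    ("gene_b", -2.1, 0.03),
    ("gene_c", 0.2, 0.40),
    ("gene_d", -0.9, 0.008),
    ("gene_e", 1.1, 0.04),
]

for cutoff, up, down in sanity_check(example):
    print(f"p < {cutoff}: {up} increased, {down} decreased")
```

The same function can be run once on unadjusted p values and once on adjusted p values to fill both columns of the table.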

=== Discussion ===

* How well did the GenMAPP Builder process work for your species? (Comment only on the technical aspects here; you will discuss the teamwork/process aspects in your individual assessment.)
* Discuss the statistical analysis and MAPPFinder results for your microarray dataset. Compare them to what was reported in the original paper from which you got the microarray data.
** In particular, compare directly the log fold change value of a couple of key genes mentioned in the paper with what you found for those genes.
** Compare the criteria the journal article used for a significant expression change to the criteria that you used. How many genes met the criterion for the article vs. how many met the criterion for your analysis?

=== Conclusions ===

Write a concluding paragraph that summarizes the overall project and your findings.
* How closely do your findings correspond to the original study?
* Are there significant differences?
* Did you discover anything new?
* What future directions would you take if you were to continue this project?

=== Acknowledgments ===

Write a short paragraph acknowledging the assistance of anyone who is not a member of your team.

=== References ===

* This section lists all of the references cited in the text of the report (and only those references cited in the paper). Follow the [[Media:BIOL367_Fall2015_GuidelinesforLiteratureCitations.pdf | Guidelines for Literature Citations in a Scientific Paper]] handout for general principles.
* Remember that you need to cite anything for which you are not the original source. Generally, in the introduction, you should aim for a minimum of two in-text citations per paragraph. You may reference the course web site using the appropriate format for a web reference.
* List your references in alphabetical order by first author using [https://peerj.com/about/author-instructions/#reference-format PeerJ’s recommended reference format]. This format is very similar to APA style and should feel familiar if you have written research papers before.
* To minimize busy work, the PeerJ website includes links to downloadable style files for [https://www.zotero.org/styles/?q=peerj Zotero] and [http://endnote.com/downloads/style/peerj EndNote], if you use either system for managing and rendering references.

== PowerPoint Presentation ==

Each team of students will prepare and give a 20-minute PowerPoint presentation to report the results of their project on Tuesday, December 15 at 2:00-4:00 PM.
* Please follow the [[Media:PresentationGuidelines.ppt | Presentation Guidelines]] for how to format your slides.
* You will need to prepare ~20 slides (assume 1 slide per minute of presentation) and include the following content:
# Background on your species and your species' genome from the genome paper presentation.
# The results of the Gene Database creation
#* Gene Database Schema figure
#* A table that summarizes how many OrderedLocusNames IDs were found
#** by XMLPipeDB match in the UniProt XML file
#** by TallyEngine in the UniProt XML file
#** by TallyEngine in the PostgreSQL database
#** in the OriginalRowCounts table in the gdb
#** in your external model organism database source
#* Give the command used in match to generate these results
#* Give the query used in PGAdmin III to generate these results
#* Include a screenshot of the TallyEngine results as a figure
#* Report on quantity and identity of gene IDs that did not make it into the database
#*# OrderedLocusNames IDs that were not in the XML source at all
#*# OrderedLocusNames IDs that were in the XML source but did not get imported into Postgres
#*# OrderedLocusNames IDs that were in Postgres but did not get exported to the GenMAPP Gene Database
#* Report on what changes were made to the GenMAPP Builder code in order to accommodate the second and third type of missing gene IDs and the result of those changes
# Introduce the experiment performed in the microarray paper, including the experimental design flow chart
# Report results of the DNA microarray analysis
#* Include a table that shows the results of your "Sanity Check", i.e., how many genes were significantly increased and decreased at different p value cut-offs in the dataset?
#* Include the criteria you used for a significant increase and decrease in expression for your GenMAPP Expression Dataset
#* Table of filtered MAPPFinder results (from ''.xls'' or ''.xlsx'')
#** Show a list of 15-20 non-redundant GO terms.
#** Include in your table the GO ID, the name of the GO term, the number changed/number present and the percent (e.g., 10/20 (50%)), the number present/number in GO and the percent, the regular p value and adjusted p value.
#* GenMAPP MAPP of a pathway relevant to your results
* '''''Your PowerPoint slides must be uploaded to the wiki and linked to from your individual journal page and your team page by midnight, Tuesday, December 15.'''''
** You can update your slides before your presentation, but we will be grading the ones you upload by the deadline.
* Your presentation (both the slides and the oral presentation) will be evaluated by the instructors using the [[Presentation Rubric]].
* Your presentation will also be evaluated by your fellow classmates (anonymously) who will answer the following questions:
*# What is the speaker's take-home message (one short sentence)?
*# What are the best points about the presentation's organization, visuals, and delivery? Please give at least 2 specific examples.
*# What points need improvement? Please give at least 2 specific examples.
* We expect that you will take the feedback from your previous presentation into account when doing this presentation.

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:GenMAPP_schema_generic_bacteria_20151210.zip | GenMAPP_schema_generic_bacteria_20151210.zip]] 
*** Sample schema in jpeg format: [[Media:GenMAPP_schema_generic_bacteria_20151210.jpg | GenMAPP_schema_generic_bacteria_20151210.jpg]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

== Individual Assessment and Reflection ==

Each person on the team will complete an assessment and reflection ''individually''. If you are comfortable with making this assessment publicly available, you may write it up as a wiki page or as a Word document uploaded to your group deliverables page. If you prefer to communicate your assessment privately, then email it to both Drs. Dahlquist and Dionisio.

=== Statement of Work ===

* Describe exactly what you did on the project.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)?
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Gene Database Project Links}}

[[Category:Group Projects]]

Gene Database Project Deliverables

2015-12-18T23:21:27Z

Troque: /* Statement of Work */ Added some entry

{{Gene Database Project Links}}

== Group Report ==

These guidelines are based on the [https://peerj.com/about/author-instructions/ Instructions for Authors] issued by the [https://peerj.com/computer-science/ PeerJ Computer Science] journal. We have made this choice so that, if a group report is considered to be of sufficient quality, we can pursue publication of this report in ''PeerJ Computer Science'' as smoothly as possible. If there are formatting or detail questions that are not covered here, visit the [https://peerj.com/about/author-instructions/ Instructions for Authors] and follow their guidance.

* The report should be written with contributions from all group members.
* Submit the report as a ''.doc'', ''.docx'', or ''.pdf'' file.

=== Style Sheet ===

Use the following guidelines when formatting your report:
* 2.54 cm (1 in) margins on all sides
* Double-spaced
* 12 point Times/Times New Roman font
* Number the pages on the lower-right corner
* Use left justification (“jagged” on the right side)

=== Cover Page ===

Include the following information in a standalone cover page:
* A descriptive title for your project
** The function of the title is to identify the main result or take-home message of the paper. It should be as specific as possible and name the organism. It can be a phrase or a sentence. What is the main result of your paper that you want to convey with the title?
* The names of the team members (with middle initials)
* The course number and title of the class
* The date of submission

=== Abstract ===

Provide an abstract of no more than 500 words.

=== Introduction ===

The introduction gives the background information necessary to understand your report. The introduction should be in the form of a logical argument that “funnels” from broad to narrow:

<gallery mode="nolines" widths=322px heights=256px>
Funnel.png
</gallery>

* States importance of the problem
** Why is this species important?
* States what is known about the problem
** Give an overview of what is known about your species' genome from your [[Week 11|journal club outline and presentation]].
** Introduce the DNA microarray experiment that was performed on your species from your [[Week 11|journal club outline and presentation]].
* States what is unknown about the problem
** You want to analyze the data with GenMAPP/MAPPFinder, but can't because there is no Gene Database for your species.
* States clues that suggest how to approach the unknown
** Introduce XMLPipeDB and GenMAPP Builder as the answer to this problem.
* States the question the paper is trying to address
** In this case you want to discover new information about the microarray data using GenMAPP.

=== Materials & Methods ===

This section will summarize the entire workflow for the project. This needs to be a '''''narrative description''''' of what your team actually did, but not a step-by-step protocol. We are following the standards of reproducible research such that someone else with the appropriate expertise could reproduce what you did given the information in your Materials and Methods section. You can consider your audience to be the fellow members of your class.
# Download the UniProt XML proteome set and GOA (GO association) files for your species.
#* Note the date of download and the version of the files.
# Download GO terms in the OBO-XML format.
#* Note the date of download and the version of the files.
# Create the GenMAPP Builder tables in PostgreSQL.
# Load files into PostgreSQL database via GenMAPP Builder.
# Export into a GenMAPP Gene Database.
# Inspect/vet/validate Gene Database.
# Prepare microarray data (organize, normalize, perform statistical analysis)
# Run GenMAPP using the Gene Database.
#* Microarray data (import using Expression Dataset Manager)
#* Run MAPPFinder analysis
#* Place genes on MAPP and draw pathway
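
When inspecting and validating the Gene Database, it can help to have an independent count of the ordered locus names present in the source XML. The sketch below is not XMLPipeDB match or TallyEngine; it is a separate cross-check written with Python's standard library. It assumes that the UniProt XML uses the namespace <code>http://uniprot.org/uniprot</code> and marks these IDs as <code>name</code> elements with <code>type="ordered locus"</code> inside <code>gene</code>; verify both against your downloaded file. The sample entries and IDs are invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the UniProt XML format; check the root element of your file.
UNIPROT_NS = {"up": "http://uniprot.org/uniprot"}

# Tiny invented sample standing in for a real proteome file.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <gene>
      <name type="primary">abcA</name>
      <name type="ordered locus">XX0001</name>
    </gene>
  </entry>
  <entry>
    <accession>X00002</accession>
    <gene>
      <name type="ordered locus">XX0002</name>
      <name type="ordered locus">XX0003</name>
    </gene>
  </entry>
  <entry>
    <accession>X00003</accession>
    <gene>
      <name type="ORF">XX_orf1</name>
    </gene>
  </entry>
</uniprot>"""


def ordered_locus_names(xml_text):
    """Return the set of unique ordered locus names found in the XML text."""
    root = ET.fromstring(xml_text)
    names = set()
    path = ".//up:gene/up:name[@type='ordered locus']"
    for name in root.iterfind(path, UNIPROT_NS):
        if name.text:
            names.add(name.text.strip())
    return names


ids = ordered_locus_names(SAMPLE_XML)
print(len(ids), sorted(ids))
```

For a full proteome file, streaming with <code>ET.iterparse</code> would use less memory than parsing the whole text at once.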

=== Results ===

This section will summarize the results of the project. This section will include figures, tables, and a '''''narrative description''''' of the results shown in those figures and tables. You should:
* Number each of the figures sequentially and number each of the tables sequentially in order from first mention in the text. You can either embed your figures and tables in the appropriate place in the text or put them all at the end. Do not mix both styles, however.
* Write a descriptive legend for each figure and table that briefly states what the figure/table is and gives a brief key to any labels and abbreviations.
* Gene Database Schema figure
* Gene Database Testing Report on final version of Gene Database (can be put at the end of the report as an Appendix)
* A table that summarizes how many OrderedLocusNames IDs were found
** by XMLPipeDB match in the UniProt XML file
** by TallyEngine in the UniProt XML file
** by TallyEngine in the PostgreSQL database
** in the OriginalRowCounts table in the gdb
** in your external model organism database source
* Give the command used in match to generate these results
* Give the query used in PGAdmin III to generate these results
* Include a screenshot of the TallyEngine results as a figure
* Report on quantity and identity of gene IDs that did not make it into the database
*# OrderedLocusNames IDs that were not in the XML source at all
*# OrderedLocusNames IDs that were in the XML source but did not get imported into Postgres
*# OrderedLocusNames IDs that were in Postgres but did not get exported to the GenMAPP Gene Database
* Report on what changes were made to the GenMAPP Builder code in order to accommodate the second and third type of missing gene IDs and the result of those changes
* Report results of the DNA microarray analysis
** Include a table that shows the results of your "Sanity Check", i.e., how many genes were significantly increased and decreased at different p value cut-offs in the dataset?
** Include the criteria you used for a significant increase and decrease in expression for your GenMAPP Expression Dataset
** Table of filtered MAPPFinder results (from ''.xls'' or ''.xlsx'')
*** Show a list of 15-20 non-redundant GO terms.
*** Include in your table the GO ID, the name of the GO term, the number changed/number present and the percent (e.g., 10/20 (50%)), the number present/number in GO and the percent, the regular p value and adjusted p value.
*** Write a paragraph interpreting the GO results in light of the experiment performed in the published paper.
** GenMAPP MAPP of a pathway relevant to your results
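
The three kinds of missing gene IDs listed above are set differences between successive stages of the pipeline. The following is a minimal sketch, assuming the ID lists from each stage have already been exported to plain Python collections; the function name <code>classify_missing</code>, the dictionary keys, and the example IDs are hypothetical.

```python
def classify_missing(reference_ids, xml_ids, postgres_ids, gdb_ids):
    """Sort missing OrderedLocusNames IDs into the three reporting categories.

    reference_ids: IDs from the external model organism database source.
    xml_ids:       IDs found in the UniProt XML file.
    postgres_ids:  IDs found in the PostgreSQL database.
    gdb_ids:       IDs found in the exported GenMAPP Gene Database.
    """
    reference_ids = set(reference_ids)
    xml_ids = set(xml_ids)
    postgres_ids = set(postgres_ids)
    gdb_ids = set(gdb_ids)
    return {
        # 1. Not in the XML source at all
        "not_in_xml": reference_ids - xml_ids,
        # 2. In the XML source but not imported into Postgres
        "in_xml_not_in_postgres": xml_ids - postgres_ids,
        # 3. In Postgres but not exported to the Gene Database
        "in_postgres_not_in_gdb": postgres_ids - gdb_ids,
    }


# Invented IDs for illustration only.
result = classify_missing(
    reference_ids=["XX0001", "XX0002", "XX0003", "XX0004"],
    xml_ids=["XX0001", "XX0002", "XX0003"],
    postgres_ids=["XX0001", "XX0002"],
    gdb_ids=["XX0001"],
)
for category, ids in result.items():
    print(f"{category}: {len(ids)} {sorted(ids)}")
```

Reporting both the count and the sorted list for each category covers the "quantity and identity" requirement.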

=== Discussion ===

* How well did the GenMAPP Builder process work for your species? (Comment only on the technical aspects here; you will discuss the teamwork/process aspects in your individual assessment.)
* Discuss the statistical analysis and MAPPFinder results for your microarray dataset. Compare them to what was reported in the original paper from which you got the microarray data.
** In particular, compare directly the log fold change value of a couple of key genes mentioned in the paper with what you found for those genes.
** Compare the criteria the journal article used for a significant expression change to the criteria that you used. How many genes met the criterion for the article vs. how many met the criterion for your analysis?

=== Conclusions ===

Write a concluding paragraph that summarizes the overall project and your findings.
* How closely do your findings correspond to the original study?
* Are there significant differences?
* Did you discover anything new?
* What future directions would you take if you were to continue this project?

=== Acknowledgments ===

Write a short paragraph acknowledging the assistance of anyone who is not a member of your team.

=== References ===

* This section lists all of the references cited in the text of the report (and only those references cited in the paper). Follow the [[Media:BIOL367_Fall2015_GuidelinesforLiteratureCitations.pdf | Guidelines for Literature Citations in a Scientific Paper]] handout for general principles.
* Remember that you need to cite anything for which you are not the original source. Generally, in the introduction, you should aim for a minimum of two in-text citations per paragraph. You may reference the course web site using the appropriate format for a web reference.
* List your references in alphabetical order by first author using [https://peerj.com/about/author-instructions/#reference-format PeerJ’s recommended reference format]. This format is very similar to APA style and should feel familiar if you have written research papers before.
* To minimize busy work, the PeerJ website includes links to downloadable style files for [https://www.zotero.org/styles/?q=peerj Zotero] and [http://endnote.com/downloads/style/peerj EndNote], if you use either system for managing and rendering references.

== PowerPoint Presentation ==

Each team of students will prepare and give a 20-minute PowerPoint presentation to report the results of their project on Tuesday, December 15 at 2:00-4:00 PM.
* Please follow the [[Media:PresentationGuidelines.ppt | Presentation Guidelines]] for how to format your slides.
* You will need to prepare ~20 slides (assume 1 slide per minute of presentation) and include the following content:
# Background on your species and your species' genome from the genome paper presentation.
# The results of the Gene Database creation
#* Gene Database Schema figure
#* A table that summarizes how many OrderedLocusNames IDs were found
#** by XMLPipeDB match in the UniProt XML file
#** by TallyEngine in the UniProt XML file
#** by TallyEngine in the PostgreSQL database
#** in the OriginalRowCounts table in the gdb
#** in your external model organism database source
#* Give the command used in match to generate these results
#* Give the query used in PGAdmin III to generate these results
#* Include a screenshot of the TallyEngine results as a figure
#* Report on quantity and identity of gene IDs that did not make it into the database
#*# OrderedLocusNames IDs that were not in the XML source at all
#*# OrderedLocusNames IDs that were in the XML source but did not get imported into Postgres
#*# OrderedLocusNames IDs that were in Postgres but did not get exported to the GenMAPP Gene Database
#* Report on what changes were made to the GenMAPP Builder code in order to accommodate the second and third type of missing gene IDs and the result of those changes
# Introduce the experiment performed in the microarray paper, including the experimental design flow chart
# Report results of the DNA microarray analysis
#* Include a table that shows the results of your "Sanity Check", i.e., how many genes were significantly increased and decreased at different p value cut-offs in the dataset?
#* Include the criteria you used for a significant increase and decrease in expression for your GenMAPP Expression Dataset
#* Table of filtered MAPPFinder results (from ''.xls'' or ''.xlsx'')
#** Show a list of 15-20 non-redundant GO terms.
#** Include in your table the GO ID, the name of the GO term, the number changed/number present and the percent (e.g., 10/20 (50%)), the number present/number in GO and the percent, the regular p value and adjusted p value.
#* GenMAPP MAPP of a pathway relevant to your results
* '''''Your PowerPoint slides must be uploaded to the wiki and linked to from your individual journal page and your team page by midnight, Tuesday, December 15.'''''
** You can update your slides before your presentation, but we will be grading the ones you upload by the deadline.
* Your presentation (both the slides and the oral presentation) will be evaluated by the instructors using the [[Presentation Rubric]].
* Your presentation will also be evaluated by your fellow classmates (anonymously) who will answer the following questions:
*# What is the speaker's take-home message (one short sentence)?
*# What are the best points about the presentation's organization, visuals, and delivery? Please give at least 2 specific examples.
*# What points need improvement? Please give at least 2 specific examples.
* We expect that you will take the feedback from your previous presentation into account when doing this presentation.

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:GenMAPP_schema_generic_bacteria_20151210.zip | GenMAPP_schema_generic_bacteria_20151210.zip]] 
*** Sample schema in jpeg format: [[Media:GenMAPP_schema_generic_bacteria_20151210.jpg | GenMAPP_schema_generic_bacteria_20151210.jpg]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

== Individual Assessment and Reflection ==

Each person on the team will complete an assessment and reflection ''individually''. If you are comfortable with making this assessment publicly available, you may write it up as a wiki page or as a Word document uploaded to your group deliverables page. If you prefer to communicate your assessment privately, then email it to both Drs. Dahlquist and Dionisio.

=== Statement of Work ===

* Describe exactly what you did on the project.
** As the quality assurance member of the group, I was responsible for identifying the valid IDs that needed to be exported into the customized gene database for ''Shigella flexneri''. I was tasked with finding IDs that exist in the UniProt XML file but were not exported into the .gdb file that our GenMAPP Users would work with. I advised the Coder on what to put in the customized species profile and tested each build he made to ensure that we captured the IDs we needed and did not break our existing builds. I recorded the results of each build in our Gene Database Testing Report.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Gene Database Project Links}}

[[Category:Group Projects]]

OTS Deliverables

2015-12-18T20:26:57Z

Troque: /* OTS Group Files and Datasets */ Linking gene database report

{{Template:Oregon Trail Survivors}}
==OTS Group Files and Datasets==

[[Media:GenMAPP Builder 12 14 2015 Number 2.zip | Gene Database .gdb]]

[[Media:ReadMe Sf-Std External 20151214.pdf | ReadMe]]

[[Media:ShigellaGeneDatabaseSchema.pdf | Gene Database Schema]]

[[Media:Gene Database Testing Report for Shigella flexneri 2a str 301.pdf | Gene Database Testing Report (.pdf)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.xlsx | Compiled Raw Microarray Dataset (.xlsx)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.txt | Data Used for Import into GenMAPP (.txt)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210 (1).gex | GenMAPP Expression Dataset File (.gex)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.EX.txt | Exceptions file (.EX.txt)]]

[[Media:Criterion.GOfiles.zip | Raw MAPPFinder results files (-GO.txt)]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gmf | .gmf file]]

[[Media:Filtered MAPPFinder Results.xlsx | Filtered MAPPFinder Results .xlsx]]

[[Media:MAPPFinderResults.zip | Filtered MAPPFinder Results (common GO terms highlighted) .png]]

[[Media:RPRX MAPPs.zip | .zip of .mapp s of relevant genes]]

Group Report (.doc or .pdf)

[[Media:FinalOTSPresentation.pptx | Final PowerPoint Presentation]]

==Individual Reflections==

[[Kzebrows Individual Reflection | Kristin Zebrowski]]

[[Eyanosch Individual Reflection | Erich Yanoschik]]

[[Jwoodlee Individual Reflection | Jake Woodlee]]

[[Troque Individual Reflection | Trixie Roque]]

File:Gene Database Testing Report for Shigella flexneri 2a str 301.pdf

2015-12-18T20:26:20Z

Troque: Uploading gene database report

Uploading gene database report

Troque Individual Reflection

2015-12-18T19:40:51Z

Troque: Creating this page

== Statement of Work ==

== Assessment of Project ==

== Reflection on the Project ==

OTS Deliverables

2015-12-18T19:39:25Z

Troque: /* Individual Reflections */ Linked Trixie Roque individual reflection

{{Template:Oregon Trail Survivors}}
==OTS Group Files and Datasets==

[[Media:GenMAPP Builder 12 14 2015 Number 2.zip | Gene Database .gdb]]

[[Media:ReadMe Sf-Std External 20151214.pdf | ReadMe]]

[[Media:ShigellaGeneDatabaseSchema.pdf | Gene Database Schema]]

Gene Database Testing Report (.pdf)

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.xlsx | Compiled Raw Microarray Dataset (.xlsx)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.txt | Data Used for Import into GenMAPP (.txt)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210 (1).gex | GenMAPP Expression Dataset File (.gex)]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.EX.txt | Exceptions file (.EX.txt)]]

[[Media:Criterion.GOfiles.zip | Raw MAPPFinder results files (-GO.txt)]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gmf | .gmf file]]

[[Media:Filtered MAPPFinder Results.xlsx | Filtered MAPPFinder Results .xlsx]]

[[Media:MAPPFinderResults.zip | Filtered MAPPFinder Results (common GO terms highlighted) .png]]

[[Media:RPRX MAPPs.zip | .zip of .mapp s of relevant genes]]

Group Report (.doc or .pdf)

[[Media:FinalOTSPresentation.pptx | Final PowerPoint Presentation]]

==Individual Reflections==

[[Kzebrows Individual Reflection | Kristin Zebrowski]]

[[Eyanosch Individual Reflection | Erich Yanoschik]]

[[Jwoodlee Individual Reflection | Jake Woodlee]]

[[Troque Individual Reflection | Trixie Roque]]

File:QA files OTS 20151216.zip

2015-12-16T21:06:25Z

Troque: Uploading new QA files

Uploading new QA files

OTS Deliverables

2015-12-16T20:43:41Z

Troque: /* OTS Group Files and Datasets */ Linking ReadMe

==OTS Group Files and Datasets==

[[Media:GenMAPP Builder 12 14 2015 Number 2.zip | Gene Database .gdb]]

[[Media:ReadMe Sf-Std External 20151214.pdf | ReadMe]]

Gene Database Schema

Gene Database Testing Report (.pdf)

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.xlsx | Final Compiled Raw Data .xlsx]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210.txt | Final Compiled Raw Data .txt]]

[[Media:FINAL CompiledRawData RXRP EYKZ20151210 (1).gex | Final Compiled Raw Data .gex]]

==OTS Files==

[[Media:Micro Array Shigella Flexneri 20151011.pdf | Shigella Flexneri Microarray Paper (PDF)]]

[[Media:Shigellamicroarray.pptx | Microarray Journal Club Power Point]]

[[Media:CompiledRaw data RPRX IDLR EYKZ20152211.xlsx | Microarray Compiled Raw Data RP/RX IDLR]]

[[Media:SamplesFilesCorrespondanceTable SF301a EYKZ201522111.xls | Microarray Corresponding Files Table]]

[[Media: GMBuilder Shigella flexneri.zip]]

[[Media: QA Files.zip | Download QA files]]

[[Media:GMBuilder December7 2015 build 2.zip]]

[[Media:GenMAPP Builder 12 14 2015.zip]]

[[Media:FinalOTSPresentation.pptx | '''Final PowerPoint Presentation''']]

==GenMAPP User Files==
[[Media:CompiledRaw data RPRX IDLR EYKZ2015121.xlsx | ScalingCentering file 12/1 .xlsx]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gmf | Compiled Raw Data 12/8 .gmf]]

[[Media:Filtered MAPPFinder Results.xlsx | Filtered MAPPFinder Results .xlsx]]

[[Media:MAPPFinderResults.zip | Filtered MAPPFinder Results (common GO terms highlighted) .png]]

[[Media:Flagellum Ribosomal Mapp 1 60min 20151012.jpg | RP vs RX 1 MIC @ 60 minutes MAPP 12/12 .jpg]]

[[Media:Flagellum Ribosomal Mapp 0pt5 10min 20151012.jpg | RP vs RX 0.5 MIC @ 10 minutes MAPP 12/12 .jpg]]

====RP (Erich)====
[[Media:CompiledRaw data GenMAPP Final RP IDLR EYKZ2015126.xlsx | RP Compiled Raw Data Final 12/10]]

[[Media:CompiledRaw data GenMAPP ready RP IDLR EYKZ2015126.txt | RP .txt format GenMAPP ready 12/6]]

[[Media:CompiledRaw data Errors RP EYKZ2015126.EX.xlsx | RP Exceptions file in Excel format (filtered)]]

[[Media:CompiledRaw data Errors RP EYKZ2015126.EX.txt | RP Exceptions (txt)]]

[[Media:CompiledRaw data GenMAPP ready RP IDLR EYKZ2015126.gex | RP .gex file]]

====RX (Kristin)====
[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.xlsx | RX Compiled Raw Data 12/6]]

[[Media:CompiledRaw data statistics BonferroniPvalue RP IDLR EYKZ2015126.txt | RX .txt format GenMAPP ready 12/6]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.xlsx | RX Compiled Raw Data as of 12/6]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.txt | RX .txt format updated as of 12/6]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.EX.txt | RX Exceptions file]]

[[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gex | RX .gex file]]

[[Media:CompiledRaw data RX IDLR KZ2015126.EX.xlsx | RX Exceptions file in Excel format (filtered)]]

{{Template:Oregon Trail Survivors}}

File:ReadMe Sf-Std External 20151214.pdf

2015-12-16T20:42:35Z

Troque: Uploading ReadMe

Uploading ReadMe

Gene Database Testing Report - Oregon Trail Survivors

2015-12-15T08:22:52Z

Troque: /* Export Information (final) */ linked .gdb file

{{Template:Oregon Trail Survivors}}

== Things to note ==
* Taxonomy ID: 623
* UP000001006
* File management system: Wiki

== Initial (Vanilla) Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella flexneri 20151911'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.48 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''7.00 minutes'''
* Time taken to process: '''4.99 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151119 OTS.gdb | Sf-Std_20151119_OTS.gdb]]'''
* Time taken to export: '''1 hour, 32 minutes, 33 seconds'''
** Start time: '''4:06:13 PM PDT'''
** End time: '''5:38:46 PM PDT'''
** Note:

== Export Information for Build with Coder Changes # 1 ==
=== Build 1 ===
Name of .gdb file: '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: '''4 hours, 10 minutes, 46 seconds'''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** I have confirmed that the necessary information exists in the .gdb file for the new build (e.g., the URL of the database we are using).

=== TallyEngine ===

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
** Here is the screenshot of the tally result:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
[[Image:TallyEngine results OTS 112115.jpg]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
* In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
* As a result, I had to <code>cd</code> into these directories before running Match.
** To change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following command in the command prompt window:
<code>T: && cd "Shigella_flexneri_BioDB_2015\Working"</code>
* Once inside the working directory, I ran:
<code>java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml</code>
* The results are as follows:
<div class="center">
[[Image:Match results OTS 112115.jpg]]
</div>
These results did not match what TallyEngine reported (TallyEngine: 7567 vs. Match: 4610).
* As a result, the regular expression had to be broadened so that the counts agree: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* The full command, with its output redirected to a text file, is:
<code>java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt</code>
* Then our results became:
<div class="center">
[[Image:Match results OTS 20151203 more accurate.jpg]]
</div>
* Observations:
** To reduce the number of matches, we added the end tag "</name>" to our regular expression. This brought the count down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were still not being caught. To account for them, we added the IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This gave us 7566 gene IDs.
** In Microsoft Access, the IDs total 7569. To account for this last piece of gene formatting, we also had to handle the IDs of the form SF?####/SF?####. The 2 extra genes that TallyEngine did not count are not supposed to be separated, since the formatting suggests that the two IDs are interchangeable. When the .gdb file was created, these IDs appear to have been split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I was not able to hit the number reported by TallyEngine exactly, since some IDs of the same format were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF####, not S#### or CP####, so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code> (note that <code>SF?</code> still matches the S#### form; dropping the <code>?</code> restricts the match to SF####).
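
The effect of broadening the pattern can be reproduced on a small scale. The following is a minimal Python sketch, not the XMLPipeDB Match tool itself: the sample XML is a hypothetical stand-in for the UniProt file, and it assumes that Match counts non-overlapping occurrences of the pattern in the way <code>re.finditer</code> does.

```python
import re

# Hypothetical miniature stand-in for the UniProt XML file (illustration only).
SAMPLE_XML = """
<gene><name type="ordered locus">SF0001</name></gene>
<gene><name type="ordered locus">S0002</name></gene>
<gene><name type="ordered locus">CP0010</name></gene>
<gene><name type="ordered locus">SF1234.1</name></gene>
<gene><name type="ordered locus">SF2223/SF2224</name></gene>
<dbReference type="EnsemblBacteria" id="SF9999"/>
"""

# First attempt: matches SF#### anywhere, including inside other tags.
NARROW = re.compile(r"SF[0-9][0-9][0-9][0-9]")

# Broadened pattern from the report: CP, S, or SF prefix, optional ".#" suffix,
# and a required "/" or "</name>" right after the ID.
BROAD = re.compile(r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)")


def count_matches(pattern, text):
    """Count non-overlapping occurrences of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


narrow_count = count_matches(NARROW, SAMPLE_XML)
broad_count = count_matches(BROAD, SAMPLE_XML)
print(narrow_count, broad_count)
```

On this sample the narrow pattern picks up the ID inside the <code>dbReference</code> tag but misses the S#### and CP#### forms, while the broadened pattern does the opposite, which is the same trade-off described in the observations.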



=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The command used to count the number of IDs is:
<code>select count(*) from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code>
<div class="center">
[[Image:Postgres results OTS 20151203.jpg]]
</div>
* The result above is exactly twice the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs.
* Running <code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code> and exporting the results to Excel reveals why: every single entry is entered twice:
<div class="center">
[[Image:Postgres results excel form OTS 20151203.jpg]]
</div>
* Adding the keyword "distinct" resolves the double counting:
<code>select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code>
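
The double counting and the effect of "distinct" can be illustrated with a small, self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, so the <code>~</code> operator is replaced by a user-defined <code>regexp</code> function; the table contents are hypothetical and only the table and column names come from the queries above.

```python
import re
import sqlite3

PATTERN = r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?"


def regexp(pattern, value):
    """Backs SQLite's REGEXP operator; returns 1 on a match, else 0."""
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("create table genenametype (type text, value text)")

# Hypothetical rows; inserting the list twice mimics importing the same file twice.
rows = [
    ("ordered locus", "SF0001"),
    ("ordered locus", "S0002"),
    ("ordered locus", "CP0010"),
    ("ORF", "SF_p0001"),
]
conn.executemany("insert into genenametype values (?, ?)", rows + rows)

total = conn.execute(
    "select count(*) from genenametype "
    "where type = 'ordered locus' and value regexp ?",
    (PATTERN,),
).fetchone()[0]

distinct = conn.execute(
    "select count(distinct value) from genenametype "
    "where type = 'ordered locus' and value regexp ?",
    (PATTERN,),
).fetchone()[0]

print(total, distinct)
```

The plain count is twice the distinct count, mirroring the 15134 versus 7567 discrepancy. Using "distinct" hides the symptom in a query, but the underlying duplicate rows remain in the database.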

=== Analysis ===
* The total number of OrderedLocusNames in TallyEngine is '''7567'''.
* Using the best regular expression I could construct in Match, the result is '''7573'''. The additional 6 IDs arise because they are already captured by the regular expression <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name></code>, and capturing the IDs of the form <code>SF?####/SF?####</code> as well counts them a second time.
* The total number of entries in PostgreSQL is '''15134''', but only because each gene is entered twice. Dividing by 2 yields '''7567'''.
* Microsoft Access yielded '''7569''' in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form <code>SF?####/SF?####</code>, since the export split the paired IDs that represent the same ID.
** 49 are of the form <code>CP####</code>
** 3413 are of the form <code>S####</code>
*** 14 are of the form <code>S####.#</code>
** 4107 are of the form <code>SF####</code>
*** 35 are of the form <code>SF####.#</code>
* Inspecting the UniProt XML file was necessary to identify the IDs. Looking through its contents, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match.
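
The breakdown by ID form can be checked with a short Python sketch. The classifier and the sample IDs are hypothetical illustrations, not part of GenMAPP Builder; the three counts are the ones listed above, and the sketch assumes (as the list indentation suggests) that the <code>.#</code> counts are subsets of their parent forms.

```python
import re
from collections import Counter

# Each form must fit the whole ID (fullmatch), so "SF1234.1" is never
# reported as plain "SF####".
FORMS = [
    ("CP####", re.compile(r"CP[0-9]{4}")),
    ("SF####.#", re.compile(r"SF[0-9]{4}\.[0-9]")),
    ("SF####", re.compile(r"SF[0-9]{4}")),
    ("S####.#", re.compile(r"S[0-9]{4}\.[0-9]")),
    ("S####", re.compile(r"S[0-9]{4}")),
]


def classify(locus_id):
    """Return the name of the form that locus_id fits, or 'other'."""
    for name, pattern in FORMS:
        if pattern.fullmatch(locus_id):
            return name
    return "other"


def tally(ids):
    """Count how many IDs fall under each form."""
    return Counter(classify(i) for i in ids)


# Hypothetical IDs, for illustration only.
sample_ids = ["SF0001", "S0002", "CP0010", "SF1234.1", "S4321.2", "b0001"]
sample_tally = tally(sample_ids)

# Top-level counts reported in the analysis.
reported = {"CP####": 49, "S####": 3413, "SF####": 4107}
reported_total = sum(reported.values())
print(sample_tally, reported_total)
```

The three top-level counts sum to 7569, which agrees with the number shown in the OrderedLocusNames window in Microsoft Access.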

== "Export" from Build 2 ==
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note: This export had to be redone since the PSQL database had twice as many entries as it should have.
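
The export durations in this report can be recomputed from the recorded start and end times. The helper below is a minimal Python sketch with a hypothetical function name; it assumes 12-hour clock strings without a time zone suffix and that an export never runs longer than 24 hours.

```python
from datetime import datetime, timedelta


def elapsed(start, end, fmt="%I:%M:%S %p"):
    """Return end - start, rolling over to the next day if end is earlier."""
    start_time = datetime.strptime(start, fmt)
    end_time = datetime.strptime(end, fmt)
    if end_time < start_time:
        # The export ran past midnight.
        end_time += timedelta(days=1)
    return end_time - start_time


build2 = elapsed("9:13:45 PM", "1:37:46 AM")
vanilla = elapsed("4:06:13 PM", "5:38:46 PM")
print(build2, vanilla)
```

For the Build 2 export, which crossed midnight, this gives 4 hours, 24 minutes, and 1 second, matching the recorded duration.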

== Export Information (Re-imported) Build 2 ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''6.84 minutes'''
* Time taken to process: '''5.49 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''4:30:59 PM PDT'''
** End time: '''6:09:41 PM PDT'''
** Note: I had to re-import everything into a new database because the one I had been using had some files imported twice. As a result, the counts reported by Postgres were all doubled.

=== Using TallyEngine ===
* The database used is the same one described in the section above: '''Shigella_flexneri_20151208'''
* Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.
<div class="center">
[[Image:Shigella flexneri tallyEngine results build 2.png]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
<div class="center">
[[Image:regex1_OTS.png]]
</div>

<div class="center">
[[Image:regex2_OTS.png]]
</div>

* When added together, the results become 7566 + 3 = 7569.

=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The following command in PostgreSQL resulted in 7567 entries:
<code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code>
* The following command resulted in 214 entries:
<code>select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';</code>

=== OriginalRowCounts Comparison ===
<div class="center">
[[Image:Ms access originalrowcounts.png]]
</div>
* The OrderedLocusNames row seems to report the same number of IDs as our previous builds.

=== Visual Inspection ===
Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** Yes, all of them seem to follow the same format (there are roughly 3 variations on the IDs in each of the tables).

=== Excel Inspection ===
* [[Media:In-search-of-the-missing-ids.xlsx| Excel file]]

=== Observations ===
* Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:
<div class="center">
[[Image: Dual ordered locus names.png]]
</div>
* These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:
<div class="center">
[[Image: Simple ordered locus names.png]]
</div>
* Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:

* Example 1:
<div class="center">
[[Image: Match pic.png]]
</div>
* Example 2:
<div class="center">
[[Image: Id misnomers2.png]]
</div>
* However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:
<div class="center">
[[Image: Id misnomers.png]]
</div>
* All these observations led us to make one final build to capture those ~92 gene IDs.

== Export Information (final) ==
* Date: 12/14/15
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151214.gdb | Sf-Std 20151214.gdb]]'''
* Time taken to export: '''2 hours, 0 minutes, 27 seconds'''
** Start time: '''9:35:00 PM PDT'''
** End time: '''11:35:27 PM PDT'''

=== Using TallyEngine ===
* The results are shown below:
<div class="center">
[[Image:Tally results build2.png]]
</div>

=== Using Microsoft Access ===
* Even though the TallyEngine results show different numbers, the OrderedLocusNames that were exported into the .gdb file are the following:
<div class="center">
[[Image:Orderedlocusnames.png]]
</div>

=== .gdb File ===
* The resulting .gdb file can be downloaded [[Media: Sf-Std 20151214.gdb | here]].

File:Sf-Std 20151214.gdb

2015-12-15T08:22:35Z

Troque: Uploading final build

Uploading final build

File:Orderedlocusnames.png

2015-12-15T08:19:55Z

Troque: uploading orderedlocus from ms access

uploading orderedlocus from ms access

Gene Database Testing Report - Oregon Trail Survivors

2015-12-15T08:19:32Z

Troque: /* Export Information (final) */ added picture

{{Template:Oregon Trail Survivors}}

== Things to note ==
* Taxonomy ID: 623
* UP000001006
* File management system: Wiki

== Initial (Vanilla) Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella flexneri 20151911'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.48 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''7.00 minutes'''
* Time taken to process: '''4.99 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151119 OTS.gdb | Sf-Std_20151119_OTS.gdb]]'''
* Time taken to export: '''1 hour, 32 minutes, 33 seconds'''
** Start time: '''4:06:13 PM PDT'''
** End time: '''5:38:46 PM PDT'''
** Note:

== Export Information for Build with Coder Changes # 1 ==
=== Build 1 ===
Name of .gdb file: '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: '''4 hours, 10 minutes, 46 seconds'''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** I have confirmed that the necessary information exists in the .gdb file for the new build (e.g., the URL of the database we are using).

=== TallyEngine ===

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
** Here is the screenshot of the tally result:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
[[Image:TallyEngine results OTS 112115.jpg]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
* In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
* As a result, I had to <code>cd</code> into these directories before running Match.
** To change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following command in the command prompt window:
<code>T: && cd "Shigella_flexneri_BioDB_2015\Working"</code>
* Once inside the working directory, I ran:
<code>java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml</code>
* The results are as follows:
<div class="center">
[[Image:Match results OTS 112115.jpg]]
</div>
These results did not match what TallyEngine reported (TallyEngine: 7567 vs. Match: 4610).
* As a result, the regular expression had to be broadened so that the counts agree: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* The full command, with its output redirected to a text file, is:
<code>java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt</code>
* Then our results became:
<div class="center">
[[Image:Match results OTS 20151203 more accurate.jpg]]
</div>
* Observations:
** To reduce the number of matches, we added the end tag "</name>" to our regular expression. This brought the count down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were still not being caught. To account for them, we added the IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This gave us 7566 gene IDs.
** In Microsoft Access, the IDs total 7569. To account for this last piece of gene formatting, we also had to handle the IDs of the form SF?####/SF?####. The 2 extra genes that TallyEngine did not count are not supposed to be separated, since the formatting suggests that the two IDs are interchangeable. When the .gdb file was created, these IDs appear to have been split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I was not able to hit the number reported by TallyEngine exactly, since some IDs of the same format were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF####, not S#### or CP####, so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code> (note that <code>SF?</code> still matches the S#### form; dropping the <code>?</code> restricts the match to SF####).



=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The command used to count the number of IDs is:
<code>select count(*) from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code>
<div class="center">
[[Image:Postgres results OTS 20151203.jpg]]
</div>
* The result above is exactly twice the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs.
* Running <code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code> and exporting the results to Excel reveals why: every single entry is entered twice:
<div class="center">
[[Image:Postgres results excel form OTS 20151203.jpg]]
</div>
* Adding the keyword "distinct" would resolve the double counting:
select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
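The double counting can be reproduced on a small scale. The sketch below is an assumption-laden stand-in: it uses Python's built-in <code>sqlite3</code> instead of PostgreSQL, a made-up three-row table, and it leaves out the <code>~</code> regular expression filter because SQLite has no such operator by default.

```python
import sqlite3

# In-memory stand-in for the genenametype table (hypothetical IDs).
conn = sqlite3.connect(":memory:")
conn.execute("create table genenametype (type text, value text)")

ids = ["SF0001", "S0002", "CP0003"]
# Insert every row twice to mimic a file that was imported twice.
rows = [("ordered locus", v) for v in ids] * 2
conn.executemany("insert into genenametype values (?, ?)", rows)

total = conn.execute(
    "select count(*) from genenametype where type = 'ordered locus'"
).fetchone()[0]
unique = conn.execute(
    "select count(distinct value) from genenametype where type = 'ordered locus'"
).fetchone()[0]

print(total, unique)
```

Here <code>count(*)</code> reports twice the number of IDs, while <code>count(distinct value)</code> reports each ID once.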

=== Analysis ===
* The total number of OrderedLocusNames in TallyEngine is '''7567'''.
* Using the best regular expression I could write for Match, the result is '''7573'''. The additional 6 IDs appear because they are already captured by the regular expression <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name></code>, so also capturing the IDs of the form <code>SF?####/SF?####</code> counts them a second time.
* The total number of entries in PostgreSQL is '''15134''', but this is only because each gene is entered twice. Dividing by 2 yields '''7567'''.
* Microsoft Access showed '''7569''' in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form <code>SF?####/SF?####</code>, since the export broke up two IDs that represent the same gene.
** 49 are of the form <code>CP####</code>
** 3413 are of the form <code>S####</code>
*** 14 are of the form <code>S####.#</code>
** 4107 are of the form <code>SF####</code>
*** 35 are of the form <code>SF####.#</code>
* Inspecting the UniProt XML file was necessary to identify the IDs. Looking through its contents, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match.
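The counts above can be cross-checked with a few lines of arithmetic. This sketch assumes that the per-form counts (CP####, S####, SF####) are a breakdown of the Microsoft Access total and that the S####.# and SF####.# counts are subsets of their parent forms, as the nesting of the list suggests.

```python
# Counts reported in the analysis above.
tally_engine = 7567
match_total = 7573
postgres_rows = 15134
access_total = 7569

# Per-form counts; the ".#" variants are assumed to be included in their parents.
by_form = {"CP####": 49, "S####": 3413, "SF####": 4107}

postgres_unique = postgres_rows // 2          # every entry was entered twice
form_sum = sum(by_form.values())              # should equal the Access total
extra_in_access = access_total - tally_engine
extra_in_match = match_total - tally_engine

print(postgres_unique, form_sum, extra_in_access, extra_in_match)
```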

== "Export" from Build 2 ==
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note: This export had to be redone since the PSQL database had twice as many entries as it should have.

== Export Information (Re-imported) Build 2 ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''6.84 minutes'''
* Time taken to process: '''5.49 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''4:30:59 PM PDT'''
** End time: '''6:09:41 PM PDT'''
** Note: I had to re-import everything into a new database because the one I had been using had some files imported twice. As a result, all of the counts reported by PostgreSQL were doubled.

=== Using TallyEngine ===
* The database used is the same one described in the section above: '''Shigella_flexneri_20151208'''
* Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.
<div class="center">
[[Image:Shigella flexneri tallyEngine results build 2.png]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
<div class="center">
[[Image:regex1_OTS.png]]
</div>

<div class="center">
[[Image:regex2_OTS.png]]
</div>

* When added together, the results become 7566 + 3 = 7569.

=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The following command in PostgreSQL returned 7567 entries:
select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
* The following command resulted in 214 entries:
select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';
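The two patterns differ only in the optional <code>(_p)?</code> group. The sketch below tries the ORF pattern on a handful of made-up values (none of them are taken from the database) and uses Python's <code>re.search</code>, which, like the PostgreSQL <code>~</code> operator, succeeds when the pattern matches anywhere in the value.

```python
import re

# ORF pattern from the query above.
orf_pattern = re.compile(r"(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?")

# Hypothetical values, for illustration only.
values = ["SF_p0001", "CP0001", "S1234.1", "orf19", "SF12"]

# re.search is unanchored, so a value passes if any part of it matches.
matched = [v for v in values if orf_pattern.search(v)]

print(matched)
```

Because the match is unanchored, a longer value that merely contains an ID-like substring would also pass. Wrapping the pattern in <code>^</code> and <code>$</code> is one way to require that the whole value has the expected form.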

=== OriginalRowCounts Comparison ===
<div class="center">
[[Image:Ms access originalrowcounts.png]]
</div>
* The OrderedLocusNames row reports the same number of IDs as our previous builds.

=== Visual Inspection ===
Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** Yes, all of them seem to follow the expected format (there are, more or less, 3 variations on the IDs in each of the tables).

=== Excel Inspection ===
* [[Media:In-search-of-the-missing-ids.xlsx| Excel file]]

=== Observations ===
* Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:
<div class="center">
[[Image: Dual ordered locus names.png]]
</div>
* These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:
<div class="center">
[[Image: Simple ordered locus names.png]]
</div>
* Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:

* Example 1:
<div class="center">
[[Image: Match pic.png]]
</div>
* Example 2:
<div class="center">
[[Image: Id misnomers2.png]]
</div>
* However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:
<div class="center">
[[Image: Id misnomers.png]]
</div>
* All these observations led us to make one final build to capture those ~92 gene IDs.

== Export Information (final) ==
* Date: 12/14/15
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''2 hours, 0 minutes, 27 seconds'''
** Start time: '''9:35:00 PM PDT'''
** End time: '''11:35:27 PM PDT'''
=== Using TallyEngine ===
* The results are shown below:
<div class="center">
[[Image:Tally results build2.png]]
</div>

File:Tally results build2.png

2015-12-15T08:18:35Z

Troque: uploading a new one

uploading a new one

Gene Database Testing Report - Oregon Trail Survivors

2015-12-15T06:40:47Z

Troque: /* Export from "Build 2" */ Edited header

{{Template:Oregon Trail Survivors}}

== Things to note ==
* Taxonomy ID: 623
* UP000001006
* File management system: Wiki

== Initial (Vanilla) Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella flexneri 20151911'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.48 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''7.00 minutes'''
* Time taken to process: '''4.99 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151119 OTS.gdb | Sf-Std_20151119_OTS.gdb]]'''
* Time taken to export: ''' 1 Hour, 32 Minutes, 33 Seconds '''
** Start time: '''4:06:13 PM PDT'''
** End time: '''5:38:46 PM PDT'''
** Note:

== Export Information for Build with Coder Changes # 1 ==
=== Build 1 ===
Name of .gdb file: '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: ''' 4 hours, 10 minutes, 46 seconds '''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** I have confirmed that the necessary information exists in the .gdb file for the new build (e.g., the URL of the database we are using).

=== TallyEngine ===

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
** Here is the screenshot of the tally result:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
[[Image:TallyEngine results OTS 112115.jpg]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
* In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
* As a result, I had to cd into the working directory before running Match.
** To change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following commands in the command prompt window:
T: && cd "Shigella_flexneri_BioDB_2015\Working"
* Once inside that directory, the command I used is:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml
* The results are as follows:
<div class="center">
[[Image:Match results OTS 112115.jpg]]
</div>
These results did not match what the TallyEngine gave (TallyEngine: 7567 vs. Match: 4610).
* As a result, the regular expression had to be modified so that the numbers match: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* The overall command to write to a text file is as follows:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt
* Then our results became:
<div class="center">
[[Image:Match results OTS 20151203 more accurate.jpg]]
</div>
* Observations:
** To reduce the number of matches, we had to add the end tag "</name>" to our regular expression. This brought the number of matches down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were still not being caught. To account for them, we added the IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This brought us to 7566 gene IDs.
** When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to account for the IDs of the form SF?####/SF?####. The 2 extra genes that TallyEngine did not count are not actually supposed to be separated, since the formatting can be interpreted to mean that the two IDs are interchangeable. When the gdb file was created, it seems that these names were split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I wasn't able to match the TallyEngine count exactly, since some IDs of this form were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF#### instead of S#### or CP#### so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
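To see how the slash-joined names can inflate a count, the short sketch below splits the three names listed above at the "/". It only illustrates the splitting itself; how GenMAPP Builder or Match actually treats these names is not shown here.

```python
# The three slash-joined ordered locus names listed above.
joined = ["SF2223/SF2224", "S2352/S2353", "S3359/S3360"]

# Splitting each name at "/" yields two separate IDs per name.
parts = [part for name in joined for part in name.split("/")]

print(len(joined), len(parts), parts)
```

Three names become six IDs once split, so a tool that counts each half separately will report more IDs than one that counts each name once.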



=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The command used to count the number of IDs is:
select count(*) from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
<div class="center">
[[Image:Postgres results OTS 20151203.jpg]]
</div>
* The result above is exactly twice the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs
* Running the command <code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code> and exporting the results to Excel reveals why: every single entry is entered twice:
<div class="center">
[[Image:Postgres results excel form OTS 20151203.jpg]]
</div>
* Adding the keyword "distinct" would resolve the double counting:
select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';

=== Analysis ===
* The total number of OrderedLocusNames in TallyEngine is '''7567'''.
* Using the best regular expression I could write for Match, the result is '''7573'''. The additional 6 IDs appear because they are already captured by the regular expression <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name></code>, so also capturing the IDs of the form <code>SF?####/SF?####</code> counts them a second time.
* The total number of entries in PostgreSQL is '''15134''', but this is only because each gene is entered twice. Dividing by 2 yields '''7567'''.
* Microsoft Access showed '''7569''' in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form <code>SF?####/SF?####</code>, since the export broke up two IDs that represent the same gene.
** 49 are of the form <code>CP####</code>
** 3413 are of the form <code>S####</code>
*** 14 are of the form <code>S####.#</code>
** 4107 are of the form <code>SF####</code>
*** 35 are of the form <code>SF####.#</code>
* Inspecting the UniProt XML file was necessary to identify the IDs. Looking through its contents, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match.

== "Export" from Build 2 ==
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note: This export had to be redone since the PSQL database had twice as many entries as it should have.
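The export duration can be checked from the recorded start and end times. The helper below is a hypothetical convenience function, not part of GenMAPP Builder, and it assumes that an end time earlier than the start time means the export ran past midnight.

```python
from datetime import datetime, timedelta

def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two 12-hour clock readings such as '9:13:45 PM'."""
    fmt = "%I:%M:%S %p"
    s = datetime.strptime(start, fmt)
    e = datetime.strptime(end, fmt)
    if e < s:
        # Assume the run crossed midnight into the next day.
        e += timedelta(days=1)
    return e - s

build2 = elapsed("9:13:45 PM", "1:37:46 AM")
print(build2)
```

For the start and end times recorded above, this gives 4 hours, 24 minutes and 1 second, which agrees with the reported export time.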

== Export Information (Re-imported) Build 2 ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''6.84 minutes'''
* Time taken to process: '''5.49 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''4:30:59 PM PDT'''
** End time: '''6:09:41 PM PDT'''
** Note: I had to re-import everything into a new database because the one I had been using had some files imported twice. As a result, all of the counts reported by PostgreSQL were doubled.

=== Using TallyEngine ===
* The database used is the same one described in the section above: '''Shigella_flexneri_20151208'''
* Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.
<div class="center">
[[Image:Shigella flexneri tallyEngine results build 2.png]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
<div class="center">
[[Image:regex1_OTS.png]]
</div>

<div class="center">
[[Image:regex2_OTS.png]]
</div>

* When added together, the results become 7566 + 3 = 7569.

=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The following command in PostgreSQL returned 7567 entries:
select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
* The following command resulted in 214 entries:
select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';

=== OriginalRowCounts Comparison ===
<div class="center">
[[Image:Ms access originalrowcounts.png]]
</div>
* The OrderedLocusNames row reports the same number of IDs as our previous builds.

=== Visual Inspection ===
Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** Yes, all of them seem to follow the expected format (there are, more or less, 3 variations on the IDs in each of the tables).

=== Excel Inspection ===
* [[Media:In-search-of-the-missing-ids.xlsx| Excel file]]

=== Observations ===
* Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:
<div class="center">
[[Image: Dual ordered locus names.png]]
</div>
* These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:
<div class="center">
[[Image: Simple ordered locus names.png]]
</div>
* Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:

* Example 1:
<div class="center">
[[Image: Match pic.png]]
</div>
* Example 2:
<div class="center">
[[Image: Id misnomers2.png]]
</div>
* However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:
<div class="center">
[[Image: Id misnomers.png]]
</div>
* All these observations led us to make one final build to capture those ~92 gene IDs.
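The same kind of inspection can be scripted. The sketch below is a rough illustration under assumptions: the XML is a made-up fragment in the shape of UniProt gene/name tags (only SF2223/SF2224 is a real name from this report), and it omits the XML namespace that the real UniProt file declares, so the tag lookups would need adjusting for the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like UniProt gene/name tags (namespace omitted).
doc = """<uniprot>
  <entry>
    <gene>
      <name type="primary">abcD</name>
      <name type="ordered locus">SF2223/SF2224</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="ordered locus">SF0001</name>
      <name type="ordered locus">S0002</name>
    </gene>
  </entry>
</uniprot>"""

root = ET.fromstring(doc)

# Collect the text of every name tag whose type is "ordered locus".
locus_names = [n.text for n in root.iter("name") if n.get("type") == "ordered locus"]

# Names holding two IDs in one tag versus names holding a single ID.
dual = [t for t in locus_names if "/" in t]
simple = [t for t in locus_names if "/" not in t]

print(locus_names, dual, simple)
```

Filtering on the <code>type</code> attribute rather than on the ID text alone is what separates real ordered locus names from IDs that merely appear elsewhere in the file.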

== Export Information (final) ==
* Date: 12/14/15
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''9:35 PM PDT'''
** End time: ''' PM PDT'''

Gene Database Testing Report - Oregon Trail Survivors

2015-12-15T06:39:44Z

Troque: /* Export Information (Re-imported) Build 2 */ Added more pictures

{{Template:Oregon Trail Survivors}}

== Things to note ==
* Taxonomy ID: 623
* UP000001006
* File management system: Wiki

== Initial (Vanilla) Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella flexneri 20151911'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.48 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''7.00 minutes'''
* Time taken to process: '''4.99 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151119 OTS.gdb | Sf-Std_20151119_OTS.gdb]]'''
* Time taken to export: ''' 1 Hour, 32 Minutes, 33 Seconds '''
** Start time: '''4:06:13 PM PDT'''
** End time: '''5:38:46 PM PDT'''
** Note:

== Export Information for Build with Coder Changes # 1 ==
=== Build 1 ===
Name of .gdb file: '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: ''' 4 hours, 10 minutes, 46 seconds '''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** I have confirmed that the necessary information exists in the .gdb file for the new build (e.g., the URL of the database we are using).

=== TallyEngine ===

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
** Here is the screenshot of the tally result:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
[[Image:TallyEngine results OTS 112115.jpg]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
* In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
* As a result, I had to cd into the working directory before running Match.
** To change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following commands in the command prompt window:
T: && cd "Shigella_flexneri_BioDB_2015\Working"
* Once inside that directory, the command I used is:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml
* The results are as follows:
<div class="center">
[[Image:Match results OTS 112115.jpg]]
</div>
These results did not match what the TallyEngine gave (TallyEngine: 7567 vs. Match: 4610).
* As a result, the regular expression had to be modified so that the numbers match: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* The overall command to write to a text file is as follows:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt
* Then our results became:
<div class="center">
[[Image:Match results OTS 20151203 more accurate.jpg]]
</div>
* Observations:
** To reduce the number of matches, we had to add the end tag "</name>" to our regular expression. This brought the number of matches down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were still not being caught. To account for them, we added the IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This brought us to 7566 gene IDs.
** When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to account for the IDs of the form SF?####/SF?####. The 2 extra genes that TallyEngine did not count are not actually supposed to be separated, since the formatting can be interpreted to mean that the two IDs are interchangeable. When the gdb file was created, it seems that these names were split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I wasn't able to match the TallyEngine count exactly, since some IDs of this form were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF#### instead of S#### or CP#### so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>



=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The command used to count the number of IDs is:
select count(*) from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
<div class="center">
[[Image:Postgres results OTS 20151203.jpg]]
</div>
* The result above is exactly twice the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs
* Running the command <code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code> and exporting the results to Excel reveals why: every single entry is entered twice:
<div class="center">
[[Image:Postgres results excel form OTS 20151203.jpg]]
</div>
* Adding the keyword "distinct" would resolve the double counting:
select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';

=== Analysis ===
* The total number of OrderedLocusNames in TallyEngine is '''7567'''.
* Using the best regular expression I could write for Match, the result is '''7573'''. The additional 6 IDs appear because they are already captured by the regular expression <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name></code>, so also capturing the IDs of the form <code>SF?####/SF?####</code> counts them a second time.
* The total number of entries in PostgreSQL is '''15134''', but this is only because each gene is entered twice. Dividing by 2 yields '''7567'''.
* Microsoft Access showed '''7569''' in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form <code>SF?####/SF?####</code>, since the export broke up two IDs that represent the same gene.
** 49 are of the form <code>CP####</code>
** 3413 are of the form <code>S####</code>
*** 14 are of the form <code>S####.#</code>
** 4107 are of the form <code>SF####</code>
*** 35 are of the form <code>SF####.#</code>
* Inspecting the UniProt XML file was necessary to identify the IDs. Looking through its contents, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match.

== Export from "Build 2" ==
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note: This export had to be redone since the PSQL database had twice as many entries as it should have.

== Export Information (Re-imported) Build 2 ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''6.84 minutes'''
* Time taken to process: '''5.49 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''4:30:59 PM PDT'''
** End time: '''6:09:41 PM PDT'''
** Note: I had to re-import everything into a new database because the one I had been using had some files imported twice. As a result, all the counts reported by PostgreSQL were doubled.

=== Using TallyEngine ===
* The database used is the same one described in the section above: '''Shigella_flexneri_20151208'''
* Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.
<div class="center">
[[Image:Shigella flexneri tallyEngine results build 2.png]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
<div class="center">
[[Image:regex1_OTS.png]]
</div>

<div class="center">
[[Image:regex2_OTS.png]]
</div>

* When added together, the result becomes 7566 + 3 = 7569.

=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The following command in PostgreSQL resulted in 7567 entries:
select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
* The following command resulted in 214 entries:
select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';
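
One detail worth keeping in mind with these queries: the PostgreSQL <code>~</code> operator matches the pattern anywhere inside the value, so longer strings that merely contain an ID-like substring also pass. The sketch below approximates this with Python's <code>re</code> module (<code>search</code> for the unanchored behaviour, <code>fullmatch</code> for a strict comparison). <code>SF2223</code>, <code>S2352</code>, and <code>SF2223/SF2224</code> come from this report; the other sample values are made up for illustration.

```python
import re

# Same pattern as in the ordered locus query above.
PATTERN = re.compile(r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?")

samples = [
    "SF2223",         # from the report
    "S2352",          # from the report
    "SF2223/SF2224",  # combined name from the report
    "CP0001",         # hypothetical
    "SF1234.5",       # hypothetical
    "XSF22234",       # hypothetical: contains an ID-like substring
    "SF12",           # hypothetical: too few digits
]

for value in samples:
    unanchored = PATTERN.search(value) is not None   # roughly what "~" does
    anchored = PATTERN.fullmatch(value) is not None  # whole value must match
    print(f"{value:15} unanchored={unanchored} anchored={anchored}")
```

If exact-form matching is wanted in the database itself, anchoring the pattern with <code>^</code> and <code>$</code> is the usual way to get it.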

=== OriginalRowCounts Comparison ===
<div class="center">
[[Image:Ms access originalrowcounts.png]]
</div>
* The OrderedLocusNames row seems to report on the same number of IDs as our previous builds

=== Visual Inspection ===
Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** Yes, all of them seem to follow the same format (there are roughly 3 variations of the IDs in each of the tables).

=== Excel Inspection ===
* [[Media:In-search-of-the-missing-ids.xlsx| Excel file]]

=== Observations ===
* Through the use of an XML-reader program, called "firstObject XML Editor", it was discovered that some ordered locus IDs that were exported by GenMAPPBuilder were placed in the same tag:
<div class="center">
[[Image: Dual ordered locus names.png]]
</div>
* These differed from the ones originally captured (7567) since these existed separately in each of the gene/name tags:
<div class="center">
[[Image: Simple ordered locus names.png]]
</div>
* Additionally, from the IDs reported by the GenMAPP users as missing, it was revealed that these do not exist in the XML file at all, or at least in the format that we wanted. These sets of IDs were actually misnomers since, even though CTRL + F lets us find them, they are not the ordered locus names that we were looking for:

* Example 1:
<div class="center">
[[Image: Match pic.png]]
</div>
* Example 2:
<div class="center">
[[Image: Id misnomers2.png]]
</div>
* However, because of these observations, we have actually discovered ~92 IDs that existed within the XML file, albeit in a different tag than what we were using:
<div class="center">
[[Image: Id misnomers.png]]
</div>
* All these observations led us to make one final build to capture those ~92 gene IDs.
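
The difference between counting name tags and counting individual IDs can be illustrated with a small script. This is only a sketch on a made-up XML fragment: the element layout is simplified, the IDs other than <code>SF2223/SF2224</code> are hypothetical, and real UniProt XML files typically declare an XML namespace that is omitted here.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment: one combined name and one gene with
# two separate ordered locus name tags.
SNIPPET = """
<entries>
  <gene><name type="ordered locus">SF0001</name></gene>
  <gene><name type="ordered locus">SF2223/SF2224</name></gene>
  <gene>
    <name type="ordered locus">SF0002</name>
    <name type="ordered locus">S0002</name>
  </gene>
</entries>
""".strip()

root = ET.fromstring(SNIPPET)

# One entry per <name type="ordered locus"> tag.
tags = [n.text for n in root.iter("name") if n.get("type") == "ordered locus"]

# One entry per individual ID, splitting combined names at the "/".
ids = [part for text in tags for part in text.split("/")]

print(len(tags), "tags ->", len(ids), "IDs")
print(ids)
```

A tool that counts tags and a tool that splits on the slash will disagree by one for every combined name, which is the kind of small discrepancy seen between the counts in this report.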

== Export Information (final) ==
* Date: 12/14/15
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''9:35 PM PDT'''
** End time: ''' PM PDT'''

Gene Database Testing Report - Oregon Trail Survivors

2015-12-15T06:24:42Z

Troque: Added other builds

{{Template:Oregon Trail Survivors}}

== Things to note ==
* Taxonomy ID: 623
* UP000001006
* File management system: Wiki

== Initial (Vanilla) Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella flexneri 20151911'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.48 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''7.00 minutes'''
* Time taken to process: '''4.99 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151119 OTS.gdb | Sf-Std_20151119_OTS.gdb]]'''
* Time taken to export: ''' 1 hour, 32 minutes, 33 seconds '''
** Start time: '''4:06:13 PM PDT'''
** End time: '''5:38:46 PM PDT'''
** Note:

== Export Information for Build with Coder Changes # 1 ==
=== Build 1 ===
Name of .gdb file: '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: ''' 4 hours, 10 minutes, 46 seconds '''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** I have confirmed that the necessary information in the .gdb file exists in the new build (e.g. the URL of the database we are using).

=== TallyEngine ===

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Choose the UniProt and GO OBO XML files that were uploaded in the previous sections of this assignment.
** Here is the screenshot of the tally result:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
[[Image:TallyEngine results OTS 112115.jpg]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]
* In the Thawspace directory, I created a folder called "Shigella_flexneri_BioDB_2015" and created subfolders called "Source" and "Working" to store the source files (i.e., the compressed files) and the working files (i.e., the files I will actually be processing).
* As a result, I had to cd to these directories first before using the command for using Match.
** In order to change into the ThawSpace0\Shigella_flexneri_BioDB_2015\Working directory, use the following commands on the command prompt window:
T: && cd "Shigella_flexneri_BioDB_2015\Working"
* The command I used once inside the directory I want is:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "SF[0-9][0-9][0-9][0-9]" < uniprot-proteome%3AUP000001006.xml
* The results are as follows:
<div class="center">
[[Image:Match results OTS 112115.jpg]]
</div>
These results did not match up with what the TallyEngine gave (TallyEngine: 7567 vs. Match: 4610)
* As a result, the regular expression had to be modified so that the numbers match: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* The overall command to write to a text file is as follows:
java -jar xmlpipedb-match-1.1.1/xmlpipedb-match-1.1.1.jar "(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)" < uniprot-proteome%3AUP000001006.xml > shigella_flexneri_results.txt
* Then our results became:
<div class="center">
[[Image:Match results OTS 20151203 more accurate.jpg]]
</div>
* Observations:
** To reduce the number of matches, we had to add the end tag "</name>" to our regular expression. This brought the number of matches down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were not being caught. To account for these, we had to add the genes with IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This led us to 7566 gene IDs.
** When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to include the genes of the form SF?####/SF?####. The 2 extra IDs that TallyEngine did not count are not actually supposed to be separated, since the formatting suggests that the paired IDs are interchangeable. When the .gdb file was created, these names appear to have been split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I wasn't able to hit the exact number output by TallyEngine since other genes with the same format were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF#### instead of S#### or CP#### so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
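
The effect of requiring the <code></name></code> end tag (or a slash) after the ID can be seen on a tiny example. The sketch below uses Python's <code>re</code> module on a made-up XML fragment as a stand-in; it does not reproduce the output format of XMLPipeDB Match, and the <code>dbReference</code> lines and all IDs except <code>SF2223/SF2224</code> are hypothetical.

```python
import re

# Hypothetical fragment: the same ID appears in a name tag and again as an
# attribute value elsewhere, plus one combined name.
XML_TEXT = (
    '<name type="ordered locus">SF0001</name>\n'
    '<name type="ordered locus">SF2223/SF2224</name>\n'
    '<dbReference type="EnsemblBacteria" id="AAN00001">\n'
    '  <property type="gene ID" value="SF0001"/>\n'
    '</dbReference>\n'
)

# Pattern without and with the trailing "/" or "</name>" requirement.
loose = re.compile(r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?")
strict = re.compile(r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)")

loose_hits = [m.group(0) for m in loose.finditer(XML_TEXT)]
strict_hits = [m.group(0) for m in strict.finditer(XML_TEXT)]

print(len(loose_hits), loose_hits)
print(len(strict_hits), strict_hits)
```

The loose pattern also picks up the ID inside the attribute value, while the strict pattern keeps only IDs that end a name tag or precede a slash, so both halves of a combined name are counted.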



=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The command used to count the number of IDs is:
select count(*) from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
<div class="center">
[[Image:Postgres results OTS 20151203.jpg]]
</div>
* The result above is exactly twice the number of OrderedLocusNames from TallyEngine: 15134 / 2 = 7567 IDs
* A quick look at the output of <code>select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';</code>, exported to Excel, reveals the reason: every single entry is entered twice:
<div class="center">
[[Image:Postgres results excel form OTS 20151203.jpg]]
</div>
* Adding the keyword "distinct" would resolve the double counting:
select distinct value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
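
The double counting and the effect of <code>distinct</code> can be reproduced on a toy table. The sketch below uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL; SQLite has no <code>~</code> operator, so a <code>REGEXP</code> function is registered by hand, and the three sample rows are hypothetical.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)

conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
rows = [
    ("ordered locus", "SF0001"),  # hypothetical
    ("ordered locus", "S0002"),   # hypothetical
    ("ORF", "SF_p0003"),          # hypothetical
]
# Insert every row twice to mimic a database whose files were imported twice.
conn.executemany("INSERT INTO genenametype VALUES (?, ?)", rows + rows)

pattern = r"(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?"
where = "WHERE type = 'ordered locus' AND value REGEXP ?"

total = conn.execute(
    f"SELECT count(*) FROM genenametype {where}", (pattern,)
).fetchone()[0]
distinct = conn.execute(
    f"SELECT count(DISTINCT value) FROM genenametype {where}", (pattern,)
).fetchone()[0]

print("count(*):", total)
print("count(distinct value):", distinct)
```

With every row inserted twice, the plain count is exactly double the distinct count, mirroring the 15134 versus 7567 result above.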

=== Analysis ===
* The total number of OrderedLocusNames in TallyEngine is '''7567'''.
* Using the closest regular expression I could write for Match, the result is '''7573'''. The additional 6 IDs appear because they are already captured by the regular expression <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?</name></code>, so also capturing the IDs of the form <code>SF?####/SF?####</code> counts them a second time.
* The total number of entries in PostgreSQL is '''15134''', but only because each gene is entered twice. Dividing by 2 yields '''7567'''.
* Microsoft Access yielded '''7569''' in the OrderedLocusNames window. The extra 2 genes came from the IDs of the form <code>SF?####/SF?####</code> since the export broke up the two IDs that represent the same ID.
** 49 are of the form <code>CP####</code>
** 3413 are of the form <code>S####</code>
*** 14 are of the form <code>S####.#</code>
** 4107 are of the form <code>SF####</code>
*** 35 are of the form <code>SF####.#</code>
* Inspecting the UniProt XML file was necessary to identify the IDs. Looking through its contents, I discovered (with help from Dondi) that I had to add the end tag "</name>" in order to narrow down the results in Match.

== Export from "Build 2" ==
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note: This export had to be redone since the PSQL database had twice as many entries.

== Export Information (Re-imported) Build 2 ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: '''6.84 minutes'''
* Time taken to process: '''5.49 minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: '''0.06 minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''4:30:59 PM PDT'''
** End time: '''6:09:41 PM PDT'''
** Note: I had to re-import everything into a new database because the one I had been using had some files imported twice. As a result, all the counts reported by PostgreSQL were doubled.

=== Using TallyEngine ===
* The database used is the same one described in the section above: '''Shigella_flexneri_20151208'''
* Notice in the image below that there is an error in the cells. It turns out that we did not even need to add the Ordered Locus since that was the default. We will definitely need to do one last build in order to fix that issue.
<div class="center">
[[Image:Shigella flexneri tallyEngine results build 2.png]]
</div>

=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine ===
<div class="center">
[[Image:regex1_OTS.png]]
</div>

<div class="center">
[[Image:regex2_OTS.png]]
</div>

* When added together, the result becomes 7566 + 3 = 7569.

=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine ===
* The following command in PostgreSQL resulted in 7567 entries:
select value from genenametype where type = 'ordered locus' and value ~ '(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?';
* The following command resulted in 214 entries:
select value from genenametype where type = 'ORF' and value ~ '(CP|SF?)(_p)?[0-9][0-9][0-9][0-9](\.[0-9])?';

=== OriginalRowCounts Comparison ===
<div class="center">
[[Image:Ms access originalrowcounts.png]]
</div>
* The OrderedLocusNames row seems to report on the same number of IDs as our previous builds

=== Visual Inspection ===
Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** Yes, there are dates present for GeneOntology, InterPro, GeneID, RefSeq, UniProt, EMBL, PDB, Pfam, OrderedLocusNames, and EnsemblBacteria.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** Yes, all of them seem to follow the same format (there are roughly 3 variations of the IDs in each of the tables).

=== Excel Inspection ===
* [[Media:In-search-of-the-missing-ids.xlsx| Excel file]]

== Export Information (final) ==
* Date: 12/14/15
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151208.gdb | Sf-Std 20151208.gdb]]'''
* Time taken to export: '''1 hour, 38 minutes, 42 seconds '''
** Start time: '''9:35 PM PDT'''
** End time: ''' PM PDT'''

Troque: Creating this page

{{Template:Troque}}

== Export Information ==
Version of GenMAPP Builder:
* ''' gmbuilder-3.0.0-build-5 '''

Computer on which export was run:
* ''' Front of the room, 3rd computer from the right. '''

Postgres Database name:
* '''Shigella_flexneri_20151208'''

UniProt XML filename (give filename and upload and link to compressed file):
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]): '''UniProt release 2015_11'''
* UniProt XML download link: '''[http://www.uniprot.org/uniprot/?query=proteome:UP000001006 Click here]'''
* Time taken to import: '''4.43 minutes'''
** Note:

GO OBO-XML filename (give filename and upload and link to compressed file):
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped): '''Version created on 11/19/2015 (at 2:24 AM)'''
* GO OBO-XML download link: '''[http://archive.geneontology.org/latest-termdb/go_daily-termdb.obo-xml.gz Click here to download].'''
* Time taken to import: ''' minutes'''
* Time taken to process: ''' minutes'''
** Note:

GOA filename (give filename and upload and link to compressed file):
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''Version released on .'''
* GOA download link: '''[http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/103.S_flexneri_301.goa Click here to download].'''
* Time taken to import: ''' minutes'''
** Note:

Name of .gdb file (give filename and upload and link to compressed file): '''[[ | ]]'''
* Time taken to export: ''' '''
** Start time: ''' PM PDT'''
** End time: ''' PM PDT'''
** Note:

{{Template:Troque:Journal}}

Gene Database Testing Report - Oregon Trail Survivors

2015-12-08T22:41:31Z

Troque: /* Export Information for Build with Coder Changes */ Added Build 2

Troque Week 14

2015-12-08T22:40:51Z

Troque: /* Build 2 */ Updated Build 2

{{Template:Troque}}
== Running New Builds ==
=== Build 1 ===
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: ''' 4 hours, 10 minutes, 46 seconds '''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** Note:

=== Build 2 ===
Name of .gdb file: [[Media:Sf-Std 20151207.gdb | Sf-Std 20151207.gdb]]
* Date: ''' 12/7/15 '''
* Time taken to export: ''' 4 hours, 24 minutes and 1 second '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note:

== Important Files ==
* [[Media: Shigella flexneri results.txt | Text file written from Match]]
* [[Media:Shigella flexneri OrderedLocusNames OTS 20151201.xlsx | Ordered Locus Names from Microsoft Access]]

== Identifying the Gene IDs ==
* Regular expression: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* Observations:
** To reduce the number of matches, we had to add the end tag "</name>" to our regular expression. This brought the number of matches down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were not being caught. To account for these, we had to add the genes with IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This led us to 7566 gene IDs.
** When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to include the genes of the form SF?####/SF?####. The 2 extra IDs that TallyEngine did not count are not actually supposed to be separated, since the formatting suggests that the paired IDs are interchangeable. When the .gdb file was created, these names appear to have been split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I wasn't able to hit the exact number output by TallyEngine since other genes with the same format were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF#### and CP#### instead of S#### so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>

'''FOR THE FULL REPORT ON IDENTIFYING THE ID, VISIT THE [[Gene_Database_Testing_Report_-_Oregon_Trail_Survivors | GENE DATABASE TESTING REPORT PAGE]].'''

== Reflection ==
# What worked?
#* What worked in identifying the gene IDs was to export the .gdb file into Excel and compare it with what the OrderedLocusNames table had (from Microsoft Access). Doing this made it easier to find which genes were missing from the .gdb file and to look them up in the UniProt XML file. With the Excel file comparing the lists of gene IDs and the CTRL+F shortcut, I was also able to discern which tags to include in the new builds of the database. Because of this, I was able to confirm that some genes indeed do not exist in the XML file, while only a couple exist within the "dbReference" tag.
# What didn't work?
#* What didn't work is using Match multiple times without thinking. Even when I was trying to match the number of gene IDs with what Tally Engine gives me, Match didn't really help me in identifying where to find the genes in the XML file.
# What will I do next to fix what didn't work?
#* To fix what didn't work, I would use Match in conjunction with the XML file, or just use the Excel method exclusively, since it was more helpful than Match in finding the necessary tags.
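
The list comparison described in this reflection can also be scripted instead of done by hand in Excel. The following is a minimal sketch using Python sets; the variable names and all IDs are hypothetical placeholders, not values from the actual spreadsheets.

```python
# Hypothetical inputs: IDs the GenMAPP users could not find, and IDs
# present in the exported OrderedLocusNames table.
reported_by_users = {"SF0001", "SF0002", "SF9999"}
in_gdb = {"SF0001", "SF0002", "S0003"}

# IDs that were asked about but are absent from the gene database.
missing = sorted(reported_by_users - in_gdb)

# IDs in the gene database that nobody asked about (useful as a sanity check).
unreferenced = sorted(in_gdb - reported_by_users)

print("Missing from .gdb:", missing)
print("In .gdb but not reported:", unreferenced)
```

Each ID in the missing list would still need to be looked up in the UniProt XML file to see which tag, if any, contains it.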

{{Template:Troque_Journal}}

File:Sf-Std 20151207.gdb

2015-12-08T22:40:20Z

Troque:

Troque Week 14

2015-12-08T22:38:09Z

Troque: /* Running New Builds */ Updated End time

{{Template:Troque}}
== Running New Builds ==
=== Build 1 ===
Name of .gdb file (give filename and upload and link to compressed file): '''[[Media:Sf-Std 20151201.gdb | Sf-Std_20151201.gdb]]'''
* Time taken to export: ''' 4 hours, 10 minutes, 46 seconds '''
** Start time: '''4:19:22 PM PDT'''
** End time: ''' 8:30:08 PM PDT'''
** Note:

=== Build 2 ===
Name of .gdb file:
* Date: ''' 12/7/15 '''
* Time taken to export: ''' '''
** Start time: '''9:13:45 PM PDT'''
** End time: ''' 1:37:46 AM PDT'''
** Note:

== Important Files ==
* [[Media: Shigella flexneri results.txt | Text file written from Match]]
* [[Media:Shigella flexneri OrderedLocusNames OTS 20151201.xlsx | Ordered Locus Names from Microsoft Access]]

== Identifying the Gene IDs ==
* Regular expression: <code>(CP|SF?)[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>
* Observations:
** To reduce the number of matches, we had to add the end tag "</name>" to our regular expression. This brought the number of matches down from over 8000 to 7517. Since TallyEngine reported 7567, this means that 50 IDs were not being caught. To account for these, we had to add the genes with IDs of the form CP#### (there were 50 instances of these) and those of the form SF####.# or S####.#. This led us to 7566 gene IDs.
** When I looked at the IDs in Microsoft Access, they totaled 7569. To account for this last piece of gene formatting, we also had to include the genes of the form SF?####/SF?####. The 2 extra IDs that TallyEngine did not count are not actually supposed to be separated, since the formatting suggests that the paired IDs are interchangeable. When the .gdb file was created, these names appear to have been split at the "/".
** In other words, there are 3 ordered locus names with formatting that is different from the rest: SF2223/SF2224, S2352/S2353, and S3359/S3360.
** I wasn't able to hit the exact number output by TallyEngine since other genes with the same format were already caught by the patterns SF#### or S####.
** Note: It turns out the ShiBASE database only uses the pattern SF#### and CP#### instead of S#### so the regular expression would really have to be just <code>SF?[0-9][0-9][0-9][0-9](\.[0-9])?(/|</name>)</code>

'''FOR THE FULL REPORT ON IDENTIFYING THE ID, VISIT THE [[Gene_Database_Testing_Report_-_Oregon_Trail_Survivors | GENE DATABASE TESTING REPORT PAGE]].'''

== Reflection ==
# What worked?
#* What worked in identifying the gene IDs was to export the .gdb file into Excel and compare it with what the OrderedLocusNames table had (from Microsoft Access). Doing this made it easier to find which genes were missing from the .gdb file and to look them up in the UniProt XML file. With the Excel file comparing the lists of gene IDs and the CTRL+F shortcut, I was also able to discern which tags to include in the new builds of the database. Because of this, I was able to confirm that some genes indeed do not exist in the XML file, while only a couple exist within the "dbReference" tag.
# What didn't work?
#* What didn't work is using Match multiple times without thinking. Even when I was trying to match the number of gene IDs with what Tally Engine gives me, Match didn't really help me in identifying where to find the genes in the XML file.
# What will I do next to fix what didn't work?
#* To fix what didn't work, I would use Match in conjunction with the XML file, or just use the Excel method exclusively, since it was more helpful than Match in finding the necessary tags.

{{Template:Troque_Journal}}

Oregon Trail Survivors

2015-12-08T05:58:32Z

Troque: /* Reflection */ Edited Trixie's reflection

<div style="text-align: center; font-size: 250%; line-height: 1.25em">'''Oregon Trail Survivors'''</div>

<div class="center">
[[Image:Oregon-trail-dysentery 5 biodb.jpg | thumb | right | 350px | The third leading cause of death in the Oregon Trail.]]
</div>

== Group Members ==
*Coder: [[User:Jwoodlee | Jake Woodlee]]
*Quality Assurance: [[User:Troque | Trixie Roque]]
*GenMAPP Users: [[User:Eyanosch | Erich Yanoschik]] & [[User:Kzebrows | Kristin Zebrowski]]
* Project Manager: [[User:Kzebrows | Kristin Zebrowski]]

{{Template:Oregon Trail Survivors}}

=== Presentation (QA/Coder) ===
* PDF can be seen [[Media: Genome Paper Presentation BioDB.pdf | here]]

===Group Meeting Times===
Thursday, November 5th at 8:00 pm

== Goals ==
Over the upcoming weeks our group will be investigating ''Shigella flexneri''.

====Week 10====

# Find genome sequence paper
# Find 4-8 microarray data and paper that goes with the genome paper
# Compile team page and create a ranked annotated bibliography

====Week 11====

#Prepare for journal club presentations in Weeks 12 and 13
#Begin initial tasks on research project

Click on username links for more information regarding each team member's contributions for Week 11.

[[Jwoodlee Week 11 | Jake]]: Read through the genome paper and tried to get through the accessible things I had the ability to understand. Made an outline for the genome paper. Worked on the presentation with Trixie and found a database. And of course I answered the assigned questions.

[[Troque Week 11 | Trixie]]: Mainly focused on the Genome paper presentation with Jake. This includes searching for a viable database that we will be using for the rest of the group assignment and actually creating the presentation we will be doing for October 17th, 2015. I've also updated our group page to reflect what Dr. Dahlquist suggested would improve our team page.

[[Eyanosch Week 11 | Erich]]: Analyzed the microarray paper in order to describe the experimental design of the microarray data, treatments, number of replicates, and dye swaps. Worked with Kristin to produce the PowerPoint for the GenMAPP Users presentation at Journal Club. Worked on the individual journal entry and created an outline of the microarray paper.

[[Kzebrows Week 11 | Kristin]]: Using the team's selected microarray paper I developed an outline including background information, experimental outline/methods and how samples corresponded to the data, a brief description of the results, and a discussion including the implications of the research and its results in comparison to previous studies. Using this outline, I created a flow chart corresponding to the research. I also worked with Erich in order to create a PowerPoint for the Journal Club presentation on Nov. 24.

==== Week 12 ====
#QA will be doing an initial database export.
#Coder will be setting up version control.
#GenMAPP users will compile the raw data from the microarray file to prepare for normalization and statistical analysis (will begin if time permits after consultation with Dr. Dahlquist). Additionally, the GenMAPP users will be determining the number of biological or technical replicates and how samples were labeled.
#Coder and QA will present on genome paper in class Tuesday, Nov. 24.

Click on username links for more information regarding each team member's contributions for Week 12.
* [[Jwoodlee Week 12 | Jake]]: Set up my environment in Eclipse, created the s-flexneri branch, created my own copy of GenMAPP that I can modify for later use, and cloned the repository with the Git commands.
* [[Troque Week 12 | Trixie]]: Finished the preliminary export of the XML and GOA files and the corresponding Gene Testing Report. Also started identifying the gene IDs for the species. Decided on file management system with Jake.
* [[Eyanosch Week 12 | Erich]]: Worked with Kristin in determining the total number of biological and technical replicates. Compiled the raw data for RP samples, specifically the ID and Log ratio columns. Incorporated the RP and RX data into one spreadsheet with Kristin's data. We created a table of the sample data and the file each corresponds with, and also figured out there were no dye swaps in the experiment (the control was the Cy3 dye and the treatment the Cy5 dye).
* [[Kzebrows Week 12 | Kristin]]: Determined that there were 3 biological replicates per treatment for 6 treatments total. Compiled raw data for RX samples by re-naming columns for ID and Log Ratio and putting into same worksheet, which was later combined with Erich's worksheet for RP samples. Erich and I met and worked together to create a table of which samples correspond to which file.

===Week 14===
#QA will be documenting the IDs using MATCH, Postgres, Microsoft Access, and Excel and will get a head start on Milestone 3, which is customizing the TallyEngine.
#Coder will determine and document any modified export behavior that the GenMAPP Builder will have and resolve bugs. Coder will also work with QA by uploading GM Builder for additional export.
#GenMAPP Users will perform statistical analysis in Excel (normalization, tests) and format the data for import into GenMAPP. Users will also import data into GenMAPP and run MAPPFinder, and then document these test runs.

Click on username links for more information regarding each team member's contributions for Week 14.
* [[Jwoodlee Week 14 | Jake]]: Finished the custom GenMAPP Builder, committed it to GitHub, and ran the export with the custom software. This created a custom .gdb, which was opened in Microsoft Access and GenMAPP to check for accuracy.
* [[Troque Week 14 | Trixie]]: Trixie has finished identifying the gene IDs using MATCH, Postgres, Microsoft Access, and Excel. It was discovered that some IDs are in "dbReference/property&type&gene ID", and so another export was done on 12/7/15 to add the newly discovered gene IDs.
* [[Eyanosch Week 14 | Erich]]:
* [[Kzebrows Week 14 | Kristin]]: This week Erich and I made corrections from the talk page and normalized log ratios for the slides in the experiment. I completed the statistical analysis for RX samples and calculated the Bonferroni p value correction. I also performed a sanity check for the RX samples and, based on that, calculated the Benjamini & Hochberg p value correction for RX-1-30, which had the most statistically significant changes in gene expression. I also formatted and exported the file for GenMAPP, downloaded the database, and attempted to create color sets to run the data set through MAPPFinder.
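
For reference, the two p value corrections mentioned above can be sketched in Python. This is only an illustration of the standard Bonferroni and Benjamini & Hochberg (step-up) formulas, not the Excel worksheet the team actually used. The function names and the example p values are hypothetical, and the Excel version may differ in detail (for example, it may not enforce the monotonicity step).

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1.0."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted p values in the original order."""
    n = len(p_values)
    # Indices of the p values, sorted from smallest to largest p.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so that an adjusted value
    # never exceeds the adjusted value of the next-larger p value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for five genes (not taken from the dataset).
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
bonf = bonferroni(example_p)
bh = benjamini_hochberg(example_p)
```

With these made-up inputs, Bonferroni is the stricter of the two: it multiplies every p value by the full number of tests, while Benjamini & Hochberg scales each p value by the number of tests divided by its rank.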

==== Reflection ====

Each team member should reflect on the team's progress:
# What worked?
# What didn't work?
# What will I do next to fix what didn't work?

''Kristin'':
#What worked in terms of communication is having a group text. We also meet at least once a week outside of class in order to work together on the assignments and make sure we are all on the same page. So far, this has allowed us to troubleshoot and address bugs together as a team quickly.
#After creating the initial compiled raw data file, I had to make several corrections before the file could be run through GenMAPP. First, I had to get rid of the ".", and I also had to replace every #DIV/0! with a space character for the file to be read at all. In addition, we were unable to find all of the b#### and CP#### gene IDs in UniProt or ShiBASE. Finally, after creating my color set and trying to run MAPPFinder, I tried three computers and all of them crashed with the "not responding" message.
#I will communicate with the QA and Coder in order to create a database with a minimal number of "Gene ID not found" errors and then communicate with Erich when we try to run our dataset through MAPPFinder. Once the gene database is re-customized and the export is complete, I can try to re-run my dataset to see if that makes a difference.
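
The #DIV/0! cleanup Kristin describes was done by hand in Excel, but it could also be scripted. The following is a minimal Python sketch, assuming a tab-delimited export. The column names, gene IDs, and values are hypothetical, and it does not cover the separate "." correction.

```python
def clean_line(line, bad_token="#DIV/0!", replacement=" "):
    """Replace spreadsheet error cells in one tab-delimited line."""
    cells = line.split("\t")
    return "\t".join(replacement if cell == bad_token else cell for cell in cells)


# Hypothetical export; the header, IDs, and values are made up.
raw_lines = [
    "ID\tLogRatio\tPValue",
    "SF0001\t1.25\t0.003",
    "SF0002\t#DIV/0!\t#DIV/0!",
]

# Each error cell becomes a single space, as in the manual fix.
cleaned_lines = [clean_line(line) for line in raw_lines]
```

Only cells that consist entirely of the error token are replaced, so a legitimate value is never altered by accident.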

'' Trixie '':
# What worked?
#* What worked in identifying the gene IDs was to export the .gdb file into Excel and compare it with what the OrderedLocusNames table had (from Microsoft Access). Doing this made it easier to find which genes were not found in the .gdb file and to look for them in the UniProt XML file. With the Excel file comparing the lists of gene IDs and using the CTRL+F shortcut, I was also able to discern which tags to include in the new builds for the databases. Because of this, I was able to confirm that some genes indeed do not exist in the XML file, while only a couple exist within the "dbReference" tag. In terms of group work, what worked was posting all our files onto a single page as we progressed through the assignment. Night meetings were also helpful for communicating better with the rest of my group.
# What didn't work?
#* What didn't work was using MATCH multiple times without thinking. Even when I was trying to match the number of gene IDs with what TallyEngine gave me, MATCH didn't really help me identify where to find the genes in the XML file. Waiting for the database to finish didn't help much either, since our builds would take more than 4 hours to finish.
# What will I do next to fix what didn't work?
#* What I would do next to fix what didn't work is to actually use MATCH in conjunction with the XML file, or just use the Excel method completely, since that was actually more helpful in finding the necessary tags than the MATCH method. I would probably have to time myself to check the lab after about 4.5 hours, since one of our builds lasted that long.
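
The list comparison described above (IDs exported from the .gdb file versus IDs in the OrderedLocusNames table) was done in Excel. As a rough illustration of the same idea, here is a minimal Python sketch using set differences. The function name and all IDs are hypothetical, and the only cleanup assumed is trimming whitespace and skipping blank cells.

```python
def compare_id_lists(exported_ids, reference_ids):
    """Report which IDs appear in only one of the two lists."""
    # Trim stray whitespace and ignore blank cells before comparing.
    exported = {i.strip() for i in exported_ids if i.strip()}
    reference = {i.strip() for i in reference_ids if i.strip()}
    return {
        "missing_from_export": sorted(reference - exported),
        "only_in_export": sorted(exported - reference),
        "shared_count": len(exported & reference),
    }


# Hypothetical IDs, not taken from the actual gene database.
report = compare_id_lists(
    exported_ids=["SF0001", "SF0002 ", ""],
    reference_ids=["SF0001", "SF0002", "CP0003"],
)
```

The IDs listed under "missing_from_export" are the ones worth searching for in the UniProt XML file to see which tag, if any, contains them.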

==Overview of Genome Paper==
*Used the genome sequencing article to perform a prospective search in the [https://apps.webofknowledge.com/UA_GeneralSearch_input.do?product=UA&search_mode=GeneralSearch&SID=1FRKcNxUgxiGX6spITI&preferencesSaved= Web of Science] database.
*Overview of the search:
**How many articles does this article cite? 37
**How many articles cite this article? 303
**Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
***Now that the genome has been sequenced, a majority of research has been done on discovering which genes are responsible for virulence and pathogenesis as well as potential antibiotics. Genomic research is also focused on how ''S. flexneri'' has been able to develop resistance to multiple drugs. Furthermore, ''Shigella'' is suspected to have evolved from ''Escherichia coli'' so a lot of research has been done in how and when pathogenic ''Shigella'' split from ''E. coli'' on the evolutionary tree.

==Annotated Bibliography==
=== Genome Paper ===
Jin, Q., Yuan, Z., Xu, J., Wang, Y., Shen, Y., Lu, W., … Yu, J. (2002). Genome sequence of ''Shigella flexneri'' 2a: insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157. Nucleic Acids Research, 30(20), 4432–4441.
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/?term=Genome+sequence+of+Shigella+flexneri+2a%3A+insights+into+pathogenicity+through+comparison+with+genomes+of+Escherichia+coli+K12+and+O157
* PubMed Central: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC137130/
* Publisher Full Text (HTML): http://nar.oxfordjournals.org/content/30/20/4432.full
* Publisher Full Text (PDF): http://nar.oxfordjournals.org/content/30/20/4432.full.pdf+html
* Copyright: 2002 Oxford University Press
* Publisher: Oxford University Press
* Availability: in print and online
* Did LMU pay a fee for this article: no

===Microarray Paper===

Fu H, Liu L, Zhang X, Zhu Y, Zhao L, Peng J, et al. (2012) Common Changes in Global Gene Expression Induced by RNA Polymerase Inhibitors in ''Shigella flexneri''. PLoS ONE 7(3): e33240. doi:10.1371/journal.pone.0033240

*The link to the [http://www.ncbi.nlm.nih.gov/pubmed/22428000 abstract]
*The link to the [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299763/ full text of the article] in PubMed Central
*The link to the [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0033240 full text of the article] (HTML format) from the publisher web site.
*The link to the [http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0033240&representation=PDF full PDF version] of the article from the publisher web site.
*Copyright: © 2012 Fu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
*Does the journal own the copyright? NO
*Do the authors own the copyright? Yes
*Do the authors own the rights under a Creative Commons license? Yes
*Is the article available “Open Access”? Yes
*What organization is the publisher of the article? What type of organization is it? PLoS One is the publisher/Journal. It hosts open access research articles. (Public Library of Science)
*Is this article available in print or online only? Online only
*Has LMU paid a subscription or other fee for your access to this article? No, LMU has not paid a subscription or other fee because it is open access on the Public Library of Science.
*Use the genome sequencing article you found to perform a prospective search in the ISI Web of Science/Knowledge database.
**How many articles does this article cite? 25 cited references
**How many articles cite this article? 0 articles cite this article
**Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
***Given that no papers cite this paper, nothing has yet been done to build on this specific topic. In regard to the genome, I think this paper has built on the work of the people who sequenced the first genome of ''Shigella flexneri'' as well as the other microarray papers.
*State which database you used to find the data and article: ArrayExpress
*State what you used as search terms and what type of search terms they were: "shigella flexneri" filtered by organism, experiment type: "rna assay", experiment type: "array assay"
*Give an overview of the results of the search.
**How many results did you get? 7 results returned, with 6 viable options due to the number of assays.
**Give an assessment of how relevant the results were: Very relevant, 6/7 results were viable.
*Link to [http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-32978/?keywords=shigella+flexneri&organism=Shigella+flexneri&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array= microarray data]
*What experiment was performed? What was the "treatment" and what was the "control" in the experiment?
**Antibiotics (RNA Polymerase Inhibitors) were added to ''Shigella flexneri'' in order to see if bacteria became less active. The control was a group of bacteria with no drugs added to them, and the treatment was a group of bacteria with drugs added to them.
*Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each?
**There are two drugs, RX and RP, with 6 samples per drug. The experiment was run 3 times, which yielded 36 assays. I believe that means 3 biological replicates and 12 technical replicates within each experiment, but I am not 100 percent sure.