LMU BioDB 2015 - User contributions [en]

The Class Whoopers

2015-12-22T18:03:01Z

Lenaolufson: /* Week 14 */ added my week 14 answers to the questions

= Team Information & Links =

{{Template:Class Whoopers}}

= Deliverables =
[[Bordetella Pertussis GenMAPP Analysis Deliverables]]

==Presentation Download Links==
*Journal Club
** Genome Paper: [[File:Genomepaper_cw20151116.pdf]]
** Microarray Paper: [[File: Microarray_Journal_Club_Presentation.pdf]]
*'''Final Project'''
**[[File:Bpertussis_findings_powerpoint.pdf]]

==File Naming Protocol==
All file types generated in this project will receive their own unique names composed of two key parts:
#Description
#*This will contain a brief, file-specific description of what content the file contains.
#*Descriptions for different versions of the same file will remain consistent.
#Identifier Tag
#*This tag will be listed as a suffix in the following form: "_cwYYYYMMDD"
#**cw- team name abbreviation
#**YYYYMMDD- date the file was created in the form year/month/day

Additionally, the following file naming best practices will be observed when creating descriptions for new files:
*Our species will be referred to consistently as "bpertussis".
*Spaces will be written as underscores.
*No capitalization will be used.
*No special characters will be used.
*If sequential numbering systems are used, leading zeros will be included for clarity.
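
The naming protocol above can be sketched as a small helper. This is only an illustration, not a tool the team used: the function names <code>build_filename</code> and <code>numbered</code> are hypothetical, and keeping hyphens is an assumption based on existing names such as "bpertussis-std".

```python
import re
from datetime import date

TEAM_TAG = "cw"  # team name abbreviation from the protocol


def build_filename(description: str, created: date, extension: str) -> str:
    """Return '<description>_cwYYYYMMDD.<extension>' per the protocol (hypothetical helper)."""
    # No capitalization; spaces are written as underscores.
    cleaned = description.strip().lower().replace(" ", "_")
    # Drop special characters. Assumption: hyphens are kept, as in "bpertussis-std".
    cleaned = re.sub(r"[^a-z0-9_-]", "", cleaned)
    return f"{cleaned}_{TEAM_TAG}{created:%Y%m%d}.{extension.lstrip('.').lower()}"


def numbered(prefix: str, index: int, width: int = 2) -> str:
    """Append a sequence number with leading zeros (hypothetical helper)."""
    return f"{prefix}{index:0{width}d}"


print(build_filename("Bpertussis Findings Powerpoint", date(2015, 12, 14), "pdf"))
# bpertussis_findings_powerpoint_cw20151214.pdf
print(numbered("chip", 3))
# chip03
```

Generating names this way keeps the description consistent across versions of a file, so that only the date suffix changes.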

=Weekly Updates=
==Week 15==
*'''Goals'''
**'''Assignment due date:''' Midnight Tuesday, December 15
** '''Coder:''' Adjust the GenMAPP Builder code to account for the one EnsemblBacteria reference ID that was missing in our last export; conduct a new import-export cycle to create the (hopefully) final .gdb file; begin characterizing the exported .gdb file in a Gene Database Testing Report; customize the GenMAPP Builder TallyEngine to account for any changes made.
**'''Quality Assurance:''' Reconfigure the TallyEngine with the Coder to accommodate the missing gene IDs that were not exported the previous time. Test the revised database by running TallyEngine count, XMLPipeDB Match, and PostgreSQL. Locate missing gene IDs, if any.
**'''GenMAPP User:''' Import data into GenMAPP, create ColorSets, and run MAPPFinder. Document and take notes on test runs with GenMAPP. Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb. Create a .mapp file showing one pathway that is changed in your data.

*'''Progress'''
**'''Brandon (Coder and Project Manager):''' I began this week by customizing the GenMAPP Builder TallyEngine to report ORF counts for ''Bordetella pertussis'' (see [[Bklein7_Week_15]]). After this, I worked with [[User:Msaeedi23|Mahrad]] to identify the 1 gene ID that was missing in the .gdb file [[File:bpertussis-std_cw20151203.zip]]. I found that this gene was a necessary EnsemblBacteria reference ID and edited the GenMAPP Builder code with the help of [[User:Dondi|Dr. Dionisio]] to include this ID in our next export (see [[Bklein7_Week_15]]). I conducted a complete import-export cycle on 12/10/2015 to create the .gdb file [[File:bpertussis-std_cw20151210.zip]]. I then characterized this export, authoring sections 1-5.2 of its testing report: [[Gene_Database_Testing_Report-_cw20151210]]. During our Sunday meeting, I worked with [[User:Lenaolufson|Lena]] to use this new gene database in GenMAPP. During our Monday meeting, I worked on our PowerPoint presentation: [[File:Bpertussis findings powerpoint.pdf]].
*** [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 22:31, 14 December 2015 (PST)
**'''Mahrad (Quality Assurance):''' I worked closely with the coder [[User: Bklein7|Brandon]] to re-customize TallyEngine to include the 11 missing ORF genes. The specific customizations and the results that followed are detailed in my [[Msaeedi23 Week 15| Week 15 Journal Entry]]. Having located the missing gene IDs, Brandon went into Eclipse to code for them to be included in the export. Following this, we tested our revised gene database to make sure these missing IDs were actually exported. We ran TallyEngine count, which gave a total of 3446 gene IDs, demonstrating that the IDs were now exported. Then we ran XMLPipeDB Match, which gave a total of 3447 gene IDs, one more than the TallyEngine count. Finally, we ran PostgreSQL, which gave a total of 3446 gene IDs. We found that gene "BP3167A" was in the original XML file but was not accounted for in the exported file. With further investigation, we concluded that "BP3167A" is a reference ID from EnsemblBacteria and corresponds to the same ID as "BP3167.1", which was exported.
**'''Lena (GenMAPP User):''' I imported the data into GenMAPP and then created color sets in order to run MAPPFinder. I obtained the ontology results and did some background research on how the top results related to the microarray article. I then used KEGG pathways for my specific organism to create two separate MAPPs, one for the ribosome and one for the nitrogen cycle.

*'''Meetings!'''
**This week, our group used class work sessions to coordinate our work:
***Tuesday, December 8, 2:40 - 4:00
***Thursday, December 10, 2:40 - 4:00
**In addition, we scheduled meetings outside of class to work on the final PowerPoint Presentation and deliverables for our project:
***Sunday, December 13, 7:00 PM - 1:00 AM
***Monday, December 14, 2:00 PM - 11:00 PM

==Week 14==

*'''Goals'''
**'''Assignment due date:''' Midnight Tuesday, December 8
**'''Coder:''' Create the custom species profile for ''Bordetella pertussis'', run an export using the customized version of GenMAPP Builder, add further customizations to the custom species profile as appear necessary, and run a second export using the further customized version of GenMAPP Builder.
**'''Quality Assurance:''' Identify gene IDs that are missing in the first custom export, work with the coder to classify these IDs, configure the Tally Engine, and complete a gene database testing report for the second custom export.
**'''GenMAPP User:''' Complete the statistical analysis of the data, format the data for import into GenMAPP, and coordinate with the coder/QA to import this data into GenMAPP using the custom gene database.

*'''Progress & Reflection'''
**'''Brandon (Coder and Project Manager):''' This week, I focused on creating and customizing the species profile for ''Bordetella pertussis'' in GenMAPP Builder, the details of which can be found in my [[Bklein7 Week 14| Week 14 Journal Entry]]. I documented the first export I conducted using a custom ''Bordetella pertussis'' species profile here: [[Gene Database Testing Report- cw20151201]]. I demonstrated that the custom species information implemented in this export worked as intended, but Mahrad and I identified 11 ORF genes that failed to export. I updated the ''Bordetella pertussis'' species profile to account for these ORF genes and conducted a new export, detailed here: [[Gene Database Testing Report- cw20151203]]. Mahrad analyzed the exported .gdb file. In addition to this, I kept tabs on my fellow group members to keep us on track to accomplish our long-term project goals in a timely manner.
***What worked?
****Thus far, we have exported two versions of the ''Bordetella pertussis'' gene database that were created using modified versions of GenMAPP Builder. Both custom exports worked as intended. The first one simply created the ''Bordetella pertussis'' custom class. However, we identified 11 ORF genes conforming to the unique patterns "BP####A" and "BP####B" that warranted inclusion in the gene database. Exporting ORF gene IDs is a common issue other custom classes appear to have had, so implementing this fix was very straightforward in practice.
***What didn't work?
****Although all of the changes we implemented to GenMAPP Builder worked as intended, we have yet to produce a comprehensive gene database for ''Bordetella pertussis''. The most recent export included 11 ORF genes that we thought encompassed the only IDs with the patterns "BP####A" and "BP####B". However, we found that there is one more relevant gene ID in the UniProt XML file that conforms to the pattern "BP####A" and was not imported. We will have to find a way to export this ID as well.
***What will I do next to fix what didn't work?
****Next, I will confer with Drs. Dahlquist and Dionisio to come up with a strategy for isolating the one missing EnsemblBacteria reference ID and exporting it into our final gene database. After this is done, I will characterize the database for completeness and work on further modifying the TallyEngine. Hopefully, these steps will generate a complete gene database so that we can transition to working on our final deliverables.
*** [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 13:39, 7 December 2015 (PST)
**'''Mahrad (Quality Assurance):''' This week, as QA, I worked directly with Brandon to do the initial data exports. The work is summarized here: [[Msaeedi23 Week 14| Week 14 Journal Entry]]. Next, we meticulously characterized regular expression patterns to detect discrepancies in extracting the data from the original samples. In the following week, I will focus on configuring the Tally Engine for our specific species, which may take some time and coding assistance from Brandon. Once the Tally Engine has been configured, Lena can proceed with GenMAPP processing. Week 14 reflection:
# What worked?
#*We were able to use the various counting systems to detect the total number of gene IDs that were imported into our gdb file. Through our investigation, Brandon and I came to find four specific missing IDs.
# What didn't work?
#*There were four ID inconsistencies detected to be missing in our gdb file. We were able to target the specific IDs that were missing and now the code will have to be changed to incorporate these missing IDs in our database.
# What will I do next to fix what didn't work?
#*Work more closely with Brandon to ensure the Tally Engine is configured properly and that we can properly import and obtain confirmation that all the gene IDs were imported successfully.
**'''Lena (GenMAPP User):''' This week, I made progress on performing the statistical analysis of the data to prepare it for GenMAPP. I posted my progress for each of the class working sessions on my [[Lenaolufson Week 14| Week 14 Journal Entry]], updating the Excel data sheets after each session. Dr. Dahlquist helped me figure out a problem with the original raw data that was causing the values to be very skewed. I then sent her my updated data sheet, and she used a program to separate the duplicates of the chips. After she sent me back the data with the sorted values, I performed the statistical analysis on the data. The most recent version of the file can be found on my Week 14 journal entry linked above.
#What worked?
#*I was able to perform the correct statistical alterations to the data in order to prepare it for submission to Dr. Dahlquist, who ran it through her program to split the data, since there are duplicates of the genes. I had little trouble while working in Excel and following the protocol from the ''Vibrio cholerae'' exercise, and I was able to adapt the protocol to fit my own data. Since there were a lot of columns with the dye swaps, I was careful to stay organized and give my columns appropriate and easily identifiable names so that I would not get confused. It was important for me to be meticulous, as I was the only GenMAPP user in my group and did not have another person to check my work with.
#What didn't work?
#*This week I faced a challenge when I finished my calculations in Excel because my values for the averages and standard deviations (and thus many other columns) were much too large. After looking over the data with Dr. Dahlquist, we saw that some of the values were recorded as 100000 or -100000, throwing the calculations way off. Upon detecting this problem, I went back to the original raw data I downloaded from the microarray site to check whether this was an error in their data or a result of my own work. I found that the large numbers were included in the raw data, so with the assistance of Dr. Dahlquist I deleted these large numbers from my data, which solved the problem.
#What will I do next to fix what didn't work?
#*As I described above, I figured out that the large 100000 and -100000 numbers came from the original raw data I downloaded, so they were not an error in my own calculations in Excel. I went into my data and replaced all of the large numbers with a blank cell, which solved my problem: the resulting values were more plausible and fell within the expected range.
[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 19:54, 7 December 2015 (PST)
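
The clean-up described above was done by hand in Excel. The same idea can be sketched in a few lines of Python, shown here only as an illustration: the function names and the example values are made up, and only the placeholder values 100000 and -100000 come from the journal entry.

```python
from statistics import mean, stdev

# Placeholder values reported in the raw data (from the journal entry above).
SENTINELS = {100000.0, -100000.0}


def blank_sentinels(values):
    """Replace placeholder values with None, the equivalent of a blank cell."""
    return [None if v in SENTINELS else v for v in values]


def summarize(values):
    """Return (mean, standard deviation) while ignoring blank cells."""
    kept = [v for v in values if v is not None]
    return mean(kept), stdev(kept)


# Hypothetical log ratios for one gene across four chips (illustrative only).
raw = [0.5, -0.3, 100000, 0.1]
cleaned = blank_sentinels(raw)

print(cleaned)  # [0.5, -0.3, None, 0.1]
print(summarize([v for v in raw]))  # dominated by the placeholder value
print(summarize(cleaned))           # reflects only the real measurements
```

Blanking the placeholders rather than replacing them with zero keeps them out of both the average and the standard deviation.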

*'''Meetings!'''
**This week, our group used class work sessions to coordinate our work:
***Tuesday, December 1, 2:40 - 4:00
***Thursday, December 3, 2:40 - 4:00
*** Monday, December 7, 10:30 - 12 am

==Week 12==
*'''Goals'''
**'''Assignment due date:''' Midnight Tuesday, November 24
**'''Coder:''' Set up a GitHub repository clone of the XMLPipeDB project on your development device, the development rig, and the initial as-is build for gmbuilder. Complete an import-export cycle in association with QA.
**'''Quality Assurance:''' Complete an import-export cycle for the 1st ''Bordetella pertussis'' gene database. Complete a Gene Database Testing Report for this export.
**'''GenMAPP Users:''' Create a Master Raw Data file that contains the IDs and columns of data required for further analysis. Consult with Dr. Dahlquist on how to process the data (normalization, statistics).

*'''Progress'''
**'''Brandon (Quality Assurance and Interim Coder):''' This week, I focused on completing an import-export cycle for our first ''Bordetella pertussis'' gene database: [[File:Bpertussis-std cw20151119.zip]]. With my QA hat on, I imported the appropriate data, exported the gene database, and documented the gene database creation and counting protocol here: [[Gene Database Testing Report- cw20151119]]. With my Coder hat on, I followed the instructions on the [[Coder| Coder Guild Page]] to set up a GitHub repository clone of the XMLPipeDB project on my personal laptop, the Eclipse developer rig, and the initial as-is build for gmbuilder. The electronic lab notebook for my QA and Coder work is on my [[Bklein7 Week 12| Week 12 Page]]. Finally, I wrote a PowerPoint presentation on our genome sequencing paper, which is also linked on my [[Bklein7 Week 12| Week 12 Page]].
***[[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:48, 23 November 2015 (PST)
**'''Lena (GenMAPP):''' I downloaded the correct data sample files from those provided on the microarray paper page. The files were unzipped and prepared for import into Excel. In Excel, the data was arranged into a spreadsheet containing all of the gene IDs from the different samples, with their appropriate columns, for analysis. Corrections and further manipulation of the data will continue in the coming week in order to create the desired dataset for export from Excel. [[File:Bpertussis CompiledRawData MS2015.xlsx]]
***[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 17:33, 23 November 2015 (PST)
**'''Mahrad (GenMAPP--> Quality Assurance)''': This week I downloaded the six data sample files provided by the microarray paper. The process is detailed in my [[Msaeedi23 Week 12| Week 12 Journal Entry]]. Files were unzipped, imported into Excel, and manipulated to form a single spreadsheet containing all gene IDs from the different samples. Each sample was placed in its respective column to be further analyzed and manipulated in the upcoming week. Following this, I assumed the position of Quality Assurance to cover for Nicole's absence.
** '''Nicole''' was absent this week. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:52, 23 November 2015 (PST)

*'''Meetings!'''
** Monday, November 23: Seaver 120- Brandon and Lena met to work on the GenMAPP testing of the gene IDs from our database.

==Week 11==
*'''Goals'''
** For all:
*** Outline your assigned paper on your user page and include a list of 10 defined terms from the paper.
**Nicole & Brandon
***Prepare Journal Club presentation on the designated genome sequencing article
***Slides Due: by midnight, Tuesday, November 17
***Presentation Date: Tuesday, November 24
**Lena & Mahrad
***Prepare Journal Club presentation on the designated microarray paper
***Slides Due: by midnight, Tuesday, November 17
***Presentation Date: Tuesday, November 17

*'''Progress'''
**Nicole Anguiano (Coder): Nicole was absent this week for a medical emergency and is (hopefully) getting some much deserved rest. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
**Brandon Klein (QA): This week I made several edits to the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers Class Whoopers Team Page] in accordance with the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Week_11 Week 11 assignment]. These edits included the following: revising the Class Whoopers template, reorganizing the Team Page structure, commenting out unneeded articles in the annotated bibliography, creating the new bibliography entry as requested by Dr. Dahlquist, and writing the naming conventions for our files. Additionally, I outlined our genome sequencing paper for ''Bordetella pertussis'' and assessed the [http://www.genedb.org/Homepage/Bpertussis GeneDB MOD] on my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Bklein7_Week_11#Identifying_the_Bordetella_Pertussis_MOD Week 11 Individual Journal Entry]. A preliminary draft of the genome sequencing paper presentation, which I will likely be presenting solo, was uploaded there. Finally, I kept tabs on group members as the interim Project Manager. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
**Lena Olufson (GenMAPP): This week Mahrad and I met up and analyzed the microarray paper together. We split the PowerPoint into two halves; I did the introduction/significance of the study as well as the methods performed. Mahrad and I created our presentation together, working in a Google Doc so we could edit it simultaneously as we discussed it out loud. We also created a flow chart together that shows the experimental design, so the same chart is included in both of our individual assignments. We made sure to check in with the temporary project manager and keep him updated on our progress. [[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 23:24, 16 November 2015 (PST)
**Mahrad Saeedi (GenMAPP): This week Lena and I worked on analyzing the microarray paper and creating an outline. The outline and the detailed process involved in the experiment can be found in my [[Msaeedi23 Week 11| Week 11 Journal Entry]]. We each defined 10 terms separately based upon words we didn't recognize in the article. We then proceeded to produce the PowerPoint presentation for journal club.
[[User:Msaeedi23|Msaeedi23]] ([[User talk:Msaeedi23|talk]]) 23:46, 16 November 2015 (PST)

*'''Meetings!'''
**11/15- Lena & Mahrad met to work on outlining article and answering questions
**11/16- Lena & Mahrad met to prepare powerpoint presentation for journal club

==Week 10==
*'''Goals'''
** For all:
*** Create an annotated bibliography including one genome sequencing paper and two microarray experiments for Bordetella pertussis
*** Create/update team page & compile group annotated bibliography
*** Assignment due date: Midnight Tuesday, November 10

*'''Progress'''
**All group members created annotated bibliographies and compiled them on the newly created group page.

*'''Meetings!'''
** Monday, November 9, 8pm-9pm, Seaver 120

= Annotated Bibliography =
== Genome Sequencing Paper ==

Neither of these papers is the ''first'' to report the genome sequence of ''B. pertussis.'' The paper that you will want to use is [http://www.nature.com/ng/journal/v35/n1/full/ng1227.html this one]. I found it by looking at the introduction and references of the Zhang et al. (2011) paper. For your Week 11 assignment, please remove your annotated bibliography entries for the two papers below and create one for this new paper by Parkhill et al. (2003). You will use the Parkhill paper for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 09:54, 10 November 2015 (PST)''

*Parkhill, J., Sebaihia, M., Preston, A., Murphy, L. D., et al. (2003). Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nature Genetics, 35(1), 32-40. doi:10.1038/ng1227
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/12910271
* PubMed Central: Not available on PubMed Central.
* Publisher Full Text (HTML): http://www.nature.com/ng/journal/v35/n1/full/ng1227.html
* Publisher Full Text (PDF): http://www.nature.com/ng/journal/v35/n1/pdf/ng1227.pdf
* Copyright: ©2003 Nature Publishing Group (information found on PDF version of article). This article is not Open Access, but it is freely available 6 months after publication.
* Publisher: Nature Publishing Group (for-profit).
* Availability: In print and online.
* Did LMU pay a fee for this article: Yes, LMU pays a subscription fee for access to the journal ''Nature Genetics''.

== Microarray Paper ==

This paper is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:04, 10 November 2015 (PST)''

Hoo, R., Lam, J.H., Huot, L., Pant, A., Li, R., Hot, D., & Alonso, S. (2014). Evidence for a Role of the Polysaccharide Capsule Transport Proteins in Pertussis Pathogenesis. PLoS ONE, 9(12):e115243. doi: 10.1371/journal.pone.0115243
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/25501560
* PubMed Central: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264864/
* Publisher Full Text (HTML): http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243
* Publisher Full Text (PDF): http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0115243&representation=PDF
* Copyright: © 2014 Hoo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited (info found [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243 here]).
* Publisher: PLOS ONE (respected open access organization).
* Availability: Online only.
* Did LMU pay a fee for this article: No.
* Web site where the data resides: [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62088 NCBI GEO data]

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T22:19:34Z

Lenaolufson: /* Group Files and Datasets */ updated the mapp file for ribosome

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[Media:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[Media:ReadMe bpertussis-std cw20151210.docx]]
** [[Media:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[Media:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[Media:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[Media:Bpertussis compiledrawdata cw20151218.gex]]
* Exceptions file of data imported into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151218.EX.txt]]
* Raw MAPPFinder results files:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[Media:Bpertussis compiledrawdata cw20151218.gmf]]
* Filtered MAPPFinder Results:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File: Bpertussis ribosomepathway cw20151218.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[Media:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis ribosomepathway cw20151218.mapp

2015-12-18T22:18:41Z

Lenaolufson: Lenaolufson uploaded a new version of File:Bpertussis ribosomepathway cw20151218.mapp

new ribosome mapp with the BHpvalue criteria

Gene Database Testing Report- cw20151210

2015-12-18T22:17:30Z

Lenaolufson: /* Ribosome Kegg Pathway */ added new mapp jpeg of ribosome

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip], a tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB Match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
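The counting step can be approximated outside XMLPipeDB Match. The following is a minimal Python sketch, not the utility's own implementation: it applies the same pattern to a string and reports the total and distinct match counts. The sample text and the function name <code>count_unique_ids</code> are hypothetical and chosen only for illustration.

```python
import re

# Same pattern that was passed to XMLPipeDB Match above. The non-capturing
# group (?:...) makes findall() return whole matches instead of the suffix.
ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def count_unique_ids(text):
    """Return (total matches, number of distinct matches) for the ID pattern."""
    matches = ID_PATTERN.findall(text)
    return len(matches), len(set(matches))


# Hypothetical sample text standing in for the UniProt XML file.
sample = "BP0001 BP0001 BP3167A BP3167.1 BP0910A BP1234B"
total, unique = count_unique_ids(sample)
print(total, unique)
```

Because an ID can appear several times in the XML file, the distinct count (not the total) is the figure to compare against the TallyEngine and SQL results.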

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
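The reason a "union" is used in this query is that it discards duplicate values, so an ID present in both sources is counted once. The sketch below illustrates that behavior with Python's built-in <code>sqlite3</code> module. It is an assumption-laden toy: the tables <code>locus_names</code> and <code>ensembl_ids</code> and their contents are hypothetical stand-ins for the real schema, and SQLite's <code>GLOB</code> operator replaces the PostgreSQL <code>~</code> regular expression operator (<code>GLOB</code> must match the whole value, whereas <code>~</code> matches anywhere in it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified stand-ins for the two sources combined by the query.
cur.execute("CREATE TABLE locus_names (value TEXT)")
cur.execute("CREATE TABLE ensembl_ids (value TEXT)")
cur.executemany("INSERT INTO locus_names VALUES (?)",
                [("BP0001",), ("BP0002",), ("BP0910A",)])
cur.executemany("INSERT INTO ensembl_ids VALUES (?)",
                [("BP0910A",), ("BP3167A",), ("XYZ0001",)])

# UNION (without ALL) removes duplicates, so BP0910A is counted only once.
# The GLOB filter keeps only IDs shaped like BP####A or BP####B.
cur.execute("""
    SELECT count(value) FROM (
        SELECT value FROM locus_names
        UNION
        SELECT value FROM ensembl_ids
        WHERE value GLOB 'BP[0-9][0-9][0-9][0-9][AB]'
    ) AS combined
""")
combined_count = cur.fetchone()[0]
print(combined_count)
conn.close()
```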

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
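The Excel tally above could also be scripted. The following Python sketch groups IDs by the four patterns named in the list and checks that the reported category counts add up to the table total. The sample IDs are illustrative only (for example, "BP1234B" is a hypothetical ID), and the helper names <code>classify</code> and <code>tally</code> are not part of any tool used in this report.

```python
import re
from collections import Counter

# Patterns taken from the list above; fullmatch() requires the whole ID to fit.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def classify(gene_id):
    """Return the label of the first pattern the ID fully matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "other"


def tally(ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in ids)


# Illustrative sample; the real input would be the OrderedLocusNames ID column.
sample_ids = ["BP0001", "BP0002", "BP0910A", "BP3167A", "BP1234B", "BP3167.1"]
counts = tally(sample_ids)
print(counts)

# Category counts reported above for the actual table.
reported = {"BP####": 3434, "BP####A": 11, "BP####B": 1, "BP####.1": 1}
reported_total = sum(reported.values())
print(reported_total)
```

The reported counts sum to 3447, the same figure given for the table as a whole.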

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
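The comparison between the exception IDs and the gene database IDs was done in Excel; the same check can be sketched in Python with a set lookup. This is an illustration under assumptions: the four exception IDs are the examples quoted above, the database IDs are a small sample rather than the full "OrderedLocusNames" table, and <code>missing_from_database</code> is a hypothetical helper name.

```python
def missing_from_database(exception_ids, database_ids):
    """Return the exception IDs that are absent from the database, sorted."""
    database = {gene_id.strip().upper() for gene_id in database_ids}
    exceptions = {gene_id.strip().upper() for gene_id in exception_ids}
    return sorted(gene_id for gene_id in exceptions if gene_id not in database)


# Example exception IDs quoted in this section.
exception_ids = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
# Small sample standing in for the full OrderedLocusNames table.
database_ids = ["BP0001", "BP1123", "BP3167A"]

missing = missing_from_database(exception_ids, database_ids)
print(missing)

# The import counts reported above are also self-consistent.
imported = 3552 - 341
print(imported)
```

If every exception ID is returned as missing, the result agrees with the manual finding that none of the 341 IDs were in the .gdb file.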

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [B-H_Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [B-H_Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expression dataset BHpvalue criteria.png]]

Note: No errors were encountered in the creation of the Color Set.
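The two criteria in the "LogFoldChange" color set amount to a simple rule on two columns. As a minimal Python sketch of that rule (not GenMAPP's own evaluation code), the function below returns which criterion a gene meets. The function name, argument names, and the label "No criteria met" are assumptions made for illustration.

```python
def classify_expression(avg_abc_samples, bh_pvalue):
    """Apply the Increased/Decreased criteria from the LogFoldChange color set."""
    if avg_abc_samples > 0.25 and bh_pvalue < 0.05:
        return "Increased"  # colored red
    if avg_abc_samples < -0.25 and bh_pvalue < 0.05:
        return "Decreased"  # colored green
    return "No criteria met"


# Hypothetical values, not taken from the microarray dataset.
print(classify_expression(0.80, 0.01))
print(classify_expression(-0.60, 0.03))
print(classify_expression(0.80, 0.20))
```

Both comparisons are strict, so a gene with a value of exactly 0.25 or a corrected p value of exactly 0.05 meets neither criterion.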

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the http://www.genome.jp/kegg/ website.
** On the main page of the website, we selected KEGG PATHWAY.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs specific to our organism, which we used to create a MAPP in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color code the genes.
**Here is the final MAPP created for the ribosome pathway:
* [[File: Bpertussis ribosomepathway cw20151218.jpg]]
** Most of the ribosome genes on this MAPP were colored green, indicating a decrease in expression. The remaining genes were gray, meaning that they did not change significantly in this experiment. Thus, every ribosome-pathway gene that changed significantly decreased in expression during the microarray experiment. Ribosomes carry out translation, so decreased expression of ribosomal genes points to a reduced capacity for protein synthesis. The microarray experiment analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. We interpret the decreased expression in the ribosome pathway as linking translation to this down-regulation: without KpsT, the affected genes were not translated at normal levels, the ribosomes were less involved, and expression of the genes mapped in the ribosome pathway decreased.

====Nitrogen Cycle Kegg Pathway====
* We created a second MAPP using the nitrogen metabolism pathway genes listed on the http://www.genome.jp/kegg/ website.
** On the main page of the website, we selected KEGG PATHWAY.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs specific to our organism, which we used to create a MAPP in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color code the genes.
** Here is the final MAPP created for the nitrogen cycle pathway:
* [[File:Finalnitrogencyclebpertussis cw20151218.jpg]]
** This MAPP displayed both red and green genes: green indicates a decrease in expression and red indicates an increase. A couple of gray genes did not meet either criterion. We created this MAPP because nitrogen metabolism is one of the metabolic processes that keep cells alive and reproducing. The red genes had increased expression during the microarray experiment, and the KEGG pathway for nitrogen metabolism shows that these genes specifically aid in the metabolism of glutamate. Glutamate helps provide the energy that cells need to operate correctly. Because the glutamate-related genes that we mapped increased in expression, we infer that glutamate may help supply the energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:GeneontologyresultsBHpvalue.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''
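The totals in this comparison can be checked with a few lines of arithmetic. The Python sketch below simply re-adds the GeneDB counts listed above; the variable names are illustrative.

```python
# Non-protein feature counts retrieved from GeneDB, as listed above.
non_protein_features = {
    "pseudogenes": 359,
    "rRNA": 9,
    "tRNA": 51,
    "snoRNA": 0,
    "snRNA": 0,
    "miscRNA": 0,
}
protein_coding = 3447

non_protein_total = sum(non_protein_features.values())
all_features = protein_coding + non_protein_total
print(non_protein_total, all_features)
```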

==Team Information & Links==

{{Template:Class Whoopers}}

File:Bpertussis ribosomepathway cw20151218.jpg

2015-12-18T22:16:51Z

Lenaolufson: jpg of ribosome mapp with Bhpvalue with correct datset

jpg of ribosome mapp with Bhpvalue with correct datset

Gene Database Testing Report- cw20151210

2015-12-18T22:10:22Z

Lenaolufson: /* Nitrogen Cycle Kegg Pathway */ added new nitrogen cycle BHpvalue criteria jpeg

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB Match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
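
The pattern tally above was done by hand in Excel. As a minimal sketch of the same check, assuming the IDs have been exported to a plain list, the Python below sorts IDs into the four patterns named in the report. The function names and the sample IDs (other than "BP1123" and "BP3167A") are hypothetical and are not taken from the actual table.

```python
import re
from collections import Counter

# The four ID patterns named in the report, checked against the whole ID.
PATTERNS = [
    ("BP####", re.compile(r"BP\d{4}")),
    ("BP####A", re.compile(r"BP\d{4}A")),
    ("BP####B", re.compile(r"BP\d{4}B")),
    ("BP####.1", re.compile(r"BP\d{4}\.1")),
]

def classify(gene_id):
    """Return the label of the pattern that the whole ID matches, else 'other'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "other"

def tally(ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in ids)

# Illustrative IDs only; BP0001B and BP0002.1 are made up for this sketch.
sample_ids = ["BP1123", "BP0101", "BP3167A", "BP0001B", "BP0002.1", "bp1123"]
counts = tally(sample_ids)
print(counts)
```

Any ID that lands in the "other" bucket would deserve a closer look, since the report expects every entry to fit one of the four patterns.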

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
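
The Excel comparison described above amounts to a set difference. The following is a minimal sketch of that idea, assuming both ID lists are available as plain Python lists; the function name is hypothetical, and the short "gdb" list is a stand-in for the full "OrderedLocusNames" table rather than real exported data.

```python
def missing_ids(exception_ids, gdb_ids):
    """Return the exception IDs that do not appear among the gene database IDs."""
    gdb = {gene_id.strip() for gene_id in gdb_ids}
    exceptions = {gene_id.strip() for gene_id in exception_ids}
    return sorted(exceptions - gdb)

# The four example exception IDs quoted in the report.
exceptions = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
# Stand-in for the OrderedLocusNames table (two IDs mentioned in the report).
gdb_sample = ["BP1123", "BP3167A"]

absent = missing_ids(exceptions, gdb_sample)
print(absent)
print(len(absent) == len(exceptions))  # True when none of the exceptions are in the table
```

Run against the full lists, an output equal in length to the exceptions list would reproduce the finding that none of the 341 IDs were present in the .gdb file.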

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [B-H_Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [B-H_Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expression dataset BHpvalue criteria.png]]

Note: No errors were encountered in the creation of the Color Set.
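
To make the two criteria concrete, here is a small Python sketch of the same logic outside GenMAPP. It is an illustration only, not GenMAPP's implementation: the function name is hypothetical, and returning "gray" for genes that meet neither criterion is an assumption based on how unchanged genes appear on the MAPPs in this report.

```python
def color_for(avg_abc_samples, bh_p_value):
    """Apply the 'Increased' and 'Decreased' criteria from the LogFoldChange color set."""
    if bh_p_value < 0.05:
        if avg_abc_samples > 0.25:
            return "red"    # Increased
        if avg_abc_samples < -0.25:
            return "green"  # Decreased
    return "gray"           # assumed color when neither criterion is met

# Made-up values for illustration.
print(color_for(0.40, 0.01))
print(color_for(-0.40, 0.01))
print(color_for(0.40, 0.20))
```

Both comparisons are strict, so a gene with a value of exactly 0.25 or a B-H p value of exactly 0.05 meets neither criterion.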

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome KEGG Pathway====
* We created a MAPP of the ribosome pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was entered into the GenMAPP MAPP using its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
**Here is the final MAPP created for the ribosome pathway:
* [[File:Bpertussis ribosomepathway cw20151215.jpg]]
** Most of the ribosomal genes on this MAPP were colored green, indicating a significant decrease in expression. The remaining genes were gray, indicating no significant change in this experiment; none showed a significant increase. Ribosomes carry out translation, so reduced expression of ribosomal genes is expected to limit the cell's capacity to synthesize proteins. The microarray experiment revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this result. We interpret it as a link between translation and the down-regulated genes from the experiment: in cells lacking KpsT, expression of the translation machinery appears to fall along with the other affected genes.

====Nitrogen Cycle KEGG Pathway====
* We created a second MAPP using the nitrogen metabolism pathway genes provided by the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was entered into the GenMAPP MAPP using its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
** Here is the final MAPP created for the nitrogen cycle pathway:
* [[File:Finalnitrogencyclebpertussis cw20151218.jpg]]
** This MAPP displayed both red and green genes: green indicates a significant decrease in expression and red a significant increase. A couple of gray genes did not meet either criterion. We created the nitrogen cycle MAPP because nitrogen metabolism is among the metabolic processes that keep cells alive and reproducing. The genes colored red had increased expression during the microarray experiment, and according to the KEGG pathway for nitrogen metabolism, these genes specifically aid in the metabolism of glutamate. Glutamate helps provide the energy that cells need to operate correctly. Because the glutamate-related genes that we mapped were increased, we infer that glutamate may help supply the energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:GeneontologyresultsBHpvalue.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''
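
The total above can be re-derived from the per-category counts reported by GeneDB. The short Python sketch below simply repeats that arithmetic; the variable names are hypothetical.

```python
# Counts reported from the GeneDB "Gene Type" searches in this section.
NON_PROTEIN_FEATURES = {
    "pseudogene": 359,
    "rRNA": 9,
    "tRNA": 51,
    "snoRNA": 0,
    "snRNA": 0,
    "miscRNA": 0,
}
PROTEIN_CODING = 3447

non_protein_total = sum(NON_PROTEIN_FEATURES.values())
print(non_protein_total)                   # 419
print(PROTEIN_CODING + non_protein_total)  # 3866
```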

==Team Information & Links==

{{Template:Class Whoopers}}

File:Finalnitrogencyclebpertussis cw20151218.jpg

2015-12-18T22:09:20Z

Lenaolufson: nitrogen cycle mapp with Bhpvalue criteria

nitrogen cycle mapp with Bhpvalue criteria

Gene Database Testing Report- cw20151210

2015-12-18T22:05:25Z

Lenaolufson: /* Creating a New Color Set */ added new BHpvalue criteria screenshot

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**The browser opened this text file automatically instead of downloading it, so we saved the file manually.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.
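
The elapsed export time follows directly from the recorded start and end times. As a quick check, the Python sketch below recomputes it. It assumes both times fall on the same day, and the function name is hypothetical.

```python
from datetime import datetime

def elapsed_minutes(start, end, fmt="%I:%M %p"):
    """Return the minutes between two same-day clock times such as '1:19 AM'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

export_minutes = elapsed_minutes("1:19 AM", "2:11 AM")
print(export_minutes)  # 52.0
```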

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify matches of gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
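
For readers without the XMLPipeDB match utility, the idea of counting unique matches can be sketched in Python using the same regular expression. This is an illustration under stated assumptions, not a description of how XMLPipeDB match works internally: the function name is hypothetical, and the XML fragment is made up rather than copied from the UniProt file.

```python
import re

# Same pattern as in the command above, written with a non-capturing group.
PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")

def unique_matches(text):
    """Return the set of distinct strings in the text that match the gene ID pattern."""
    return {match.group(0) for match in PATTERN.finditer(text)}

# Made-up fragment; BP1123 appears twice but is counted once.
sample_xml = (
    '<name type="ordered locus">BP1123</name>'
    '<property type="gene ID" value="BP3167A"/>'
    '<name type="ordered locus">BP1123</name>'
)

matches = unique_matches(sample_xml)
print(sorted(matches))
print(len(matches))
```

Because the suffix alternatives are tried in order, an ID such as "BP3167A" is captured with its suffix rather than as a bare "BP3167".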

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
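
The prefix breakdown of the RefSeq table can be reproduced with a short tally. The Python sketch below assumes the IDs are available as a list of strings. The function name is hypothetical, and apart from "NP_881255" the sample IDs are made up. The last lines check that the reported counts add up to the table size.

```python
from collections import Counter

def prefix_counts(ids):
    """Count IDs by the text before the first underscore (e.g. 'NP_')."""
    counts = Counter()
    for refseq_id in ids:
        head, sep, _ = refseq_id.partition("_")
        counts[head + "_" if sep else "other"] += 1
    return counts

# Illustrative IDs; only NP_881255 appears in this report.
sample = ["NP_881255", "YP_000001", "WP_000000001", "WP_000000002"]
sample_counts = prefix_counts(sample)
print(sample_counts)

# Counts reported for the RefSeq table in this section.
reported = {"NP_": 3410, "YP_": 7, "WP_": 3210}
reported_total = sum(reported.values())
print(reported_total)  # 6627
```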
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [B-H_Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [B-H_Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expression dataset BHpvalue criteria.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome KEGG Pathway====
* We created a MAPP of the ribosome pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was entered into the GenMAPP MAPP using its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
**Here is the final MAPP created for the ribosome pathway:
* [[File:Bpertussis ribosomepathway cw20151215.jpg]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were gray because they did not change significantly in this experiment. In other words, every ribosome gene that met a criterion showed decreased expression during the microarray experiment. Ribosomes carry out translation, so without them the cell cannot produce the proteins that its genes encode. The microarray analysis revealed that the absence of KpsT, a membrane-associated protein in ''B. pertussis'', resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome genes is consistent with this result and links translation to the down-regulated genes from the experiment: with less translation taking place, fewer ribosomes are needed, which corresponds to the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We also created a MAPP using the nitrogen cycle pathway genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** The final MAPP for the nitrogen cycle pathway is shown in this screenshot:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression. A couple of gray genes did not meet either criterion. We created this MAPP because nitrogen metabolism is among the metabolic processes that keep cells alive and reproducing. The red genes had increased expression during the microarray experiment, and according to the KEGG nitrogen metabolism pathway, these genes participate specifically in the metabolism of glutamate. Glutamate helps provide the energy that cells need to operate correctly. Because the glutamate-related genes that we mapped showed increased expression, this suggests that glutamate plays a role in supplying the energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:GeneontologyresultsBHpvalue.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''
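
The total stated above can be checked by summing the GeneDB counts reported for each category. This is a minimal Python sketch with the counts copied from the list above; the variable names are illustrative only.

```python
# Counts of non-protein-coding features reported by GeneDB in the list above.
non_protein_counts = {
    "pseudogene": 359,
    "rRNA": 9,
    "tRNA": 51,
    "snoRNA": 0,
    "snRNA": 0,
    "miscRNA": 0,
}

non_protein_total = sum(non_protein_counts.values())
print(non_protein_total)  # 419
```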

==Team Information & Links==

{{Template:Class Whoopers}}

File:Expression dataset BHpvalue criteria.png

2015-12-18T22:04:08Z

Lenaolufson: new color set criteria for BHpvalue

new color set criteria for BHpvalue

Gene Database Testing Report- cw20151210

2015-12-18T22:01:54Z

Lenaolufson: /* Running MAPPFinder */ new Go results screnshot with BHpvalue

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip], a tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous ''Bordetella pertussis'' gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
 cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
 java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
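
The idea behind this check, counting the distinct strings that match the gene ID pattern, can be approximated with Python's <code>re</code> module. The sketch below is not the XMLPipeDB Match implementation. The function name <code>unique_matches</code> is hypothetical, and the XML fragment and the gene IDs in it are invented for illustration.

```python
import re

# Same regular expression that was passed to XMLPipeDB match above.
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](A|B|\.1|)")


def unique_matches(text):
    """Return the set of distinct substrings of `text` matching the pattern."""
    return {m.group(0) for m in GENE_ID_PATTERN.finditer(text)}


# Invented fragment for illustration; not copied from the UniProt XML file.
sample = (
    '<name type="ordered locus">BP0001</name>'
    '<name type="ORF">BP0002A</name>'
    '<name type="ORF">BP0003B</name>'
    '<name type="ordered locus">BP0004.1</name>'
    '<name type="ordered locus">BP0001</name>'  # repeat, counted once
)
ids = unique_matches(sample)
print(len(ids), sorted(ids))
```

Because the alternation ends with an empty branch, a plain "BP####" ID matches even when no suffix follows it, while the suffixed forms are captured whole.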

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

 select count(value) from (select value from genenametype where type =
 'ordered locus' union select value from propertytype inner join dbreferencetype
 on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
 where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
 'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
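
To show how the "union" in this query removes duplicates before counting, here is a small self-contained sketch using Python's built-in <code>sqlite3</code> module. It is an illustration under several assumptions: the three tables are reduced to only the columns the query uses, all rows are invented, and SQLite's <code>GLOB</code> operator stands in for the PostgreSQL <code>~</code> operator. <code>GLOB</code> must match the whole value, whereas <code>~</code> matches anywhere in the string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genenametype (value TEXT, type TEXT);
    CREATE TABLE dbreferencetype (hjid INTEGER, type TEXT);
    CREATE TABLE propertytype (
        dbreferencetype_property_hjid INTEGER, type TEXT, value TEXT);

    -- Invented rows for illustration only.
    INSERT INTO genenametype VALUES
        ('BP0001', 'ordered locus'),
        ('BP0002', 'ordered locus'),
        ('BP0002A', 'ORF');
    INSERT INTO dbreferencetype VALUES
        (1, 'EnsemblBacteria'), (2, 'EnsemblBacteria'),
        (3, 'EnsemblBacteria'), (4, 'RefSeq');
    INSERT INTO propertytype VALUES
        (1, 'gene ID', 'BP0002A'),
        (2, 'gene ID', 'BP0002A'),  -- duplicate, removed by UNION
        (3, 'gene ID', 'BP0001'),   -- no A/B suffix, excluded by the pattern
        (4, 'gene ID', 'BP0009B');  -- not EnsemblBacteria, excluded
""")

query = """
    SELECT count(value) FROM (
        SELECT value FROM genenametype WHERE type = 'ordered locus'
        UNION
        SELECT value FROM propertytype
        INNER JOIN dbreferencetype
            ON (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
        WHERE dbreferencetype.type = 'EnsemblBacteria'
          AND propertytype.type = 'gene ID'
          AND propertytype.value GLOB 'BP[0-9][0-9][0-9][0-9][AB]'
    ) AS combined;
"""
(total,) = conn.execute(query).fetchone()
print(total)
conn.close()
```

In this toy data the combined set is BP0001, BP0002, and BP0002A, so each gene ID is counted once even if it appears in more than one row.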

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conform to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID is the form in which UniProt lists "BP3167A".
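
The pattern tally described for the OrderedLocusNames table was done in Excel. An equivalent check can be sketched in Python as shown below. The names <code>pattern_of</code> and <code>tally</code> and the sample IDs are hypothetical and serve only to illustrate the approach.

```python
import re
from collections import Counter

# One anchored regular expression per ID pattern named in the list above.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def pattern_of(gene_id):
    """Return the label of the pattern that the whole ID matches."""
    for label, regex in PATTERNS:
        if regex.fullmatch(gene_id):
            return label
    return "unexpected"


def tally(gene_ids):
    """Count how many IDs fall under each pattern."""
    return Counter(pattern_of(gene_id) for gene_id in gene_ids)


# Invented IDs for illustration only.
sample_ids = ["BP0001", "BP0002", "BP0003A", "BP0004B", "BP0005.1", "XY1234"]
counts = tally(sample_ids)
print(counts)
```

Applied to the full table, the four counts reported above (3434, 11, 1, and 1) sum to the 3447 entries, and any ID reported as "unexpected" would warrant a closer look.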

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
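
The Excel comparison between the exception IDs and the gene database amounts to a set difference. The Python sketch below assumes both ID lists have already been read into memory. The function name <code>missing_ids</code> and the database IDs shown are hypothetical, while the four dataset IDs are the example exceptions quoted above.

```python
def missing_ids(dataset_ids, database_ids):
    """Return the dataset IDs that do not occur in the gene database."""
    return sorted(set(dataset_ids) - set(database_ids))


# Example exception IDs quoted in the list above.
dataset_ids = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
# Hypothetical stand-in for the IDs in the OrderedLocusNames table.
database_ids = ["BP0001", "BP0002", "BP0003"]

absent = missing_ids(dataset_ids, database_ids)
print(absent)
```

If every exception ID appears in the result, as it does here, none of them is present in the gene database.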

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
# After entering both criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
** The final MAPP for the ribosome pathway is shown here:
* [[File:Bpertussis ribosomepathway cw20151215.jpg]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were gray because they did not change significantly in this experiment. In other words, every ribosome gene that met a criterion showed decreased expression during the microarray experiment. Ribosomes carry out translation, so without them the cell cannot produce the proteins that its genes encode. The microarray analysis revealed that the absence of KpsT, a membrane-associated protein in ''B. pertussis'', resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome genes is consistent with this result and links translation to the down-regulated genes from the experiment: with less translation taking place, fewer ribosomes are needed, which corresponds to the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We also created a MAPP using the nitrogen cycle pathway genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** The final MAPP for the nitrogen cycle pathway is shown in this screenshot:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression. A couple of gray genes did not meet either criterion. We created this MAPP because nitrogen metabolism is among the metabolic processes that keep cells alive and reproducing. The red genes had increased expression during the microarray experiment, and according to the KEGG nitrogen metabolism pathway, these genes participate specifically in the metabolism of glutamate. Glutamate helps provide the energy that cells need to operate correctly. Because the glutamate-related genes that we mapped showed increased expression, this suggests that glutamate plays a role in supplying the energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:GeneontologyresultsBHpvalue.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein-coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''

==Team Information & Links==

{{Template:Class Whoopers}}

File:GeneontologyresultsBHpvalue.png

2015-12-18T22:01:26Z

Lenaolufson: new Go results with the BHpvalue

new Go results with the BHpvalue

Gene Database Testing Report- cw20151210

2015-12-18T22:00:07Z

Lenaolufson: /* Ribosome Kegg Pathway */

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
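The counting step above can be approximated outside of XMLPipeDB Match. The following is a minimal Python sketch, not the XMLPipeDB Match implementation: it applies the same regular expression to a short, hypothetical XML snippet (the element layout shown is illustrative only) and reports unique and total matches.

```python
import re
from collections import Counter

# Same pattern as passed to XMLPipeDB Match above: "BP" plus four digits,
# optionally followed by "A", "B", or ".1".
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def count_matches(text):
    """Return a Counter mapping each matched ID to its number of occurrences."""
    return Counter(m.group(0) for m in GENE_ID_PATTERN.finditer(text))


# Hypothetical snippet standing in for the UniProt XML file (assumption).
sample = """
<name type="ordered locus">BP0001</name>
<name type="ORF">BP0910A</name>
<property type="gene ID" value="BP3167A"/>
<name type="ordered locus">BP3167.1</name>
<name type="ordered locus">BP0001</name>
"""

counts = count_matches(sample)
print(len(counts), "unique matches")
print(sum(counts.values()), "total matches")
```

Counting unique IDs rather than total matches matters here because the same gene ID can appear more than once in the XML file.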

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
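To illustrate why "union" yields a count of unique IDs, here is a small Python sketch using an in-memory SQLite database. The two tables and their contents are hypothetical, heavily simplified stand-ins for the PostgreSQL tables in the query above, and SQLite's GLOB operator is swapped in for PostgreSQL's "~" operator, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical simplified tables (assumption): one for ordered locus names,
# one for EnsemblBacteria gene ID properties.
cur.execute("CREATE TABLE locus_names (value TEXT)")
cur.execute("CREATE TABLE ensembl_gene_ids (value TEXT)")
cur.executemany("INSERT INTO locus_names VALUES (?)",
                [("BP0001",), ("BP0002",), ("BP0910A",)])
cur.executemany("INSERT INTO ensembl_gene_ids VALUES (?)",
                [("BP0910A",), ("BP3167A",), ("BP0003",)])

# GLOB must match the whole value, unlike "~", which matches substrings.
TEMPLATE = """
SELECT count(value) FROM (
    SELECT value FROM locus_names
    {operator}
    SELECT value FROM ensembl_gene_ids
    WHERE value GLOB 'BP[0-9][0-9][0-9][0-9][AB]'
) AS combined
"""

# UNION removes duplicate rows; UNION ALL keeps them.
unique_count = cur.execute(TEMPLATE.format(operator="UNION")).fetchone()[0]
all_count = cur.execute(TEMPLATE.format(operator="UNION ALL")).fetchone()[0]
print(unique_count, all_count)
conn.close()
```

In this toy data, "BP0910A" appears in both tables, so "UNION" counts it once while "UNION ALL" counts it twice.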

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conform to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
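The pattern tally above was done in Excel. As a rough sketch of the same idea, the Python below sorts IDs into the four patterns and counts each group. The sample list is illustrative only: "BP0001" and "BP0000B" are hypothetical IDs, and the function names are our own.

```python
import re
from collections import Counter

# One regular expression per ID pattern described above.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def classify(gene_id):
    """Return the label of the pattern the whole ID matches, or "other"."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "other"


def tally(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in gene_ids)


# Illustrative sample; "BP0001" and "BP0000B" are hypothetical IDs.
sample_ids = ["BP0001", "BP1123", "BP0910A", "BP3167A",
              "BP0000B", "BP3167.1", "bp1123"]
result = tally(sample_ids)
for label, count in sorted(result.items()):
    print(label, count)
```

An "other" bucket is included so that any ID outside the expected patterns is surfaced rather than silently dropped.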

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
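The comparison between the exception IDs and the gene database IDs was done in Excel. A minimal Python sketch of the same membership check is given below. The four exception IDs are the examples listed above; the database ID list is a short hypothetical stand-in for the "OrderedLocusNames" table, and the function name is our own.

```python
def split_by_membership(exception_ids, database_ids):
    """Split exception IDs into those absent from and present in the database."""
    # Stripping whitespace is an assumption about how IDs copy out of Excel.
    database = {gene_id.strip() for gene_id in database_ids}
    absent = sorted(g for g in exception_ids if g.strip() not in database)
    present = sorted(g for g in exception_ids if g.strip() in database)
    return absent, present


# Example exception IDs taken from the report above.
exception_ids = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
# Hypothetical stand-in for the OrderedLocusNames table contents.
database_ids = ["BP0001", "BP1123", "BP3167A"]

absent, present = split_by_membership(exception_ids, database_ids)
print(len(absent), "absent;", len(present), "present")
```

If any exception ID were present in the database, it would appear in the second list and point to a problem other than a missing ID.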

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.
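The two criteria above can be expressed as a small function. This Python sketch is only an illustration of the logic, not GenMAPP's implementation; the function name and the "No criteria met" label are our own choices.

```python
def classify_expression(avg_abc_samples, p_value):
    """Apply the "Increased" and "Decreased" criteria from the LogFoldChange color set."""
    # [Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05
    if avg_abc_samples > 0.25 and p_value < 0.05:
        return "Increased"
    # [Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05
    if avg_abc_samples < -0.25 and p_value < 0.05:
        return "Decreased"
    # Neither criterion applies.
    return "No criteria met"


# Hypothetical (average log fold change, p value) pairs.
examples = [(0.8, 0.01), (-0.5, 0.01), (0.8, 0.2), (0.25, 0.01)]
labels = [classify_expression(value, p) for value, p in examples]
print(labels)
```

Both comparisons are strict, so a gene with a value of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.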

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs specific to our organism. We then created a MAPP from these genes in GenMAPP.
** Each green-highlighted gene on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
**Here is a picture of the final MAPP created for the ribosome pathway:
* [[File:Bpertussis ribosomepathway cw20151215.jpg]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression. The remaining genes were gray, meaning their expression did not change significantly in this experiment. Expression of the genes in the ribosome category therefore decreased overall during the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes is consistent with reduced protein synthesis in the cell. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. We interpret the decreased expression of the ribosome pathway genes as linked to this down-regulation: with many genes repressed in the absence of KpsT, less translation took place, and expression of the ribosomal genes mapped here decreased accordingly.

====Nitrogen Cycle Kegg Pathway====
* We also created a MAPP of the nitrogen cycle pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs specific to our organism. We then created a MAPP from these genes in GenMAPP.
** Each green-highlighted gene on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** Here is a screenshot of the final MAPP created for the nitrogen cycle pathway:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression. A couple of gray genes did not meet either criterion. We created this nitrogen cycle MAPP because nitrogen metabolism is among the metabolic processes that keep cells alive and reproducing. The genes shown in red had increased expression during the microarray experiment, and the KEGG pathway for nitrogen metabolism shows that these genes take part in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that cells need to operate correctly. Since expression of the glutamate-related genes we mapped increased, we infer that glutamate may help supply the energy that allows ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein-coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''

==Team Information & Links==

{{Template:Class Whoopers}}

Gene Database Testing Report- cw20151210

2015-12-18T21:59:51Z

Lenaolufson: /* Ribosome Kegg Pathway */ added new picture of mapp with BHpvalue criteria

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.
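
The elapsed export time can be double-checked from the recorded start and end times. This is only a sketch: the function name is hypothetical, and it assumes both times fall on the same day.

```python
from datetime import datetime


def elapsed_minutes(start: str, end: str) -> int:
    """Minutes between two same-day clock times written like '1:19 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print(elapsed_minutes("1:19 AM", "2:11 AM"))  # 52
```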

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB Match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
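
For readers without XMLPipeDB Match at hand, the same idea (collect every match of the pattern, then count the unique ones) can be sketched in Python. This is an illustration only and not how XMLPipeDB Match is implemented; the sample text is made up.

```python
import re

# Same pattern as in the command above; the group is made non-capturing
# so that each match is the whole gene ID.
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def unique_matches(text):
    """Return the set of distinct gene IDs found in the text."""
    return {m.group(0) for m in GENE_ID_PATTERN.finditer(text)}


# Made-up sample text; a real run would read the UniProt XML file instead.
sample = "BP0001 BP0001 BP3167A BP3167.1 BP1234B"
print(len(unique_matches(sample)))  # 4
```

Because the pattern is not anchored, it also matches inside longer strings, so the count is only meaningful when the input contains no look-alike text.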

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
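
The key property of the query is that "union" removes duplicates, so an ID that appears in both sources is counted once. The sketch below demonstrates this on an in-memory SQLite database. It makes several assumptions: the table <code>ensembl_gene_ids</code> is a hypothetical flattened stand-in for the join of propertytype and dbreferencetype, SQLite's <code>GLOB</code> replaces PostgreSQL's <code>~</code> operator (and, unlike <code>~</code>, must match the whole value), and the sample IDs are invented.

```python
import sqlite3


def count_combined_ids(ordered_locus_ids, ensembl_gene_ids):
    """Count distinct IDs across both sources, mirroring the UNION query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genenametype (value TEXT, type TEXT)")
    # Hypothetical flattened table standing in for the
    # propertytype/dbreferencetype join in the real schema.
    conn.execute("CREATE TABLE ensembl_gene_ids (value TEXT)")
    conn.executemany(
        "INSERT INTO genenametype VALUES (?, 'ordered locus')",
        [(v,) for v in ordered_locus_ids],
    )
    conn.executemany(
        "INSERT INTO ensembl_gene_ids VALUES (?)",
        [(v,) for v in ensembl_gene_ids],
    )
    (count,) = conn.execute(
        """
        SELECT count(value) FROM (
            SELECT value FROM genenametype WHERE type = 'ordered locus'
            UNION
            SELECT value FROM ensembl_gene_ids
            WHERE value GLOB 'BP[0-9][0-9][0-9][0-9][AB]'
        ) AS combined
        """
    ).fetchone()
    conn.close()
    return count


# Invented sample data: BP0500A appears in both sources but is counted once.
print(count_combined_ids(["BP0001", "BP0002", "BP0500A"],
                         ["BP0001", "BP0500A", "BP3167A", "BP0003"]))  # 4
```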

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.
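
The table-by-table comparison could also be done programmatically once both "OriginalRowCounts" tables are exported. The sketch below assumes each table has been read into a dictionary of table name to row count; the function name is hypothetical, and every count except the OrderedLocusNames value of 3447 is an invented placeholder.

```python
def compare_table_lists(benchmark, candidate):
    """Return (tables missing from candidate, tables only in candidate)."""
    missing = sorted(set(benchmark) - set(candidate))
    extra = sorted(set(candidate) - set(benchmark))
    return missing, extra


# Placeholder counts for illustration; only 3447 comes from this report.
benchmark_counts = {"Info": 1, "Systems": 30, "OrderedLocusNames": 3000}
new_counts = {"Info": 1, "Systems": 30, "OrderedLocusNames": 3447}

missing, extra = compare_table_lists(benchmark_counts, new_counts)
print(missing, extra)  # [] []
```

Row counts are expected to differ between species, so only the presence of each table is compared here.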

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
*[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
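
The comparison between the exception IDs and the gene database was done in Excel. A sketch of the same check in Python follows; the function name is hypothetical, the exception IDs are the four examples listed above, and the two database IDs are taken from elsewhere in this report purely for illustration.

```python
def split_by_presence(exception_ids, gdb_ids):
    """Split exception IDs into those found in the .gdb ID list and those absent."""
    known = {g.upper() for g in gdb_ids}  # compare case-insensitively
    found = [e for e in exception_ids if e.upper() in known]
    absent = [e for e in exception_ids if e.upper() not in known]
    return found, absent


exceptions = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
gdb_sample = ["BP1123", "BP3167A"]  # tiny illustrative subset of the table

found, absent = split_by_presence(exceptions, gdb_sample)
print(len(found), len(absent))  # 0 4
```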

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
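
The two criteria can be expressed as a small function for checking individual genes outside GenMAPP. This is a sketch of the logic only; the function name and the "gray" label for genes meeting neither criterion are assumptions.

```python
def color_for(avg_log_fold_change, p_value):
    """Apply the 'Increased' and 'Decreased' criteria of the LogFoldChange color set."""
    if avg_log_fold_change > 0.25 and p_value < 0.05:
        return "red"    # Increased
    if avg_log_fold_change < -0.25 and p_value < 0.05:
        return "green"  # Decreased
    return "gray"       # neither criterion met


print(color_for(0.5, 0.01), color_for(-0.5, 0.01), color_for(0.5, 0.2))
# red green gray
```

Both comparisons are strict, so a gene with an average log fold change of exactly 0.25 meets neither criterion.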

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our specific organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151218" was then applied to color-code the genes.
**Here is a screenshot of the final MAPP created for the ribosome pathway:
* [[File:Bpertussis ribosomepathway cw20151215.jpg]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were grey, indicating no significant change in this experiment. Since no ribosome gene showed increased expression, expression of the ribosome-related genes decreased overall during the microarray experiment. Ribosomes play a key role in translation, and without them gene products cannot be made and cannot perform their proper functions. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. We interpret the decreased expression of the ribosome pathway genes as linked to this global down-regulation: with less translation taking place in cells lacking this protein, the ribosomes were less involved, which is consistent with the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We also created a MAPP using the nitrogen metabolism pathway genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our specific organism. We were then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** Here is a screenshot of the final MAPP created for the nitrogen metabolism pathway:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression. A couple of gray genes did not meet either criterion. We created this nitrogen metabolism MAPP because of the important metabolic processes that keep cells alive and reproducing. The genes shown in red had increased expression during the microarray experiment, and according to the KEGG pathway for nitrogen metabolism, these genes specifically aid in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that allows them to operate correctly. Since the glutamate-related genes that we mapped showed increased expression, this suggests that glutamate helps supply the underlying energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

To assess the completeness of this version of the ''Bordetella pertussis'' gene database, we explored the original genome sequencing data from Parkhill et al. (2003) that was deposited at the [http://www.genedb.org/Homepage/Bpertussis GeneDB Model Organism Database (MOD)]. From the GeneDB Home Page, we accessed a ''Gene Type'' search function that was used to quantify the number of gene listings present under each provided gene category. The results of this investigation are presented below.

====Protein-Coding Genes====
[[File:GDB protein-coding.png]]
*There are 3447 protein-coding genes present in the [http://www.genedb.org/Homepage/Bpertussis GeneDB] database. This result verified that the set of protein-coding genes exported into [[File:Bpertussis-std cw20151210.zip]] from UniProt is complete. No further changes to the gene database export procedures are necessary at this time.

====Non-Protein Genome Features====

#Pseudogenes
#*[[File:GDB_pseudogenes.png]]
#**GeneDB indicated that 359 pseudogenes are present in the ''B. pertussis'' genome. Pseudogenes do not code for proteins and were therefore not included in the original UniProt listing.
#rRNA
#*[[File:GDB_rRNA.png]]
#**GeneDB indicated that 9 genes that encode for rRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#tRNA
#*[[File:GDB_tRNA.png]]
#**GeneDB indicated that 51 genes that encode for tRNA are present in the ''B. pertussis'' genome. These genes do not code for proteins and were therefore not included in the original UniProt listing.
#snoRNA
#*GeneDB retrieved 0 genes that encode for snoRNA.
#snRNA
#*GeneDB retrieved 0 genes that encode for snRNA.
#"miscRNA"
#*GeneDB retrieved 0 genes that encode for "miscRNA".

'''A total of 419 non-protein coding genes were identified in the ''Bordetella pertussis'' genome in addition to the 3447 protein-coding genes captured in our gene database.'''
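
The total can be verified from the per-category counts reported above. The variable name below is hypothetical; the counts are those listed in this section.

```python
# Non-protein gene counts retrieved from GeneDB, as reported above.
NON_PROTEIN_COUNTS = {
    "pseudogene": 359,
    "rRNA": 9,
    "tRNA": 51,
    "snoRNA": 0,
    "snRNA": 0,
    "miscRNA": 0,
}

print(sum(NON_PROTEIN_COUNTS.values()))  # 419
```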

==Team Information & Links==

{{Template:Class Whoopers}}

Lenaolufson Week 15

2015-12-18T21:56:06Z

Lenaolufson: /* Running MAPPFinder */

=12/8/15=
*It was now time to prepare my file for GenMAPP, which I did by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I inserted a new worksheet and named it "forGenMAPP".
* I went back to the "statistics" worksheet and Selected All and Copied.
*I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
** I then deleted the ID columns besides the far left one in column A, and I deleted the second MasterIndex column because it was unnecessary.
** I added a "1" before all of the titles of columns D through I so that none of the columns would have the same names due to the replicates.
* I selected Columns V through Y (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
* I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
* I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
* I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
* I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
*After preparing it for GenMAPP, here are the .xls and .txt files:
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
* Then it was time to perform a sanity check, which I did by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I opened my spreadsheet and went to the "forGenMAPP" tab.
* I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue was less than 0.05.
**p-value less than 0.05: 1923/3552, 54%
**p-value less than 0.01: 1028/3552, 29%
**p-value less than 0.001: 242/3552, 7%
**p-value less than 0.0001: 40/3552, 1%
**p < 0.05 for the Bonferroni-corrected p value: 9/3552, 0.3%
**p < 0.05 for the Benjamini and Hochberg-corrected p value: 1365/3552, 38%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change greater than zero.
**964/3552, 27%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change less than zero.
**959/3552, 27%
*With an average log fold change of > 0.25 and p < 0.05
**874/3552, 25%
*With an average log fold change of < -0.25 and p < 0.05
**848/3552, 24%
*Genes meeting the fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut-off of p < 0.05
**1722/3552, 48%
*I then was ready to run my .txt file in GenMAPP.
*I downloaded the .gdb file from my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers team page] so that I could use it in GenMAPP.
* I opened the Expression Dataset Manager from the Data drop-down list in GenMAPP.
* I selected New Dataset from the Expression Datasets menu and chose the tab-delimited text file formatted for GenMAPP (.txt).
* Upon specifying that all data was numerical, the Expression Dataset Manager converted my data to a .gex file. This process took approximately one minute to complete. In addition to converting the data to a .gex file, an exceptions file (.EX.txt) was also produced, as 342 errors were reportedly detected in the raw data.
** However, there was a problem at this point because the data set had a few mistakes in it.
* I went back to my data sheet and, with the help of Dr. Dahlquist, discovered that some of the values were incorrect, as they displayed #DIV/0!.
** We then replaced all of the #DIV/0! cells with blank cells.
***23 replacements for the #DIV/0!
*I then saved and exported this new .txt file and ran it through GenMAPP again.
* This resulted in fewer errors, and the import otherwise ran smoothly.
**339 errors with new .txt file: [[Media:Errors in GenMAPP.png]]
* I customized the new Expression Dataset by creating a Color Set with instructions to GenMAPP for displaying data on MAPPs. The new Color Set was entitled "LogFoldChange".
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
** The updated .gex file produced by this procedure can be found here: [[File:Bpertussis CompiledRawData MS2015-3.gex]]
*links to files created:
** [[File:Bpertussis CompiledRawData MS2015-3.EX.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.gex]]
** [[Media:MAPPFinder results for geneontologyresultsCriterion1-GOtxt.png]]
** [[Media:Gene ontology results.png]]
** [[Media:Errors in GenMAPP.png]]
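
The sanity-check counts above were obtained with Excel filters. The sketch below shows the same filtering logic and the percentage calculation in Python; the function names are hypothetical, and each row is assumed to be an (average log fold change, unadjusted p value) pair.

```python
def count_hits(rows, p_cutoff=0.05, fc_cutoff=0.25):
    """Count genes passing the p value cut-off with fold change above or below the cut-off."""
    up = sum(1 for fc, p in rows if p < p_cutoff and fc > fc_cutoff)
    down = sum(1 for fc, p in rows if p < p_cutoff and fc < -fc_cutoff)
    return up, down


def percent(count, total):
    """Percentage of the total, rounded to a whole number."""
    return round(100 * count / total)


# Percentages recomputed from the counts reported above.
print(percent(874, 3552), percent(848, 3552), percent(1722, 3552))  # 25 24 48
```

Note that 874 + 848 = 1722, so the combined count agrees with the two separate counts.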

=12/13/15=

* The above steps were repeated because the Coder created a new .gdb file. After downloading the bpertussis-std_cw20151210.gdb file, I obtained the new .txt file and then prepared to import it into GenMAPP.
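
Since the file preparation had to be repeated, the manual spreadsheet steps from 12/8/15 (adding a "SystemCode" column filled with "N", blanking #DIV/0! cells, and saving as tab-delimited text) could be scripted. This is only a sketch: the function name is hypothetical, the ID is assumed to be in the first column, and the sample values are invented.

```python
import csv
import io


def to_genmapp_text(header, rows):
    """Return tab-delimited text with a SystemCode column inserted after the ID column."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0], "SystemCode", *header[1:]])
    for row in rows:
        # Blank out spreadsheet division errors before export.
        cleaned = ["" if cell == "#DIV/0!" else cell for cell in row]
        writer.writerow([cleaned[0], "N", *cleaned[1:]])
    return out.getvalue()


# Invented sample rows for illustration.
text = to_genmapp_text(
    ["ID", "Avg_ABC_Samples", "Pvalue"],
    [["BP0001", "0.31", "0.0123"], ["BP0002", "#DIV/0!", "#DIV/0!"]],
)
print(text)
```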

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

I made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
# I customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*I selected the color for this criterion as red using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*I selected the color for this criterion as green using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* I created a MAPP of the ribosome pathway using the genes provided on the http://www.genome.jp/kegg/ website.
** After accessing the website, I selected KEGG PATHWAY from the main page.
** Next, I scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, I searched for my organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led me to a page showing the ribosome pathway with the gene IDs that pertained to my specific organism. I was then able to create a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
**Here is a screenshot of the final MAPP created for the ribosome pathway:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were grey, indicating no significant change in this experiment. Since no ribosome gene showed increased expression, expression of the ribosome-related genes decreased overall during the microarray experiment. Ribosomes play a key role in translation, and without them gene products cannot be made and cannot perform their proper functions. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. I interpret the decreased expression of the ribosome pathway genes as linked to this global down-regulation: with less translation taking place in cells lacking this protein, the ribosomes were less involved, which is consistent with the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* I was also able to create another mapp using the nitrogen cycle pathway genes provided from the http://www.genome.jp/kegg/ website.
** After accessing the website, I selected KEGG PATHWAY from the main page.
** Next, I scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, I searched for my organism in the drop-down menu at the top of the page, selected the Bordetella pertussis Tohama I organism, and clicked "Go".
** This led me to a page of the nitrogen metabolism pathway with the gene IDs that pertained to my specific organism. I was then able to create a mapp using these genes in GenMAPP.
** Each of the green highlighted genes on the nitrogen metabolism pathway was entered into the GenMAPP mapp using its gene ID and the name given in the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
** Here is the screenshot of the final mapp for the nitrogen cycle pathway created:
* [[File:NitrogencycleGenMAPP.png]]
** This mapp displayed both red and green genes, with green symbolizing a decrease and red symbolizing an increase in expression, as well as a couple of gray genes that did not meet either criterion. This nitrogen cycle mapp was created because of the important metabolic processes, specifically nitrogen metabolism, that keep cells alive and reproducing. The genes displayed in red had increased expression during the microarray experiment, and the KEGG pathway for nitrogen metabolism shows that these genes specifically aid in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that allows cells to operate correctly. Since the glutamate-related genes that we mapped were increased, this suggests that glutamate plays a role in supplying the underlying energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** I launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** I clicked on the button "Calculate New Results" followed by "Find File", at which point I specified the .gex file updated during the creation of the "LogFoldChange" color set.
** I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**I selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

new jpeg of ribosome mapp: [[File:Bpertussis ribosomepathway cw20151215.jpg]]

Lenaolufson Week 15

2015-12-18T21:55:46Z

Lenaolufson: /* 12/13/15 */

File:Bpertussis ribosomepathway cw20151215.jpg

2015-12-18T21:54:50Z

Lenaolufson: jpeg image for ribosome mapp with BHpvalue criteria

jpeg image for ribosome mapp with BHpvalue criteria

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:52:55Z

Lenaolufson: /* Group Files and Datasets */ added new EX file with BHpvalue criteria

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[Media:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[Media:ReadMe bpertussis-std cw20151210.docx]]
** [[Media:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[Media:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[Media:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[Media:Bpertussis compiledrawdata cw20151218.gex]]
* Exceptions file of data imported into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151218.EX.txt]]
* Raw MAPPFinder results files:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[Media:Bpertussis compiledrawdata cw20151218.gmf]]
* Filtered MAPPFinder Results:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151218.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[Media:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis compiledrawdata cw20151218.EX.txt

2015-12-18T21:52:14Z

Lenaolufson: new ex txt file with BHpvalue criteria

new ex txt file with BHpvalue criteria

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:51:16Z

Lenaolufson: /* Group Files and Datasets */ added new gex file for BHpvalue criteria

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[Media:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[Media:ReadMe bpertussis-std cw20151210.docx]]
** [[Media:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[Media:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[Media:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[Media:Bpertussis compiledrawdata cw20151218.gex]]
* Exceptions file of data imported into GenMAPP: [[Media:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[Media:Bpertussis compiledrawdata cw20151218.gmf]]
* Filtered MAPPFinder Results:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151218.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[Media:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis compiledrawdata cw20151218.gex

2015-12-18T21:50:14Z

Lenaolufson: new gex file with BHpvalue criteria

new gex file with BHpvalue criteria

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:48:48Z

Lenaolufson: /* Group Files and Datasets */ added new ribosome mapp with BHpvalue criteria

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[Media:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[Media:ReadMe bpertussis-std cw20151210.docx]]
** [[Media:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[Media:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[Media:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[Media:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[Media:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151218.gmf]]
* Filtered MAPPFinder Results:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151218.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[Media:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis ribosomepathway cw20151218.mapp

2015-12-18T21:48:11Z

Lenaolufson: new ribosome mapp with the BHpvalue criteria

new ribosome mapp with the BHpvalue criteria

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:46:20Z

Lenaolufson: /* Group Files and Datasets */ added new gmf file with the BHpvalue criteria

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[Media:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[Media:ReadMe bpertussis-std cw20151210.docx]]
** [[Media:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[Media:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[Media:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[Media:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[Media:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[Media:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151218.gmf]]
* Filtered MAPPFinder Results:
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx|Increased]]
** [[Media:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[Media:Bpertussis ribosomepathway cw20151215.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[Media:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis compiledrawdata cw20151218.gmf

2015-12-18T21:44:28Z

Lenaolufson: new gmf file with the BHpvalue criteria

new gmf file with the BHpvalue criteria

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:42:02Z

Lenaolufson: /* Group Files and Datasets */ added new excel criterion1 filtered for BHpvalue

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[File:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[File:ReadMe bpertussis-std cw20151210.docx]]
** [[File:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[File:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[File:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[File:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[File:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[File:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt]]
** [[File:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151213.gmf]]
* Filtered MAPPFinder Results:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx]]
** [[File:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151215.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[File:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis mappfinderresults cw20151218-Criterion1-GO - Filtered.xlsx

2015-12-18T21:41:20Z

Lenaolufson: new criterion1 excel for BHpvalue

new criterion1 excel for BHpvalue

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:40:41Z

Lenaolufson: /* Group Files and Datasets */ added new excel criterion0 filtered for BHpvalue

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[File:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database: [[File:ReadMe bpertussis-std cw20151210.docx]]
** [[File:Bpertussis genedatabase schema cw20151210.jpg|Gene Database Schema diagram (also included in ReadMe)]]
* Gene Database Testing Report for final submitted Gene Database: [[File:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[File:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[File:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[File:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[File:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt]]
** [[File:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151213.gmf]]
* Filtered MAPPFinder Results:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx]]
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion1-GO.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151215.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[File:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.xlsx

2015-12-18T21:40:02Z

Lenaolufson: new excel filtered criterion0 with BHpvalue

new excel filtered criterion0 with BHpvalue

File:Bpertussis mappfinderresults cw20151218-Criterion0-GO - Filtered.txt

2015-12-18T21:38:14Z

Lenaolufson: updated version of criterion0 filtered

updated version of criterion0 filtered

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:37:40Z

Lenaolufson: /* Group Files and Datasets */ added new version of criterion0 raw file

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[File:Bpertussis-std cw20151210.zip]]
* '''ReadMe file to accompany the Gene Database'''
** '''Include Gene Database Schema diagram in ReadMe'''
* Gene Database Testing Report for final submitted Gene Database: [[File:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[File:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[File:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[File:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[File:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt]]
** [[File:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151213.gmf]]
* Filtered MAPPFinder Results:
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion0-GO.xlsx|Increased]]
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion1-GO.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151215.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[File:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis mappfinderresults cw20151218-Criterion0-GO.txt

2015-12-18T21:37:06Z

Lenaolufson: updated version of criterion0 file with BHpvalue

updated version of criterion0 file with BHpvalue

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-18T21:36:34Z

Lenaolufson: /* Group Files and Datasets */ updated version of criterion raw file

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species: [[File:Bpertussis-std cw20151210.zip]]
* '''ReadMe file to accompany the Gene Database'''
** '''Include Gene Database Schema diagram in ReadMe'''
* Gene Database Testing Report for final submitted Gene Database: [[File:Gdb testingreport cw20151210.pdf]]
* Processed and analyzed DNA microarray dataset: [[File:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP: [[File:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file: [[File:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP: [[File:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files:
** [[File:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt]]
** [[File:Bpertussis mappfinderresults cw20151213-criterion1-GO.txt|Decreased]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151213.gmf]]
* Filtered MAPPFinder Results:
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion0-GO.xlsx|Increased]]
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion1-GO.xlsx|Decreased]]
* Sample MAPP file of a relevant biological pathway for your species: [[File:Bpertussis ribosomepathway cw20151215.mapp]]
* '''[[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis (''.doc'', ''.docx'', or ''.pdf'')'''
* PowerPoint presentation: [[File:Bpertussis findings powerpoint.pdf]]

==Team Information & Links==
{{Template:Class Whoopers}}

File:Bpertussis mappfinderresults cw20151218-Criterion1-GO.txt

2015-12-18T21:35:32Z

Lenaolufson: updated version of raw criterion1 GO file for BHpvalue

updated version of raw criterion1 GO file for BHpvalue

Lenaolufson Week 15

2015-12-15T05:53:09Z

Lenaolufson: /* Creating a New Color Set */ edited the code for the criteria

Gene Database Testing Report- cw20151210

2015-12-15T05:52:22Z

Lenaolufson: /* Creating a New Color Set */ made minor edit to code for the criteria

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
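
The counting step above can be sketched in Python. This is only an illustration of counting unique regular-expression matches; it is not the XMLPipeDB Match utility, and the sample text, the function name <code>count_matches</code>, and the element layout are hypothetical stand-ins for the real UniProt XML file.

```python
import re
from collections import Counter

# Same pattern as in the command above. (?:...) is a non-capturing group,
# so each match returns the whole gene ID.
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def count_matches(text):
    """Return a Counter mapping each matched gene ID to its number of occurrences."""
    return Counter(m.group(0) for m in GENE_ID_PATTERN.finditer(text))


# Hypothetical sample text, NOT taken from the real UniProt XML file.
sample = (
    '<name type="ordered locus">BP0001</name>'
    '<name type="ordered locus">BP0002</name>'
    '<name type="ORF">BP0101A</name>'
    '<property type="gene ID" value="BP3167A"/>'
    '<property type="gene ID" value="BP0001"/>'
    '<name>BP3167.1</name>'
)

counts = count_matches(sample)
unique_ids = sorted(counts)          # each distinct ID appears once here
total_matches = sum(counts.values())  # repeated IDs are counted every time
print(len(unique_ids), "unique matches,", total_matches, "total matches")
```

The distinction that matters for the validation is between unique and total matches: an ID that appears in several places in the XML file should still be counted only once when comparing against the TallyEngine.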

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
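
To illustrate why "union" gives a count of unique IDs, here is a minimal sketch that runs a query of the same shape against a toy in-memory SQLite database. Everything in it is an assumption made for illustration: the three tables are reduced to the columns the query uses, the rows are invented, and SQLite's <code>glob</code> operator stands in for the PostgreSQL <code>~</code> operator (<code>glob</code> must match the whole value, whereas <code>~</code> matches anywhere in it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical versions of the tables named in the query above.
conn.executescript("""
    create table genenametype (value text, type text);
    create table dbreferencetype (hjid integer primary key, type text);
    create table propertytype (
        dbreferencetype_property_hjid integer, type text, value text);
""")
# Invented rows for illustration only.
conn.executemany("insert into genenametype values (?, ?)", [
    ("BP0001", "ordered locus"),
    ("BP0002", "ordered locus"),
    ("BP0101A", "ORF"),
])
conn.executemany("insert into dbreferencetype values (?, ?)", [
    (1, "EnsemblBacteria"),
    (2, "EnsemblBacteria"),
    (3, "EnsemblBacteria"),
    (4, "RefSeq"),
])
conn.executemany("insert into propertytype values (?, ?, ?)", [
    (1, "gene ID", "BP0101A"),
    (2, "gene ID", "BP3167A"),
    (3, "gene ID", "BP3167A"),  # same ID referenced twice
    (4, "gene ID", "BP9999A"),  # excluded: not an EnsemblBacteria reference
])

query = """
    select count(value) from (
        select value from genenametype where type = 'ordered locus'
        union
        select value from propertytype
        inner join dbreferencetype
            on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
        where dbreferencetype.type = 'EnsemblBacteria'
          and propertytype.type = 'gene ID'
          and propertytype.value glob 'BP[0-9][0-9][0-9][0-9][AB]'
    ) as combined;
"""
unique_count = conn.execute(query).fetchone()[0]
# "union all" keeps duplicates, so it shows what "union" removes.
all_count = conn.execute(query.replace("union", "union all")).fetchone()[0]
print(unique_count, all_count)
```

In this toy data the ID referenced twice is counted once by "union" and twice by "union all", which is why the original query is suitable for comparison against the unique-match count from XMLPipeDB Match.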

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.
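
The table-by-table comparison described above could also be done programmatically rather than by eye. The following is a minimal sketch, assuming the two "OriginalRowCounts" tables have already been exported into dictionaries of table name to row count; the function name <code>compare_tables</code> is hypothetical, only three of the 52 tables are shown, and the benchmark counts are placeholders because they are not given on this page.

```python
def compare_tables(benchmark_counts, new_counts):
    """Return (missing, extra) table names relative to the benchmark."""
    benchmark_tables = set(benchmark_counts)
    new_tables = set(new_counts)
    missing = sorted(benchmark_tables - new_tables)  # in benchmark only
    extra = sorted(new_tables - benchmark_tables)    # in new database only
    return missing, extra


# Row counts for the new database as reported elsewhere in this report.
new_counts = {"UniProt": 3258, "RefSeq": 6627, "OrderedLocusNames": 3447}
# Placeholder values: the benchmark's actual row counts are not given here.
benchmark_counts = {"UniProt": 0, "RefSeq": 0, "OrderedLocusNames": 0}

missing, extra = compare_tables(benchmark_counts, new_counts)
print("missing:", missing, "extra:", extra)
```

Two empty lists would correspond to the result reported above, where every table in the benchmark was also present in the new gene database.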

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are described in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
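
The pattern tally for the "OrderedLocusNames" table was done in Excel. As a rough illustration of the same idea, the Python sketch below sorts IDs into the four patterns listed above and checks that the reported category counts add up to the table totals. The function names and the sample IDs are assumptions made for illustration; the sample is not an excerpt from the actual table.

```python
import re
from collections import Counter

# Patterns described in this report; "#" stands for a single digit.
PATTERNS = [
    ("BP####", re.compile(r"BP\d{4}")),
    ("BP####A", re.compile(r"BP\d{4}A")),
    ("BP####B", re.compile(r"BP\d{4}B")),
    ("BP####.1", re.compile(r"BP\d{4}\.1")),
]


def classify_id(gene_id):
    """Return the label of the first pattern that matches the whole ID."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "unexpected"


def tally_patterns(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify_id(gene_id) for gene_id in gene_ids)


# Illustrative IDs only (assumption), chosen to exercise each branch.
sample_ids = ["BP0001", "BP0002", "BP3167A", "BP0003B", "not_an_id"]
print(tally_patterns(sample_ids))

# The reported category counts add up to the reported table totals.
assert 3434 + 11 + 1 + 1 == 3447  # OrderedLocusNames
assert 3410 + 7 + 3210 == 6627    # RefSeq: NP_ + YP_ + WP_
```

Using <code>fullmatch</code> rather than <code>search</code> matters here: it keeps "BP3167A" from also being counted under the plain "BP####" pattern.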

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
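
The Excel comparison of the exception IDs against the gene database amounts to a set difference. A minimal Python sketch of that check follows, assuming both ID lists have been read into Python lists; the function name <code>find_missing_ids</code> and the short sample lists are hypothetical stand-ins for the full 3552-ID dataset and the "OrderedLocusNames" table.

```python
def find_missing_ids(dataset_ids, gdb_ids):
    """Return the dataset IDs that do not appear in the gene database, sorted."""
    return sorted(set(dataset_ids) - set(gdb_ids))


# Stand-in inputs (assumption): two of the IDs reported as exceptions,
# plus one ID that exists in both lists for contrast.
dataset_ids = ["BP0101", "BP0910A", "BP3167A"]
gdb_ids = ["BP3167A"]

print(find_missing_ids(dataset_ids, gdb_ids))

# Consistency check on the reported totals: imported + exceptions = dataset size.
assert 3211 + 341 == 3552
```

Run against the real lists, a result of 341 IDs would match the number of exceptions reported by GenMAPP.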

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
# After entering these criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
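
For reference, the logic of the two criteria in the "LogFoldChange" color set can be written out as a small function. This is only a sketch of the criteria as stated above, not GenMAPP's own implementation; the function name, parameter names, and the label returned when neither criterion applies are assumptions.

```python
def classify_expression(avg_log_fold_change, p_value,
                        fold_cutoff=0.25, p_cutoff=0.05):
    """Apply the "Increased" and "Decreased" criteria to one gene.

    Mirrors the criteria entered in the Criteria Builder:
      Increased (red):   [Avg_ABC_Samples] > 0.25  AND [Pvalue] < 0.05
      Decreased (green): [Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05
    """
    if p_value < p_cutoff:
        if avg_log_fold_change > fold_cutoff:
            return "Increased"
        if avg_log_fold_change < -fold_cutoff:
            return "Decreased"
    # Neither criterion applies (label chosen for this sketch).
    return "No criteria met"


print(classify_expression(0.8, 0.01))
print(classify_expression(-0.6, 0.03))
print(classify_expression(0.8, 0.20))
```

Both comparisons are strict, so a gene with a value of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.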

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome KEGG Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our organism. We then created a MAPP using these genes in GenMAPP.
** Each green-highlighted gene on the ribosome pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color code the genes.
** Here is a screenshot of the final MAPP for the ribosome pathway:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating a significant decrease in expression; the gray genes were not significantly changed in this experiment. Thus, the expression of genes in the ribosome category generally decreased during the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes would be expected to limit protein production throughout the cell. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this result, linking the translation machinery to the down-regulation observed in the experiment.

====Nitrogen Cycle KEGG Pathway====
* We created a second MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our organism. We then created a MAPP using these genes in GenMAPP.
** Each green-highlighted gene on the nitrogen metabolism pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color code the genes.
** Here is a screenshot of the final MAPP for the nitrogen cycle pathway:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates a decrease in expression and red an increase. A couple of gray genes did not meet either criterion. We created this MAPP because nitrogen metabolism is among the metabolic processes that keep cells alive and reproducing. The genes shown in red had increased expression during the microarray experiment, and according to the KEGG pathway for nitrogen metabolism, these genes specifically aid in the metabolism of glutamate. Glutamate helps provide the energy that cells need to operate correctly. Since the glutamate-related genes that we mapped were increased, glutamate may play a role in supplying the energy that the ''Bordetella pertussis'' strains need to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

The Class Whoopers

2015-12-15T04:30:30Z

Lenaolufson: /* Progress */ added my individual progress for the week

= Team Information & Links =

{{Template:Class Whoopers}}

=Weekly Updates=
==Week 15==
===Goals===
* '''Assignment due date:''' Midnight Tuesday, December 15
* '''Coder:''' Adjust the GenMAPP Builder code to account for the one EnsemblBacteria reference ID that was missing in our last export; conduct a new import-export cycle to create the (hopefully) final .gdb file; begin characterizing the exported .gdb file in a Gene Database Testing Report; customize the GenMAPP Builder TallyEngine to account for any changes made.
* '''Quality Assurance:'''
* '''GenMAPP User:''' Import data into GenMAPP, create ColorSets, and run MAPPFinder. Document and take notes on test runs with GenMAPP. Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb. Create a .mapp file showing one pathway that is changed in your data.

===Progress===
*'''Brandon (Coder and Project Manager):'''
*'''Mahrad (Quality Assurance):'''
*'''Lena (GenMAPP User):''' I was able to import the data into GenMAPP and then I created color sets in order to run MAPPFinder. I obtained the ontology results and did some background research on what exactly the top results related to from the microarray article. I then used Kegg pathways for my specific organism to create two separate MAPPS, one for ribosome and one for the nitrogen cycle.

===Meetings!===
*This week, our group used class work sessions to coordinate our work:
**Tuesday, December 8, 2:40 - 4:00
**Thursday, December 10, 2:40 - 4:00
*In addition, we scheduled meetings outside of class to work on the final PowerPoint Presentation and deliverables for our project:
**Sunday, December 13, 7:00 PM - 1:00 AM
**Monday, December 14, 2:00 PM - _______

==Week 14==

===Goals===
* '''Assignment due date:''' Midnight Tuesday, December 8
* '''Coder:''' Create the custom species profile for ''Bordetella pertussis'', run an export using the customized version of GenMAPP Builder, add further customizations to the custom species profile as appear necessary, and run a second export using the further customized version of GenMAPP Builder.
* '''Quality Assurance:''' Identify gene IDs that are missing in the first custom export, work with the coder to classify these IDs, configure the Tally Engine, and complete a gene database testing report for the second custom export.
* '''GenMAPP User:''' Complete the statistical analysis of the data, format the data for import into GenMAPP, and coordinate with the coder/QA to import this data into GenMAPP using the custom gene database.

===Progress===
*'''Brandon (Coder and Project Manager):''' This week, I focused on creating and customizing the species profile for ''Bordetella pertussis'' in GenMAPP Builder, the details of which can be found in my [[Bklein7 Week 14| Week 14 Journal Entry]]. I documented the first export I conducted using a custom ''Bordetella pertussis'' species profile here: [[Gene Database Testing Report- cw20151201]]. I demonstrated that the custom species information implemented in this export worked as intended, but Mahrad and I identified 11 ORF genes that failed to export. I updated the ''Bordetella pertussis'' species profile to account for these ORF genes and conducted a new export, detailed here: [[Gene Database Testing Report- cw20151203]]. Mahrad analyzed the exported .gdb file. In addition to this, I kept tabs on my fellow group members to keep us on track to accomplish our long-term project goals in a timely manner.
** [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 13:39, 7 December 2015 (PST)
*'''Mahrad (Quality Assurance):''' This week, as QA, I worked directly with Brandon on the initial data exports. The work is summarized here: [[Msaeedi23 Week 14| Week 14 Journal Entry]]. Next, we meticulously characterized regular expression patterns to detect discrepancies in extracting the data from the original samples. In the coming week, I will configure the Tally Engine for our specific species, which may take some time and coding assistance from Brandon. Once the Tally Engine has been configured, Lena can proceed with GenMAPP processing.
*'''Lena (GenMAPP User):''' This week, I made progress on performing the statistical analysis of the data to prepare it for GenMAPP. I posted my progress for each of the class working sessions on my [[Lenaolufson Week 14| Week 14 Journal Entry]] as I updated the Excel data sheets after each session. Dr. Dahlquist helped me figure out a problem with the original raw data that was causing the values to be very skewed. I then sent her my updated data sheet and she was able to use a program to separate the duplicates of the chips. After she sent me back the data with the sorted values, I performed the statistical analysis on the data. The most up-to-date version of the file can be found on my Week 14 journal entry linked above.
[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 19:54, 7 December 2015 (PST)

===Meetings!===
*This week, our group used class work sessions to coordinate our work:
**Tuesday, December 1, 2:40 - 4:00
**Thursday, December 3, 2:40 - 4:00
** Monday, December 7, 10:30 - 12 am

==Week 12==
===Goals===

* '''Assignment due date:''' Midnight Tuesday, November 24
* '''Coder:''' Set up a GitHub repository clone of the XMLPipeDB project on your development device, the development rig, and the initial as-is build for gmbuilder. Complete an import-export cycle in association with QA.
* '''Quality Assurance:''' Complete an import-export cycle for the 1st ''Bordetella pertussis'' gene database. Complete a Gene Database Testing Report for this export.
* '''GenMAPP Users:''' Create a Master Raw Data file that contains the IDs and columns of data required for further analysis. Consult with Dr. Dahlquist on how to process the data (normalization, statistics).

===Progress===
*'''Brandon (Quality Assurance and Interim Coder):''' This week, I focused on completing an import-export cycle for our first ''Bordetella pertussis'' gene database- [[File:Bpertussis-std cw20151119.zip]]. With my QA hat, I imported the appropriate data, exported the gene database, and discussed the gene database creation & counting protocol here- [[Gene Database Testing Report- cw20151119]]. With my Coder hat, I followed the instructions on the [[Coder| Coder Guild Page]] to set up a GitHub repository clone of the XMLPipeDB project on my personal laptop, the Eclipse developer rig, and the initial as-is build for gmbuilder. The electronic lab notebook for my QA and Coder work is present on my [[Bklein7 Week 12| Week 12 Page]]. Finally, I wrote a PowerPoint presentation on our genome sequencing paper, which is linked to on my [[Bklein7 Week 12| Week 12 Page]] as well.
**[[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:48, 23 November 2015 (PST)
*'''Lena (GenMAPP):''' I worked on downloading the correct data sample files from the provided files on the microarray paper page. The files were unzipped and prepared to be imported into Excel. In Excel, the data was manipulated to form a spreadsheet that had all of the gene IDs from the different samples with their appropriate columns to be analyzed. Corrections and further manipulations of the data will continue in the coming week in order to create the desired dataset to be exported from Excel. [[File:Bpertussis CompiledRawData MS2015.xlsx]]
**[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 17:33, 23 November 2015 (PST)
*'''Mahrad (GenMAPP--> Quality Assurance)''': Downloaded the six data sample files provided by the microarray paper. Files were unzipped, imported into Excel, and manipulated to form a single spreadsheet containing all gene IDs from the different samples. Each sample was placed in its respective column to be further analyzed and manipulated in the upcoming week. Following this, I assumed the position of quality assurance to accommodate the absence of Nicole.
* '''Nicole''' was absent this week. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:52, 23 November 2015 (PST)

=== Meetings! ===
* Monday, November 23: Seaver 120- Brandon and Lena met to work on the GenMAPP testing of the gene IDs from our database.

==Week 11==
===Goals===
* For all:
** Outline your assigned paper on your user page and include a list of 10 defined terms from the paper.
*Nicole & Brandon
**Prepare Journal Club presentation on the designated genome sequencing article
**Slides Due: by midnight, Tuesday, November 17
**Presentation Date: Tuesday, November 24
*Lena & Mahrad
**Prepare Journal Club presentation on the designated microarray paper
**Slides Due: by midnight, Tuesday, November 17
**Presentation Date: Tuesday, November 17

===Progress===
*Nicole Anguiano (Coder): Nicole was absent this week for a medical emergency and is (hopefully) getting some much deserved rest. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
*Brandon Klein (QA): This week I made several edits to the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers Class Whoopers Team Page] in accordance with the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Week_11 Week 11 assignment]. These edits included the following: revising the Class Whoopers template, reorganizing the Team Page structure, commenting out unneeded articles in the annotated bibliography, creating the new bibliography entry as requested by Dr. Dahlquist, and writing the naming conventions for our files. Additionally, I outlined our genome sequencing paper for "Bordetella pertussis" and assessed the [http://www.genedb.org/Homepage/Bpertussis GeneDB MOD] on my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Bklein7_Week_11#Identifying_the_Bordetella_Pertussis_MOD Week 11 Individual Journal Entry]. A preliminary draft of the genome sequencing paper that I will likely be presenting solo was uploaded there. Finally, I kept tabs on group members as the interim Project Manager. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
*Lena Olufson (GenMAPP): This week Mahrad and I met up and analyzed the microarray paper together. We split up the PowerPoint into two halves; I did the introduction/significance of the study as well as the methods performed. Mahrad and I created our presentation together and worked through a Google Doc to edit it simultaneously as we discussed out loud. We also created a flow chart together that demonstrated the experimental design, so the same flow chart is included in both of our individual assignments. We made sure to check in with the temporary project manager and keep him updated on our progress. [[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 23:24, 16 November 2015 (PST)
*Mahrad Saeedi (GenMAPP): This week Lena and I worked on analyzing the microarray paper and creating an outline. We each defined 10 terms separately based upon words we didn't recognize in the article. We then proceeded to produce the PowerPoint presentation for journal club.
[[User:Msaeedi23|Msaeedi23]] ([[User talk:Msaeedi23|talk]]) 23:46, 16 November 2015 (PST)

=== Meetings! ===
*11/15- Lena & Mahrad met to work on outlining article and answering questions
*11/16- Lena & Mahrad met to prepare powerpoint presentation for journal club

==Week 10==
===Goals===
* For all:
** Create an annotated bibliography including one genome sequencing paper and two microarray experiments for Bordetella pertussis
** Create/update team page & compile group annotated bibliography
** Assignment due date: Midnight Tuesday, November 10

===Progress===
*All group members created annotated bibliographies and compiled them on the newly created group page.

=== Meetings! ===
* Monday, November 9, 8pm-9pm, Seaver 120

= Deliverables =
Download links to the deliverables for this project can be found here: [[Bordetella Pertussis GenMAPP Analysis Deliverables]]

==File Naming Protocol==
All file types generated in this project will receive their own unique names composed of two key parts:
#Description
#*This will contain a brief, file-specific description of what content the file contains.
#*Descriptions for different versions of the same file will remain consistent.
#Identifier Tag
#*This tag will be listed as a suffix in the following form: "_cwYYYYMMDD"
#**cw- team name abbreviation
#**YYYYMMDD- date the file was created in the form year/month/day

Additionally, the following file naming best practices will be observed when creating descriptions for new files:
*Our species will be referred to consistently as "bpertussis".
*Files including microarray data taken from the paper by Hoo et al. (2014) will begin with "hoo".
*Spaces will be written as underscores.
*No capitalization will be used.
*No special characters will be used.
*If sequential numbering systems are used, leading zeros will be included for clarity.

Sample .xls file name: hoo_analyzed_data_cw20151122.xls
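
The naming protocol above is mechanical enough to express as a helper function. The Python sketch below is one possible way to do so; the function name <code>build_file_name</code> and its parameters are hypothetical, and keeping hyphens is an assumption made because our gene database file names (for example "bpertussis-std") contain one.

```python
import re
from datetime import date


def build_file_name(description, created, extension, team_tag="cw"):
    """Build a name of the form <description>_cwYYYYMMDD.<extension>."""
    # No capitalization; spaces are written as underscores.
    cleaned = description.strip().lower().replace(" ", "_")
    # Drop special characters. Hyphens are kept (assumption, see lead-in).
    cleaned = re.sub(r"[^a-z0-9_-]", "", cleaned)
    # Identifier tag: team abbreviation followed by the creation date.
    return f"{cleaned}_{team_tag}{created:%Y%m%d}.{extension}"


print(build_file_name("hoo analyzed data", date(2015, 11, 22), "xls"))
```

With the inputs shown, the function reproduces the sample file name given above.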

==File Names==
*GenMAPP Gene Database for assigned species (.gdb): '''bpertussis-std_cwYYYYMMDD.gdb'''
*ReadMe file to accompany the Gene Database (.pdf): '''readme_bpertussis-std_external_cwYYYYMMDD.pdf'''
*Include Gene Database Schema diagram in ReadMe: '''bpertussis_schema_cwYYYYMMDD.pdf'''
*Gene Database Testing Report for final submitted Gene Database (print from wiki to .pdf file): '''bpertussis_gdb_report_cwYYYYMMDD.pdf'''
*Processed and analyzed DNA microarray dataset (.xls): '''hoo_analyzed_data_cwYYYYMMDD.xls'''
*GenMAPP Expression Dataset file (.gex): '''hoo_expression_dataset_cwYYYYMMDD.gex'''
*Filtered MAPPFinder Results (.xls): '''hoo_mappfinder_results_cwYYYYMMDD.xls'''
*Sample MAPP file of a relevant biological pathway for your species (.mapp): '''hoo_sample_mapp_cwYYYYMMDD.mapp'''
*Group Report describing the creation of the Gene Database and the biological analysis of the data (.doc or .pdf): '''bpertussis_analysis_methods_cwYYYYMMDD.pdf'''
*PowerPoint presentation (.ppt, given on Tuesday, December 15): '''bpertussis_analysis_presentation_cwYYYYMMDD.ppt'''

==Microarray Journal Club Presentation==
*[[File: Microarray_Journal_Club_Presentation.pdf]]

= Annotated Bibliography =
== Genome Sequencing Paper ==

Neither of these papers is the ''first'' to report the genome sequence of ''B. pertussis.'' The paper that you will want to use is [http://www.nature.com/ng/journal/v35/n1/full/ng1227.html this one]. I found it by looking at the introduction and references of the Zhang et al. (2011) paper. For your Week 11 assignment, please remove your annotated bibliography entries for the two papers below and create one for this new paper by Parkhill et al. (2003). You will use the Parkhill paper for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 09:54, 10 November 2015 (PST)''

*Parkhill, J., Sebaihia, M., Preston, A., Murphy, L. D., et al. (2003). Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nature genetics, 35(1), 32-40. doi:10.1038/ng1227
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/12910271
* PubMed Central: Not available on PubMed Central.
* Publisher Full Text (HTML): http://www.nature.com/ng/journal/v35/n1/full/ng1227.html
* Publisher Full Text (PDF): http://www.nature.com/ng/journal/v35/n1/pdf/ng1227.pdf
* Copyright: ©2003 Nature Publishing Group (information found on PDF version of article). This article is not Open Access, but it is freely available 6 months after publication.
* Publisher: Nature Publishing Group (for-profit).
* Availability: In print and online.
* Did LMU pay a fee for this article: Yes, LMU pays a subscription fee for access to the journal ''Nature Genetics''.

== Microarray Paper ==

This paper is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:04, 10 November 2015 (PST)''

Hoo, R., Lam, J.H., Huot, L., Pant, A., Li, R., Hot, D., & Alonso, S. (2014). Evidence for a Role of the Polysaccharide Capsule Transport Proteins in Pertussis Pathogenesis. PLoS ONE, 9(12):e115243. doi: 10.1371/journal.pone.0115243
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/25501560
* PubMed Central: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264864/
* Publisher Full Text (HTML): http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243
* Publisher Full Text (PDF): http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0115243&representation=PDF
* Copyright: © 2014 Hoo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited (info found [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243 here]).
* Publisher: PLOS ONE (respected open access organization).
* Availability: Online only.
* Did LMU pay a fee for this article: No.
* Web site where the data resides: [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62088 NCBI GEO data]

The Class Whoopers

2015-12-15T04:27:46Z

Lenaolufson: /* Goals */ added in the genmapp user goals

= Team Information & Links =

{{Template:Class Whoopers}}

=Weekly Updates=
==Week 15==
===Goals===
* '''Assignment due date:''' Midnight Tuesday, December 15
* '''Coder:''' Adjust the GenMAPP Builder code to account for the one EnsemblBacteria reference ID that was missing in our last export; conduct a new import-export cycle to create the (hopefully) final .gdb file; begin characterizing the exported .gdb file in a Gene Database Testing Report; customize the GenMAPP Builder TallyEngine to account for any changes made.
* '''Quality Assurance:'''
* '''GenMAPP User:''' Import data into GenMAPP, create ColorSets, and run MAPPFinder. Document and take notes on test runs with GenMAPP. Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb. Create a .mapp file showing one pathway that is changed in your data.

===Progress===
*'''Brandon (Coder and Project Manager):'''
*'''Mahrad (Quality Assurance):'''
*'''Lena (GenMAPP User):'''

===Meetings!===
*This week, our group used class work sessions to coordinate our work:
**Tuesday, December 8, 2:40 - 4:00
**Thursday, December 10, 2:40 - 4:00
*In addition, we scheduled meetings outside of class to work on the final PowerPoint Presentation and deliverables for our project:
**Sunday, December 13, 7:00 PM - 1:00 AM
**Monday, December 14, 2:00 PM - _______

==Week 14==

===Goals===
* '''Assignment due date:''' Midnight Tuesday, December 8
* '''Coder:''' Create the custom species profile for ''Bordetella pertussis'', run an export using the customized version of GenMAPP Builder, add further customizations to the custom species profile as appear necessary, and run a second export using the further customized version of GenMAPP Builder.
* '''Quality Assurance:''' Identify gene IDs that are missing in the first custom export, work with the coder to classify these IDs, configure the Tally Engine, and complete a gene database testing report for the second custom export.
* '''GenMAPP User:''' Complete the statistical analysis of the data, format the data for import into GenMAPP, and coordinate with the coder/QA to import this data into GenMAPP using the custom gene database.

===Progress===
*'''Brandon (Coder and Project Manager):''' This week, I focused on creating and customizing the species profile for ''Bordetella pertussis'' in GenMAPP Builder, the details of which can be found in my [[Bklein7 Week 14| Week 14 Journal Entry]]. I documented the first export I conducted using a custom ''Bordetella pertussis'' species profile here: [[Gene Database Testing Report- cw20151201]]. I demonstrated that the custom species information implemented in this export worked as intended, but Mahrad and I identified 11 ORF genes that failed to export. I updated the ''Bordetella pertussis'' species profile to account for these ORF genes and conducted a new export, detailed here: [[Gene Database Testing Report- cw20151203]]. Mahrad analyzed the exported .gdb file. In addition to this, I kept tabs on my fellow group members to keep us on track to accomplish our long-term project goals in a timely manner.
** [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 13:39, 7 December 2015 (PST)
*'''Mahrad (Quality Assurance):''' This week, as Quality Assurance, I worked directly with Brandon on the initial data exports. The work is summarized here: [[Msaeedi23 Week 14| Week 14 Journal Entry]]. Next, we meticulously characterized regular expression patterns to detect discrepancies in extracting the data from the original samples. In the following week, I will focus on configuring the Tally Engine for our specific species, which may take some time and coding assistance from Brandon. Once the Tally Engine has been configured for our species, Lena can proceed with GenMAPP processing.
*'''Lena (GenMAPP User):''' This week, I made progress on performing the statistical analysis of the data to prepare it for GenMAPP. I posted my progress for each of the class working sessions on my [[Lenaolufson Week 14| Week 14 Journal Entry]], updating the Excel data sheets after each session. Dr. Dahlquist helped me figure out a problem with the original raw data that was causing the values to be very skewed. I then sent her my updated data sheet, and she was able to use a program to separate the duplicates of the chips. After she sent me back the data with the sorted values, I performed the statistical analysis on the data. The most updated version of the file can be found on my Week 14 journal entry linked previously.
**[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 19:54, 7 December 2015 (PST)

===Meetings!===
*This week, our group used class work sessions to coordinate our work:
**Tuesday, December 1, 2:40 - 4:00
**Thursday, December 3, 2:40 - 4:00
** Monday, December 7, 10:30 - 12 am

==Week 12==
===Goals===

* '''Assignment due date:''' Midnight Tuesday, November 24
* '''Coder:''' Set up a GitHub repository clone of the XMLPipeDB project on your development device, the development rig, and the initial as-is build for gmbuilder. Complete an import-export cycle in association with QA.
* '''Quality Assurance:''' Complete an import-export cycle for the 1st ''Bordetella pertussis'' gene database. Complete a Gene Database Testing Report for this export.
* '''GenMAPP Users:''' Create a Master Raw Data file that contains the IDs and columns of data required for further analysis. Consult with Dr. Dahlquist on how to process the data (normalization, statistics).

===Progress===
*'''Brandon (Quality Assurance and Interim Coder):''' This week, I focused on completing an import-export cycle for our first ''Bordetella pertussis'' gene database- [[File:Bpertussis-std cw20151119.zip]]. With my QA hat, I imported the appropriate data, exported the gene database, and discussed the gene database creation & counting protocol here- [[Gene Database Testing Report- cw20151119]]. With my Coder hat, I followed the instructions on the [[Coder| Coder Guild Page]] to set up a GitHub repository clone of the XMLPipeDB project on my personal laptop, the Eclipse developer rig, and the initial as-is build for gmbuilder. The electronic lab notebook for my QA and Coder work is present on my [[Bklein7 Week 12| Week 12 Page]]. Finally, I wrote a PowerPoint presentation on our genome sequencing paper, which is linked to on my [[Bklein7 Week 12| Week 12 Page]] as well.
**[[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:48, 23 November 2015 (PST)
*'''Lena (GenMAPP):''' I worked on downloading the correct data sample files from the provided files on the microarray paper page. The files were unzipped and prepared to be imported into Excel. In Excel, the data was manipulated to form a spreadsheet that had all of the gene IDs from the different samples with their appropriate columns to be analyzed. The corrections and further manipulations of the data will continue in the coming week in order to create the desired dataset to be exported from Excel. [[File:Bpertussis CompiledRawData MS2015.xlsx]]
**[[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 17:33, 23 November 2015 (PST)
*'''Mahrad (GenMAPP--> Quality Assurance)''': Downloaded the six data sample files provided by the microarray paper. Files were unzipped, imported into Excel, and manipulated to form a single spreadsheet containing all gene IDs from the different samples. Each sample was placed in its respective column to be further analyzed and manipulated in the upcoming week. Following this, I assumed the position of quality assurance to accommodate the absence of Nicole.
* '''Nicole''' was absent this week. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 18:52, 23 November 2015 (PST)

=== Meetings! ===
* Monday, November 23: Seaver 120- Brandon and Lena met to work on the GenMAPP testing of the gene IDs from our database.

==Week 11==
===Goals===
* For all:
** Outline your assigned paper on your user page and include a list of 10 defined terms from the paper.
*Nicole & Brandon
**Prepare Journal Club presentation on the designated genome sequencing article
**Slides Due: by midnight, Tuesday, November 17
**Presentation Date: Tuesday, November 24
*Lena & Mahrad
**Prepare Journal Club presentation on the designated microarray paper
**Slides Due: by midnight, Tuesday, November 17
**Presentation Date: Tuesday, November 17

===Progress===
*Nicole Anguiano (Coder): Nicole was absent this week for a medical emergency and is (hopefully) getting some much deserved rest. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
*Brandon Klein (QA): This week I made several edits to the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers Class Whoopers Team Page] in accordance with the [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Week_11 Week 11 assignment]. These edits included the following: revising the Class Whoopers template, reorganizing the Team Page structure, commenting out unneeded articles in the annotated bibliography, creating the new bibliography entry as requested by Dr. Dahlquist, and writing the naming conventions for our files. Additionally, I outlined our genome sequencing paper for ''Bordetella pertussis'' and assessed the [http://www.genedb.org/Homepage/Bpertussis GeneDB MOD] on my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Bklein7_Week_11#Identifying_the_Bordetella_Pertussis_MOD Week 11 Individual Journal Entry]. A preliminary draft of the genome sequencing paper that I will likely be presenting solo was uploaded there. Finally, I kept tabs on group members as the interim Project Manager. [[User:Bklein7|Bklein7]] ([[User talk:Bklein7|talk]]) 23:14, 16 November 2015 (PST)
*Lena Olufson (GenMAPP): This week Mahrad and I met up and analyzed the microarray paper together. We split up the PowerPoint into two halves; I did the introduction/significance of the study as well as the methods performed. Mahrad and I created our presentation together and worked through a Google Doc to edit it simultaneously as we discussed out loud. We also created a flow chart together that demonstrated the experimental design, thus we have the same ones included in our individual assignments. We made sure to check in with the temporary project manager and keep him updated on our progress. [[User:Lenaolufson|Lenaolufson]] ([[User talk:Lenaolufson|talk]]) 23:24, 16 November 2015 (PST)
*Mahrad Saeedi (GenMAPP): This week Lena and I worked on analyzing the microarray paper and creating an outline. We each defined 10 terms separately based upon words we didn't recognize in the article. We then proceeded to produce the PowerPoint presentation for journal club.
[[User:Msaeedi23|Msaeedi23]] ([[User talk:Msaeedi23|talk]]) 23:46, 16 November 2015 (PST)

=== Meetings! ===
*11/15- Lena & Mahrad met to work on outlining article and answering questions
*11/16- Lena & Mahrad met to prepare powerpoint presentation for journal club

==Week 10==
===Goals===
* For all:
** Create an annotated bibliography including one genome sequencing paper and two microarray experiments for Bordetella pertussis
** Create/update team page & compile group annotated bibliography
** Assignment due date: Midnight Tuesday, November 10

===Progress===
*All group members created annotated bibliographies and compiled them on the newly created group page.

=== Meetings! ===
* Monday, November 9, 8pm-9pm, Seaver 120

= Deliverables =
Download links to the deliverables for this project can be found here: [[Bordetella Pertussis GenMAPP Analysis Deliverables]]

==File Naming Protocol==
All file types generated in this project will receive their own unique names composed of two key parts:
#Description
#*This will contain a brief, file-specific description of what content the file contains.
#*Descriptions for different versions of the same file will remain consistent.
#Identifier Tag
#*This tag will be listed as a suffix in the following form: "_cwYYYYMMDD"
#**cw- team name abbreviation
#**YYYYMMDD- date the file was created in the form year/month/day

Additionally, the following file naming best practices will be observed when creating descriptions for new files:
*Our species will be referred to consistently as "bpertussis".
*Files including microarray data taken from the paper by Hoo et al. (2014) will begin with "hoo".
*Spaces will be written as underscores.
*No capitalization will be used.
*No special characters will be used.
*If sequential numbering systems are used, leading zeros will be included for clarity.

Sample .xls file name: hoo_analyzed_data_cw20151122.xls
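
The naming protocol above can also be checked mechanically. The following is a minimal Python sketch and not part of the team's actual workflow: the function names <code>build_file_name</code> and <code>is_valid_file_name</code> are hypothetical, and the regular expression is only one reading of the rules listed above (lowercase, underscores for spaces, no special characters, "_cwYYYYMMDD" suffix). Hyphens are allowed here on the assumption that names such as "bpertussis-std" are acceptable, and the eight digits are not checked for being a real calendar date.

```python
import re
from datetime import date

TEAM_TAG = "cw"  # team name abbreviation from the protocol

# Assumed pattern: lowercase words joined by underscores or hyphens,
# followed by the _cwYYYYMMDD identifier tag and a file extension.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:[_-][a-z0-9]+)*_cw\d{8}\.[a-z0-9]+$")


def build_file_name(description: str, created: date, extension: str) -> str:
    """Build a file name of the form description_cwYYYYMMDD.ext."""
    # Spaces become underscores and capitalization is removed.
    cleaned = description.strip().lower().replace(" ", "_")
    # Drop anything that is not a letter, digit, underscore, or hyphen.
    cleaned = re.sub(r"[^a-z0-9_-]", "", cleaned)
    suffix = extension.lower().lstrip(".")
    return f"{cleaned}_{TEAM_TAG}{created:%Y%m%d}.{suffix}"


def is_valid_file_name(name: str) -> bool:
    """Return True if the name matches the assumed protocol pattern."""
    return NAME_PATTERN.match(name) is not None


print(build_file_name("Hoo Analyzed Data", date(2015, 11, 22), "xls"))
```

A name such as "Bpertussis CompiledRawData MS2015.xlsx" would be rejected by <code>is_valid_file_name</code> because it contains capital letters and spaces and lacks the identifier tag.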

==File Names==
*GenMAPP Gene Database for assigned species (.gdb): '''bpertussis-std_cwYYYYMMDD.gdb'''
*ReadMe file to accompany the Gene Database (.pdf): '''readme_bpertussis-std_external_cwYYYYMMDD.pdf'''
*Include Gene Database Schema diagram in ReadMe: '''bpertussis_schema_cwYYYYMMDD.pdf'''
*Gene Database Testing Report for final submitted Gene Database (print from wiki to .pdf file): '''bpertussis_gdb_report_cwYYYYMMDD.pdf'''
*Processed and analyzed DNA microarray dataset (.xls): '''hoo_analyzed_data_cwYYYYMMDD.xls'''
*GenMAPP Expression Dataset file (.gex): '''hoo_expression_dataset_cwYYYYMMDD.gex'''
*Filtered MAPPFinder Results (.xls): '''hoo_mappfinder_results_cwYYYYMMDD.xls'''
*Sample MAPP file of a relevant biological pathway for your species (.mapp): '''hoo_sample_mapp_cwYYYYMMDD.mapp'''
*Group Report describing the creation of the Gene Database and the biological analysis of the data (.doc or .pdf): '''bpertussis_analysis_methods_cwYYYYMMDD.pdf'''
*PowerPoint presentation (.ppt, given on Tuesday, December 15): '''bpertussis_analysis_presentation_cwYYYYMMDD.ppt'''

==Microarray Journal Club Presentation==
*[[File: Microarray_Journal_Club_Presentation.pdf]]

= Annotated Bibliography =
== Genome Sequencing Paper ==

Neither of these papers is the ''first'' to report the genome sequence of ''B. pertussis.'' The paper that you will want to use is [http://www.nature.com/ng/journal/v35/n1/full/ng1227.html this one]. I found it by looking at the introduction and references of the Zhang et al. (2011) paper. For your Week 11 assignment, please remove your annotated bibliography entries for the two papers below and create one for this new paper by Parkhill et al. (2003). You will use the Parkhill paper for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 09:54, 10 November 2015 (PST)''

*Parkhill, J., Sebaihia, M., Preston, A., Murphy, L. D., et al. (2003). Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nature genetics, 35(1), 32-40. doi:10.1038/ng1227
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/12910271
* PubMed Central: Not available on PubMed Central.
* Publisher Full Text (HTML): http://www.nature.com/ng/journal/v35/n1/full/ng1227.html
* Publisher Full Text (PDF): http://www.nature.com/ng/journal/v35/n1/pdf/ng1227.pdf
* Copyright: ©2003 Nature Publishing Group (information found on PDF version of article). This article is not Open Access, but it is freely available 6 months after publication.
* Publisher: Nature Publishing Group (for-profit).
* Availability: In print and online.
* Did LMU pay a fee for this article: Yes, LMU pays a subscription fee for access to the journal ''Nature Genetics''.

== Microarray Paper ==

This paper is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:04, 10 November 2015 (PST)''

Hoo, R., Lam, J.H., Huot, L., Pant, A., Li, R., Hot, D., & Alonso, S. (2014). Evidence for a Role of the Polysaccharide Capsule Transport Proteins in Pertussis Pathogenesis. PLoS ONE, 9(12):e115243. doi: 10.1371/journal.pone.0115243
* PubMed Abstract: http://www.ncbi.nlm.nih.gov/pubmed/25501560
* PubMed Central: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264864/
* Publisher Full Text (HTML): http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243
* Publisher Full Text (PDF): http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0115243&representation=PDF
* Copyright: © 2014 Hoo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited (info found [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115243 here]).
* Publisher: PLOS ONE (respected open access organization).
* Availability: Online only.
* Did LMU pay a fee for this article: No.
* Web site where the data resides: [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62088 NCBI GEO data]

Lenaolufson Week 15

2015-12-15T04:25:23Z

Lenaolufson: /* 12/13/15 */ edited the text to what operations I performed

=12/8/15=
*It was now time for me to prepare my file for GenMAPP, and I did so by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I inserted a new worksheet and named it "forGenMAPP".
* I went back to the "statistics" worksheet and Selected All and Copied.
*I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
** I then deleted the ID columns besides the far left one in column A, and I deleted the second MasterIndex column because it was unnecessary.
** I added a "1" before all of the titles of columns D through I so that none of the columns would have the same names due to the replicates.
* I selected Columns V through Y (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
* I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
* I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
* I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
* I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
*After preparing it for GenMAPP, here are the .xlsx and .txt files:
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
* Then it was time to perform a sanity check, which was done using the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I opened my spreadsheet and went to the "forGenMAPP" tab.
* I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue was less than 0.05.
**p-value less than 0.05: 1923/3552, 54%
**p-value less than 0.01: 1028/3552, 29%
**p-value less than 0.001: 242/3552, 7%
**p-value less than 0.0001: 40/3552, 1%
**p < 0.05 for the Bonferroni-corrected p value: 9/3552, 0.2%
**p < 0.05 for the Benjamini and Hochberg-corrected p value: 1365/3552, 38%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change greater than zero.
**964/3552, 27%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change less than zero.
**959/3552, 27%
*With an average log fold change of > 0.25 and p < 0.05
**874/3552, 25%
*With an average log fold change of < -0.25 and p < 0.05
**848/3552, 24%
*Combining the fold change cut-off of greater than 0.25 or less than -0.25 with the unadjusted p value cut-off of p < 0.05
**1722/3552, 48%
*I then was ready to run my .txt file in GenMAPP.
*I downloaded the .gdb file from my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers team page] so that I would have it to run GenMAPP with.
* I opened the Expression Dataset Manager from the Data drop-down list in GenMAPP.
* I selected New Dataset from the Expression Datasets menu and chose the tab-delimited text file formatted for GenMAPP (.txt).
* Upon specifying that all data was numerical, the Expression Dataset Manager converted my data to a .gex file. This process took approximately one minute to complete. In addition to converting the data to a .gex file, an exceptions file (.EX.txt) was also produced, as 342 errors were reportedly detected in the raw data.
** However, there was a problem at this point because the data set had a few mistakes in it.
* I went back to my data sheet and with the help of Dr. Dahlquist, we discovered that some of the values were incorrect as they displayed: #DIV/0!
** We then replaced all of the #DIV/0! cells with blank cells.
***23 replacements for the #DIV/0!
*I then saved and exported this new .txt file and ran it through GenMAPP again.
* This resulted in fewer errors, and the import completed without further problems.
**339 errors with new .txt file: [[Media:Errors in GenMAPP.png]]
* I customized the new Expression Dataset by creating a Color Set with instructions to GenMAPP for displaying data on MAPPs. The new Color Set was entitled "LogFoldChange".
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
** The updated .gex file produced by this procedure can be found here: [[File:Bpertussis CompiledRawData MS2015-3.gex]]
*Links to files created:
** [[File:Bpertussis CompiledRawData MS2015-3.EX.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.gex]]
** [[Media:MAPPFinder results for geneontologyresultsCriterion1-GOtxt.png]]
** [[Media:Gene ontology results.png]]
** [[Media:Errors in GenMAPP.png]]
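
The #DIV/0! cleanup and the p value counts from the sanity check above can also be reproduced outside of Excel. The following is a minimal Python sketch under stated assumptions: the column name "Pvalue" is taken from the steps above, while the function names and the four-gene in-memory sample are hypothetical and do not come from the actual dataset.

```python
import csv
import io

ERROR_TOKEN = "#DIV/0!"


def load_rows(handle):
    """Read tab-delimited rows, replacing #DIV/0! cells with blanks."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    replaced = 0
    for row in reader:
        for key, value in row.items():
            if value == ERROR_TOKEN:
                row[key] = ""
                replaced += 1
        rows.append(row)
    return rows, replaced


def to_float(text):
    """Return a float, or None for blank or non-numeric cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def count_below(rows, column, cutoff):
    """Count rows whose numeric value in `column` is below `cutoff`."""
    count = 0
    for row in rows:
        value = to_float(row.get(column))
        if value is not None and value < cutoff:
            count += 1
    return count


# Hypothetical four-gene sample, for illustration only.
SAMPLE = (
    "ID\tSystemCode\tAvg_ABC_Samples\tPvalue\n"
    "gene1\tN\t0.80\t0.0100\n"
    "gene2\tN\t-0.50\t0.0300\n"
    "gene3\tN\t0.10\t0.2000\n"
    "gene4\tN\t#DIV/0!\t#DIV/0!\n"
)

rows, replaced = load_rows(io.StringIO(SAMPLE))
total = len(rows)
significant = count_below(rows, "Pvalue", 0.05)
print(f"replaced {replaced} error cells")
print(f"p < 0.05: {significant}/{total}, {100 * significant / total:.0f}%")
```

With the real tab-delimited file, a call such as <code>load_rows(open(path, newline=""))</code> would take the place of the in-memory sample, where <code>path</code> is whatever location the .txt file was saved to.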

=12/13/15=

* The above steps were repeated because the Coder created a new .gdb file. After downloading the bpertussis-std_cw20151210.gdb file, I obtained the new .txt file and then prepared to import it into GenMAPP.

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

I made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.
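
As an illustration of how the five ID systems above differ in form, the following Python sketch assigns an ID to a system by pattern. This is a sketch under assumptions, not the logic used by GenMAPP or GenMAPP Builder: the function name <code>classify_gene_id</code> is hypothetical, and the OrderedLocusNames, EnsemblBacteria, RefSeq, and GeneID patterns are inferred only from the single example of each listed above. The UniProt pattern follows the accession format documented by UniProt.

```python
import re

# Patterns are checked in order. All but the UniProt pattern are
# assumptions inferred from the five example IDs only.
ID_PATTERNS = [
    ("OrderedLocusNames", re.compile(r"BP\d{4}", re.IGNORECASE)),
    ("EnsemblBacteria", re.compile(r"[A-Z]{3}\d{5}")),
    ("UniProt", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq", re.compile(r"[NXY]P_\d+")),
    ("GeneID", re.compile(r"\d+")),
]


def classify_gene_id(gene_id):
    """Return the name of the first ID system whose pattern matches fully."""
    for system, pattern in ID_PATTERNS:
        if pattern.fullmatch(gene_id):
            return system
    return None


for example in ["bp1123", "CAE43716", "Q7VWE5", "2665491", "NP_881255"]:
    print(example, classify_gene_id(example))
```

IDs that fit none of the patterns return <code>None</code>, which is one way to flag entries that need manual inspection.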

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
# I customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*I selected the color for this criterion as red using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*I selected the color for this criterion as green using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
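
The two criteria in the "LogFoldChange" color set amount to a simple decision rule. The following Python sketch restates that rule outside of GenMAPP; it is illustrative only. The function name <code>classify_expression</code> and the <code>COLORS</code> mapping are hypothetical, the cut-offs 0.25 and 0.05 and the red/green colors come from the steps above, and grey for genes meeting neither criterion is an assumption based on how unchanged genes appear on the MAPPs.

```python
def classify_expression(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Return "Increased", "Decreased", or None for a single gene.

    Mirrors the two criteria:
      Increased: average log fold change >  0.25 AND p value < 0.05
      Decreased: average log fold change < -0.25 AND p value < 0.05
    """
    if avg_log_fc is None or p_value is None:
        return None  # missing data meets neither criterion
    if p_value < p_cutoff:
        if avg_log_fc > fc_cutoff:
            return "Increased"
        if avg_log_fc < -fc_cutoff:
            return "Decreased"
    return None


# Colors chosen in the steps above; grey is an assumed default.
COLORS = {"Increased": "red", "Decreased": "green", None: "grey"}

for fold_change, p in [(0.80, 0.01), (-0.50, 0.03), (0.10, 0.02), (0.90, 0.20)]:
    label = classify_expression(fold_change, p)
    print(fold_change, p, label, COLORS[label])
```

Note that both conditions must hold at once: a large fold change with p of 0.20, or a significant p value with a fold change of only 0.10, leaves the gene uncolored by either criterion.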

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* I was able to create a mapp of the ribosome pathway by using the genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, I selected KEGG PATHWAY from the main page.
** Next, I scrolled down to "Ribosome" that was under section 2.2 Translation and selected it.
** Then, I searched for my organism in the drop down menu at the top of the page, and I selected the Bordetella pertussis Tohama I organism, and clicked "Go".
** This led me to a page of the ribosome pathway with the gene IDs that pertained to my specific organism. I then was able to create a mapp using these genes in GenMAPP.
** Each of the green highlighted genes on the ribosome pathway was entered into the GenMAPP mapp by entering each gene ID and the name given from the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
**Here is the screenshot of the final mapp for the ribosome pathway created:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this mapp were colored green, symbolizing a decrease in expression; the grey genes were not significantly changed in this experiment. Because nearly all of the significantly changed genes in the ribosome pathway were green, expression of the ribosome genes as a group decreased in the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes is expected to lower the cell's capacity to synthesize proteins. The microarray experiment analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this global down-regulation and links the translation process to the down-regulated genes reported in the experiment.

====Nitrogen Cycle Kegg Pathway====
* I was also able to create another mapp using the nitrogen cycle pathway genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, I selected KEGG PATHWAY from the main page.
** Next, I scrolled down to "Nitrogen Metabolism" that was under section 1.2 Energy Metabolism and selected it.
** Then, I searched for my organism in the drop down menu at the top of the page, and I selected the Bordetella pertussis Tohama I organism, and clicked "Go".
** This led me to a page of the nitrogen metabolism pathway with the gene IDs that pertained to my specific organism. I was then able to create a mapp using these genes in GenMAPP.
** Each of the green highlighted genes on the nitrogen metabolism pathway was entered into the GenMAPP mapp by entering each gene ID and the name given from the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
** Here is the screenshot of the final mapp for the nitrogen cycle pathway created:
* [[File:NitrogencycleGenMAPP.png]]
** This mapp displayed both red and green colored genes, with the green genes symbolizing a decrease and the red genes symbolizing an increase, as well as a couple of gray genes that did not meet either criterion. This nitrogen cycle mapp was created because of the important metabolic processes, specifically nitrogen metabolism, that keep cells alive and reproducing. The genes that displayed red in this mapp had increased expression during the microarray experiment, and the KEGG pathway for nitrogen metabolism shows that these genes specifically aid in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that allows them to operate correctly. Since the glutamate-related genes that we mapped were increased, this suggests that glutamate may help supply the underlying energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** I launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** I clicked on the button "Calculate New Results" followed by "Find File", at which point I specified the .gex file updated during the creation of the "LogFoldChange" color set.
** I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**I selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

Gene Database Testing Report- cw20151210

2015-12-15T04:24:28Z

Lenaolufson: /* Nitrogen Cycle Kegg Pathway */

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip], a tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.
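
Extraction in this report was done with 7-zip. As an alternative sketch for the .gz files only, the same step could be scripted with Python's standard <code>gzip</code> module. The function name <code>gunzip</code> and any file names passed to it are hypothetical, and the sketch assumes the source file name ends in ".gz".

```python
import gzip
import shutil
from pathlib import Path


def gunzip(source, destination_dir):
    """Decompress a .gz file into destination_dir and return the new path."""
    source = Path(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    # Drop only the trailing suffix, e.g. proteome.xml.gz -> proteome.xml
    target = destination_dir / source.with_suffix("").name
    # Stream the data so large XML files are not held in memory at once.
    with gzip.open(source, "rb") as compressed, open(target, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return target
```

Writing the extracted file directly into the working folder, rather than into a subfolder, matches the organization described above.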

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to run the XMLPipeDB SQL script that creates the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This loaded a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run these commands.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above commands.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"
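The counting step performed by the command above can be sketched in Python as a cross-check. This is only an illustration of tallying total and unique regular-expression matches, not the XMLPipeDB Match implementation; the function name <code>count_gene_ids</code> and the sample text are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the command above; the trailing empty alternative
# allows a bare "BP####" ID with no suffix.
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](A|B|\.1|)")


def count_gene_ids(text):
    """Return a Counter mapping each matched gene ID to its number of occurrences."""
    return Counter(m.group(0) for m in GENE_ID_PATTERN.finditer(text))


# Hypothetical snippet standing in for the UniProt XML contents (not real data).
sample = (
    '<name type="ordered locus">BP0001</name>'
    '<name type="ORF">BP1234A</name>'
    '<property type="gene ID" value="BP3167.1"/>'
    '<name type="ordered locus">BP0001</name>'
)

counts = count_gene_ids(sample)
unique_matches = len(counts)          # number of distinct IDs
total_matches = sum(counts.values())  # number of matches including repeats
print(unique_matches, total_matches)
```

For the real file, the text would be read from the extracted UniProt XML instead of the sample string, and the unique count compared with the figure reported below.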

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched our expectation. The count includes the ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
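The pattern breakdown above was done in Excel. As a rough alternative, the same tally could be sketched in Python as follows. The names <code>classify</code> and <code>tally</code> and the sample IDs are hypothetical and only illustrate the approach.

```python
import re
from collections import Counter

# One (label, regex) pair per pattern named in the list above.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def classify(gene_id):
    """Return the label of the pattern that the whole ID matches, or 'other'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "other"


def tally(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in gene_ids)


# Hypothetical IDs for illustration only; the real input would be the ID
# column copied from the OrderedLocusNames table.
sample_ids = ["BP0001", "BP0002", "BP1234A", "BP2345B", "BP3167.1", "bp1123"]
result = tally(sample_ids)
print(result)
```

Any ID counted under "other" (such as the lowercase example here) would deserve a closer look, since it matches none of the expected patterns.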

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
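The comparison between the microarray gene IDs and the gene database was done in Excel. The sketch below shows the same membership check in Python, which also makes the arithmetic explicit (3552 dataset IDs minus 3211 imported IDs leaves 341 exceptions). It is a simplification: it checks only one list of IDs, whereas GenMAPP also consults related ID systems, and it assumes exact, case-sensitive matching. The function name and sample IDs are hypothetical.

```python
def split_by_membership(dataset_ids, gdb_ids):
    """Split dataset IDs into those found in the gene database and those not found."""
    known = set(gdb_ids)  # set lookup keeps the check fast for thousands of IDs
    imported = [gene_id for gene_id in dataset_ids if gene_id in known]
    exceptions = [gene_id for gene_id in dataset_ids if gene_id not in known]
    return imported, exceptions


# Hypothetical lists for illustration; the real inputs would be the ID column
# of the microarray dataset and the IDs from the OrderedLocusNames table.
dataset_ids = ["BP0001", "BP0101", "BP1677", "BP0002"]
gdb_ids = ["BP0001", "BP0002", "BP0003"]

imported, exceptions = split_by_membership(dataset_ids, gdb_ids)
print(len(imported), len(exceptions))
print(exceptions)
```

The imported and exception counts always add up to the size of the dataset, which is a quick sanity check on the numbers reported above.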

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
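The two criteria in the "LogFoldChange" color set can be restated as a small Python function. This is only a sketch of the logic for checking individual genes by hand, not how GenMAPP evaluates criteria internally; the function name, parameter names, and the "No criteria met" label are assumptions.

```python
def classify_gene(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Apply the "Increased" and "Decreased" criteria to one gene.

    avg_log_fc corresponds to [Avg_LogFC_all] and p_value to [Pvalue].
    Both comparisons are strict, as in the criteria above.
    """
    if avg_log_fc > fc_cutoff and p_value < p_cutoff:
        return "Increased"   # colored red in the color set
    if avg_log_fc < -fc_cutoff and p_value < p_cutoff:
        return "Decreased"   # colored green in the color set
    return "No criteria met"


# Hypothetical values for illustration only.
print(classify_gene(0.80, 0.01))
print(classify_gene(-0.50, 0.03))
print(classify_gene(0.80, 0.20))
```

A gene with a large fold change but a p value of 0.05 or higher meets neither criterion, and neither does a gene whose log fold change is exactly 0.25 or -0.25.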

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our organism. We used these genes to create a MAPP in GenMAPP.
** Each gene highlighted in green on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color code the genes.
**Here is the screenshot of the final MAPP for the ribosome pathway:
***[[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were gray, indicating no significant change. Expression of the genes in the ribosome pathway therefore largely decreased during the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes is expected to limit protein synthesis in the cell. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this global down-regulation and links reduced translation to the down-regulated key genes from the experiment.

====Nitrogen Cycle Kegg Pathway====
* We also created a MAPP of the nitrogen metabolism pathway using the genes provided by the http://www.genome.jp/kegg/ website.
** On the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our organism. We used these genes to create a MAPP in GenMAPP.
** Each gene highlighted in green on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color code the genes.
** Here is the screenshot of the final MAPP for the nitrogen metabolism pathway:
***[[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression. A couple of gray genes met neither criterion. We created this MAPP because nitrogen metabolism is one of the metabolic processes that keep cells alive and reproducing. According to the KEGG pathway for nitrogen metabolism, the genes colored red (increased expression in the microarray experiment) participate in the metabolism of glutamate. Glutamate plays a role in providing the energy that cells need to operate correctly. Since the glutamate-related genes that we mapped showed increased expression, glutamate metabolism may help supply the energy that ''Bordetella pertussis'' needs to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Gene Database Testing Report- cw20151210

2015-12-15T04:23:06Z

Lenaolufson: /* Ribosome Kegg Pathway */

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**The browser opened this text file automatically instead of downloading it, so we saved the file manually.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to run the XMLPipeDB SQL script that creates the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This loaded a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run these commands.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above commands.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched our expectation. The count includes the ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
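
The pattern tally described above was done by hand in Excel. As a rough cross-check, the same classification could be scripted. The following is only a minimal sketch: the function names and the sample IDs are hypothetical, and the real input would be the ID column copied from the "OrderedLocusNames" table.

```python
import re
from collections import Counter

# The four ID shapes described in the OrderedLocusNames analysis above.
PATTERNS = {
    "BP####": re.compile(r"BP\d{4}"),
    "BP####A": re.compile(r"BP\d{4}A"),
    "BP####B": re.compile(r"BP\d{4}B"),
    "BP####.1": re.compile(r"BP\d{4}\.1"),
}


def classify(gene_id):
    """Return the label of the first pattern that matches the whole ID."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(gene_id):
            return label
    return "other"


def tally(gene_ids):
    """Count how many IDs fall into each pattern category."""
    return Counter(classify(g) for g in gene_ids)


# Hypothetical sample; not the real table contents.
sample_ids = ["BP0001", "BP1123", "BP3167A", "BP9999B", "BP3167.1", "bp0002"]
counts = tally(sample_ids)
for label, n in sorted(counts.items()):
    print(label, n)
```

Any ID that lands in the "other" category would be worth inspecting by hand, since the analysis above expects every ID to fit one of the four patterns.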

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
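
The Excel comparison between the exception IDs and the "OrderedLocusNames" IDs amounts to a set difference. Here is a small sketch of that check, assuming both lists have been exported as plain lists of strings. The function name is hypothetical, the database IDs in the example are made up, and normalizing case and whitespace is an assumption rather than something the original comparison did.

```python
def ids_missing_from_gdb(exception_ids, gdb_ids):
    """Return the exception IDs that do not appear among the database IDs.

    Whitespace and letter case are normalized first (an assumption made
    for this sketch so that "bp0101 " and "BP0101" compare as equal).
    """
    gdb = {g.strip().upper() for g in gdb_ids}
    exceptions = {e.strip().upper() for e in exception_ids}
    return sorted(exceptions - gdb)


# The four example IDs quoted above, checked against a made-up ID list.
example_exceptions = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
made_up_gdb_ids = ["BP0001", "BP1123", "BP3167A"]
missing = ids_missing_from_gdb(example_exceptions, made_up_gdb_ids)
print(len(missing), "of", len(example_exceptions), "exception IDs are absent")
```

If every exception ID comes back as missing, that agrees with the finding above that none of the 341 IDs were present in the .gdb file.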

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# After entering these criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
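
To make the two criteria concrete, the sketch below applies the same thresholds to individual values. It is only an illustration of the logic of the "Increased" and "Decreased" criteria, not of how GenMAPP evaluates them internally. The function name and the example values are hypothetical, and gray is used for genes that meet neither criterion.

```python
def color_for(log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Map a log fold change and p value to a color.

    red   -> "Increased": log fold change >  0.25 AND p < 0.05
    green -> "Decreased": log fold change < -0.25 AND p < 0.05
    gray  -> neither criterion met
    """
    if p_value < p_cutoff and log_fc > fc_cutoff:
        return "red"
    if p_value < p_cutoff and log_fc < -fc_cutoff:
        return "green"
    return "gray"


# Hypothetical example values, not taken from the dataset.
examples = [(0.80, 0.01), (-0.60, 0.03), (0.10, 0.001), (1.50, 0.20)]
colors = [color_for(fc, p) for fc, p in examples]
print(colors)
```

Both comparisons are strict, so a gene with a log fold change of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.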

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we found our organism, ''Bordetella pertussis'' Tohama I, in the drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our specific organism. We then created a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
**Here is the screenshot of the final MAPP for the ribosome pathway:
** [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were gray, indicating no significant change in this experiment. Expression of the genes in the ribosome category therefore largely decreased during the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes suggests a reduced capacity for protein synthesis. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this global down-regulation: with less gene expression overall, less translation is needed, which may explain the lower expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We also created a second MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we found our organism, ''Bordetella pertussis'' Tohama I, in the drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our specific organism. We then created a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** Here is the screenshot of the final MAPP for the nitrogen metabolism pathway:
** [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes, with green indicating decreased expression and red indicating increased expression, as well as a couple of gray genes that did not meet either criterion. We chose nitrogen metabolism because of its importance to the metabolic processes that keep cells alive and reproducing. The genes shown in red had increased expression during the microarray experiment, and according to the KEGG nitrogen metabolism pathway, these genes are specifically involved in the metabolism of glutamate. Glutamate is important to cells as a source of energy, and because the glutamate-related genes that we mapped showed increased expression, glutamate metabolism may help supply the energy that ''Bordetella pertussis'' strains need to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Lenaolufson Week 15

2015-12-15T04:13:53Z

Lenaolufson: /* 12/13/15 */

=12/8/15=
*It was now time for me to prepare my file for GenMAPP, and I did so by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I inserted a new worksheet and named it "forGenMAPP".
* I went back to the "statistics" worksheet and Selected All and Copied.
*I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
** I then deleted the ID columns besides the far left one in column A, and I deleted the second MasterIndex column because it was unnecessary.
** I added a "1" before all of the titles of columns D through I so that none of the columns would have the same names due to the replicates.
* I selected Columns V through Y (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
* I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
* I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
* I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
* I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
*After preparing it for GenMAPP, here are the .xls and .txt files:
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
* Then it was time to perform a sanity check, which was done using the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I opened my spreadsheet and went to the "forGenMAPP" tab.
* I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue was less than 0.05.
**p-value less than 0.05: 1923/3552, 54%
**p-value less than 0.01: 1028/3552, 29%
**p-value less than 0.001: 242/3552, 7%
**p-value less than 0.0001: 40/3552, 1%
**p < 0.05 for the Bonferroni-corrected p value: 9/3552, 0.3%
**p < 0.05 for the Benjamini and Hochberg-corrected p value: 1365/3552, 38%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change greater than zero.
**964/3552, 27%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change less than zero.
**959/3552, 27%
*With an average log fold change of > 0.25 and p < 0.05
**874/3552, 25%
*With an average log fold change of < -0.25 and p < 0.05
**848/3552, 24%
*Combining the fold change cut-off of greater than 0.25 or less than -0.25 with the unadjusted p value cut-off of p < 0.05
**1722/3552, 48%
*I then was ready to run my .txt file in GenMAPP.
*I downloaded the .gdb file from my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers team page] so that I could use it to run GenMAPP.
* I opened the Expression Dataset Manager from the Data drop-down list in GenMAPP.
* I selected New Dataset from the Expression Datasets menu and chose the tab-delimited text file formatted for GenMAPP (.txt).
* Upon specifying that all data was numerical, the Expression Dataset Manager converted my data to a .gex file. This process took approximately one minute to complete. In addition to converting the data to a .gex file, an exceptions file (.EX.txt) was also produced, as 342 errors were reportedly detected in the raw data.
** However, there was a problem at this point because the data set had a few mistakes in it.
* I went back to my data sheet and with the help of Dr. Dahlquist, we discovered that some of the values were incorrect as they displayed: #DIV/0!
** We then replaced all of the #DIV/0! cells with blank cells.
***23 replacements for the #DIV/0!
*I then saved and exported this new .txt file and ran it through GenMAPP again.
* This resulted in fewer errors, and the import completed without further problems.
**339 errors with new .txt file: [[Media:Errors in GenMAPP.png]]
* I customized the new Expression Dataset by creating a Color Set with instructions to GenMAPP for displaying data on MAPPs. The new Color Set was entitled "LogFoldChange".
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
* After entering these criteria, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
** The updated .gex file produced by this procedure can be found here: [[File:Bpertussis CompiledRawData MS2015-3.gex]]
*Links to files created:
** [[File:Bpertussis CompiledRawData MS2015-3.EX.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.gex]]
** [[Media:MAPPFinder results for geneontologyresultsCriterion1-GOtxt.png]]
** [[Media:Gene ontology results.png]]
** [[Media:Errors in GenMAPP.png]]
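
The sanity-check counts in this entry were obtained with Excel's AutoFilter. As a rough cross-check, the same filtering could be scripted against the tab-delimited file. This is a sketch under stated assumptions: the column names "Avg_ABC_Samples" and "Pvalue" come from the entry above, while the function names and the tiny inline table are hypothetical stand-ins for the real file. Cells that are blank or contain #DIV/0! are skipped as missing values.

```python
import csv
import io


def to_float(cell):
    """Convert a spreadsheet cell to float; blank or #DIV/0! becomes None."""
    cell = cell.strip()
    if cell in ("", "#DIV/0!"):
        return None
    return float(cell)


def sanity_check(tsv_text, fc_col="Avg_ABC_Samples", p_col="Pvalue"):
    """Count genes passing the p < 0.05 and |log fold change| > 0.25 filters."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {"total": 0, "p<0.05": 0, "increased": 0, "decreased": 0}
    for row in reader:
        result["total"] += 1
        fc, p = to_float(row[fc_col]), to_float(row[p_col])
        if fc is None or p is None:
            continue  # missing value: cannot be evaluated
        if p < 0.05:
            result["p<0.05"] += 1
            if fc > 0.25:
                result["increased"] += 1
            elif fc < -0.25:
                result["decreased"] += 1
    return result


# Made-up rows in the forGenMAPP layout (ID, SystemCode, data columns).
sample = (
    "ID\tSystemCode\tAvg_ABC_Samples\tPvalue\n"
    "GENE1\tN\t0.80\t0.0100\n"
    "GENE2\tN\t-0.50\t0.0300\n"
    "GENE3\tN\t0.10\t0.0400\n"
    "GENE4\tN\t1.20\t0.2000\n"
    "GENE5\tN\t#DIV/0!\t0.0010\n"
)
summary = sanity_check(sample)
print(summary)
```

For the real data, the text of the exported .txt file would be passed in place of the sample. The "increased" and "decreased" counts should add up to the combined cut-off count, as they do in the entry (874 + 848 = 1722).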

=12/13/15=

* The above steps were repeated due to the creation of a new .gdb file by the Coder. Once downloading the ________ file, I ran it through the

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# After entering these criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we found our organism, ''Bordetella pertussis'' Tohama I, in the drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertained to our specific organism. We then created a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
**Here is the screenshot of the final MAPP for the ribosome pathway:
** [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were gray, indicating no significant change in this experiment. Expression of the genes in the ribosome category therefore largely decreased during the microarray experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes suggests a reduced capacity for protein synthesis. The microarray analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with this global down-regulation: with less gene expression overall, less translation is needed, which may explain the lower expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We also created a second MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** After accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we found our organism, ''Bordetella pertussis'' Tohama I, in the drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertained to our specific organism. We then created a MAPP using these genes in GenMAPP.
** Each of the green-highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** Here is the screenshot of the final MAPP for the nitrogen metabolism pathway:
** [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes, with green indicating decreased expression and red indicating increased expression, as well as a couple of gray genes that did not meet either criterion. We chose nitrogen metabolism because of its importance to the metabolic processes that keep cells alive and reproducing. The genes shown in red had increased expression during the microarray experiment, and according to the KEGG nitrogen metabolism pathway, these genes are specifically involved in the metabolism of glutamate. Glutamate is important to cells as a source of energy, and because the glutamate-related genes that we mapped showed increased expression, glutamate metabolism may help supply the energy that ''Bordetella pertussis'' strains need to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

Lenaolufson Week 15

2015-12-15T04:13:34Z

Lenaolufson: /* 12/8/15 */

=12/8/15=
*It was now time for me to prepare my file for GenMAPP, and I did so by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I inserted a new worksheet and named it "forGenMAPP".
* I went back to the "statistics" worksheet and Selected All and Copied.
*I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
** I then deleted the ID columns besides the far left one in column A, and I deleted the second MasterIndex column because it was unnecessary.
** I added a "1" before all of the titles of columns D through I so that none of the columns would have the same names due to the replicates.
* I selected Columns V through Y (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
* I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
* I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
* I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
* I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
*After preparing it for GenMAPP, here are the .xls and .txt files:
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
* Then it was time to perform a sanity check, which was done using the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I opened my spreadsheet and went to the "forGenMAPP" tab.
* I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue was less than 0.05.
**p-value less than 0.05: 1923/3552, 54%
**p-value less than 0.01: 1028/3552, 29%
**p-value less than 0.001: 242/3552, 7%
**p-value less than 0.0001: 40/3552, 1%
**p < 0.05 for the Bonferroni-corrected p value: 9/3552, 0.3%
**p < 0.05 for the Benjamini and Hochberg-corrected p value: 1365/3552, 38%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change greater than zero.
**964/3552, 27%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change less than zero.
**959/3552, 27%
*With an average log fold change of > 0.25 and p < 0.05
**874/3552, 25%
*With an average log fold change of < -0.25 and p < 0.05
**848/3552, 24%
*Combining the fold change cut-off of greater than 0.25 or less than -0.25 with the unadjusted p value cut-off of p < 0.05
**1722/3552, 48%
*I then was ready to run my .txt file in GenMAPP.
*I downloaded the .gdb file from my [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers team page] so that I could use it to run GenMAPP.
* I opened the Expression Dataset Manager from the Data drop-down list in GenMAPP.
* I selected New Dataset from the Expression Datasets menu and chose the tab-delimited text file formatted for GenMAPP (.txt).
* Upon specifying that all data was numerical, the Expression Dataset Manager converted my data to a .gex file. This process took approximately one minute to complete. In addition to converting the data to a .gex file, an exceptions file (.EX.txt) was also produced, as 342 errors were reportedly detected in the raw data.
** However, there was a problem at this point because the data set had a few mistakes in it.
* I went back to my data sheet and with the help of Dr. Dahlquist, we discovered that some of the values were incorrect as they displayed: #DIV/0!
** We then replaced all of the #DIV/0! cells with blank cells.
***23 replacements for the #DIV/0!
*I then saved and exported this new .txt file and ran it through GenMAPP again.
* This resulted in fewer errors, and the import completed without further problems.
**339 errors with new .txt file: [[Media:Errors in GenMAPP.png]]
* I customized the new Expression Dataset by creating a Color Set with instructions to GenMAPP for displaying data on MAPPs. The new Color Set was entitled "LogFoldChange".
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the Vibrio dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the Vibrio dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
** The updated .gex file produced by this procedure can be found here: [[File:Bpertussis CompiledRawData MS2015-3.gex]]
*links to files created:
** [[File:Bpertussis CompiledRawData MS2015-3.EX.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.gex]]
** [[Media:MAPPFinder results for geneontologyresultsCriterion1-GOtxt.png]]
** [[Media:Gene ontology results.png]]
** [[Media:Errors in GenMAPP.png]]
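
The sanity-check counts above came from spreadsheet filters. As a cross-check, the same filters can be sketched in Python. This is a minimal sketch, not the procedure I actually used: the rows below are made-up placeholder values, and the function names <code>count_rows</code> and <code>percent</code> are hypothetical. Only the column names "Avg_ABC_Samples" and "Pvalue" and the cut-offs come from the steps above.

```python
# Hypothetical rows standing in for the "forGenMAPP" worksheet; the values are
# made up for illustration and are not taken from the real dataset.
ROWS = [
    {"ID": "gene_a", "Avg_ABC_Samples": 0.80, "Pvalue": 0.001},
    {"ID": "gene_b", "Avg_ABC_Samples": -0.40, "Pvalue": 0.020},
    {"ID": "gene_c", "Avg_ABC_Samples": 0.10, "Pvalue": 0.030},
    {"ID": "gene_d", "Avg_ABC_Samples": -0.90, "Pvalue": 0.200},
]


def count_rows(rows, p_cutoff, min_fc=None, max_fc=None):
    """Count rows with Pvalue < p_cutoff, optionally also requiring
    fold change > min_fc and/or fold change < max_fc."""
    count = 0
    for row in rows:
        if row["Pvalue"] >= p_cutoff:
            continue
        fc = row["Avg_ABC_Samples"]
        if min_fc is not None and not fc > min_fc:
            continue
        if max_fc is not None and not fc < max_fc:
            continue
        count += 1
    return count


def percent(count, total):
    """Express count as a whole-number percentage of total."""
    return round(100 * count / total)


if __name__ == "__main__":
    total = len(ROWS)
    up = count_rows(ROWS, 0.05, min_fc=0.25)
    down = count_rows(ROWS, 0.05, max_fc=-0.25)
    print(f"increased: {up}/{total}, {percent(up, total)}%")
    print(f"decreased: {down}/{total}, {percent(down, total)}%")
```

The <code>percent</code> helper can also be applied directly to the counts reported above, for example <code>percent(1923, 3552)</code> for the unadjusted p < 0.05 filter.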

==12/13/15==

* The above steps were repeated due to the creation of a new .gdb file by the Coder. Once downloading the ________ file, I ran it through the
==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.
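
To illustrate how the five example IDs differ in form, the sketch below guesses an ID system from the shape of an ID. The regular expressions are assumptions inferred only from the five examples listed above; they are not the official format definitions of these ID systems, and the function name <code>guess_id_system</code> is hypothetical. GenMAPP itself relies on the system code, not on pattern matching.

```python
import re

# Illustrative patterns inferred only from the five example IDs in this report.
# They are assumptions, not the official formats of these ID systems.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),
    ("GeneID", re.compile(r"^\d+$")),
    ("OrderedLocusNames", re.compile(r"^BP\d{4}$", re.IGNORECASE)),
    ("UniProt", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("EnsemblBacteria", re.compile(r"^[A-Z]{3}\d{5}$")),
]


def guess_id_system(gene_id):
    """Return the first ID system whose pattern matches, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(gene_id):
            return name
    return None


if __name__ == "__main__":
    for example in ["bp1123", "CAE43716", "Q7VWE5", "2665491", "NP_881255"]:
        print(example, "->", guess_id_system(example))
```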

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.
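
The logic of the two criteria can be sketched in Python as follows. This is only an illustration of the cut-offs stated above, not GenMAPP's own implementation; the function name <code>classify</code>, the label "No criteria met", and the <code>COLORS</code> mapping are assumptions made for this example (red and green are the colors we chose; gray is how unchanged genes appeared on our MAPPs).

```python
def classify(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Apply the "Increased" and "Decreased" criteria to one gene."""
    if p_value < p_cutoff and avg_log_fc > fc_cutoff:
        return "Increased"
    if p_value < p_cutoff and avg_log_fc < -fc_cutoff:
        return "Decreased"
    # Hypothetical label for genes that satisfy neither criterion.
    return "No criteria met"


# Colors as chosen in the Color Set; gray for genes meeting neither criterion.
COLORS = {"Increased": "red", "Decreased": "green", "No criteria met": "gray"}


if __name__ == "__main__":
    # Made-up (log fold change, p value) pairs for illustration only.
    for log_fc, p in [(0.60, 0.01), (-0.60, 0.01), (0.60, 0.20), (0.10, 0.01)]:
        label = classify(log_fc, p)
        print(log_fc, p, label, COLORS[label])
```

Both comparisons are strict, so a gene with a log fold change of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.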

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We were able to create a mapp of the ribosome pathway by using the genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" that was under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page of the ribosome pathway with the gene IDs that pertained to our specific organism. We were then able to create a mapp using these genes in GenMAPP.
** Each of the highlighted genes on the ribosome pathway was entered into the GenMAPP mapp by entering its gene ID and the name given in the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
**Here is the screenshot of the final mapp for the ribosome pathway created:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this mapp were colored green, indicating a decrease in expression; the remaining genes were grey, meaning that they were not significantly changed in this experiment. Since every significantly changed gene in the ribosome pathway was green, the significant changes in expression of the ribosome genes during the microarray experiment were all decreases. Ribosomes play a key role in translation, and without them genes cannot be expressed as functional proteins. The microarray experiment analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome genes links the translation process to this down-regulation: because the mutant lacked a necessary protein, less translation of the down-regulated genes took place and fewer ribosomes were involved, which would account for the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We were also able to create another mapp using the nitrogen cycle pathway genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" that was under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page of the nitrogen metabolism pathway with the gene IDs that pertained to our specific organism. We were then able to create a mapp using these genes in GenMAPP.
** Each of the highlighted genes on the nitrogen metabolism pathway was entered into the GenMAPP mapp by entering its gene ID and the name given in the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
** Here is the screenshot of the final mapp for the nitrogen cycle pathway created:
* [[File:NitrogencycleGenMAPP.png]]
** This mapp displayed both red and green genes, with green indicating a decrease and red indicating an increase in expression, as well as a couple of gray genes that did not meet either criterion. We created the nitrogen metabolism mapp because metabolic processes, nitrogen metabolism among them, are essential for keeping cells alive and reproducing. The genes that displayed red in this mapp had increased expression during the microarray experiment, and according to the KEGG pathway for nitrogen metabolism, these genes specifically aid in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that cells need to operate correctly. Since the glutamate-related genes that we mapped were increased, glutamate may play a role in supplying the underlying energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

Lenaolufson Week 15

2015-12-15T01:27:03Z

Lenaolufson: added protocol--need to edit tense and modify it to my own personal

==12/8/15==
*It was now time to prepare my file for GenMAPP, which I did by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I inserted a new worksheet and named it "forGenMAPP".
* I went back to the "statistics" worksheet and Selected All and Copied.
*I went to my new sheet and clicked on cell A1 and selected Paste Special, clicked on the Values radio button, and clicked OK.
** I then deleted the ID columns besides the far left one in column A, and I deleted the second MasterIndex column because it was unnecessary.
** I added a "1" before all of the titles of columns D through I so that none of the columns would have the same names due to the replicates.
* I selected Columns V through Y (all the fold changes). I selected the menu item Format > Cells. Under the number tab, I selected 2 decimal places. I clicked OK.
* I selected all the columns containing p values. I selected the menu item Format > Cells. Under the number tab, I selected 4 decimal places. I clicked OK.
* I deleted the left-most Bonferroni p value column, preserving the one that showed the result of my "if" statement.
* I inserted a column to the right of the "ID" column. I typed the header "SystemCode" into the top cell of this column. I filled the entire column (each cell) with the letter "N".
* I selected the menu item File > Save As, and chose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
*After preparing the data for GenMAPP, I saved the following .xlsx and .txt files:
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
* Then it was time to perform a sanity check, which I did by following the [http://www.openwetware.org/wiki/BIOL398-01/S10:Sample_Microarray_Analysis_Vibrio_cholerae ''Vibrio cholerae'' instructions found here].
* I opened my spreadsheet and went to the "forGenMAPP" tab.
* I clicked on cell A1 and selected the menu item Data > Filter > Autofilter. Little drop-down arrows appeared at the top of each column. This enabled me to filter the data according to criteria I set.
* I clicked on the drop-down arrow on my "Pvalue" column. I selected "Custom". In the window that appeared, I set a criterion that filtered my data so that the Pvalue was less than 0.05.
**p-value less than 0.05: 1923/3552, 54%
**p-value less than 0.01: 1028/3552, 29%
**p-value less than 0.001: 242/3552, 7%
**p-value less than 0.0001: 40/3552, 1%
**p < 0.05 for the Bonferroni-corrected p value: 9/3552, 0.3%
**p < 0.05 for the Benjamini and Hochberg-corrected p value: 1365/3552, 38%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change greater than zero.
**964/3552, 27%
*Keeping the (unadjusted) "Pvalue" filter at p < 0.05, I filtered the "Avg_ABC_Samples" column to show all genes with an average log fold change less than zero.
**959/3552, 27%
*With an average log fold change of > 0.25 and p < 0.05
**874/3552, 25%
*With an average log fold change of < -0.25 and p < 0.05
**848/3552, 24%
*Combining both directions, with a fold change cut-off of greater than 0.25 or less than -0.25 and the unadjusted p value cut-off of p < 0.05
**1722/3552, 48%
*I then was ready to run my .txt file in GenMAPP.
*I downloaded the .gdb file from [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/The_Class_Whoopers my team page] so that I could use it to run GenMAPP.
* I opened the Expression Dataset Manager from the Data drop-down list in GenMAPP.
* I selected New Dataset from the Expression Datasets menu and chose the tab-delimited text file formatted for GenMAPP (.txt).
* Upon specifying that all data was numerical, the Expression Dataset Manager converted my data to a .gex file. This process took approximately one minute to complete. In addition to converting the data to a .gex file, an exceptions file (.EX.txt) was also produced, as 342 errors were reportedly detected in the raw data.
** However, there was a problem at this point because the data set had a few mistakes in it.
* I went back to my data sheet and with the help of Dr. Dahlquist, we discovered that some of the values were incorrect as they displayed: #DIV/0!
** We then replaced all of the #DIV/0! cells with blank cells.
***23 replacements for the #DIV/0!
*I then saved and exported this new .txt file and ran it through GenMAPP again.
* This resulted in fewer errors, and the import completed without further problems.
**339 errors with new .txt file: [[Media:Errors in GenMAPP.png]]
* I customized the new Expression Dataset by creating a Color Set with instructions to GenMAPP for displaying data on MAPPs. The new Color Set was entitled "LogFoldChange".
**First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the Vibrio dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Increased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] > 0.25 AND [Pvalue] < 0.05</code>
**Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
***I specified the Gene value as "Avg_ABC_Samples" for the Vibrio dataset.
***I activated the Criteria Builder by clicking the New button and named the criterion "Decreased".
***I selected the color for this criterion using the color box.
***I stated the criterion as follows and added it to the Criteria List: <code>[Avg_ABC_Samples] < -0.25 AND [Pvalue] < 0.05</code>
* Upon entering these color sets, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu.
** The updated .gex file produced by this procedure can be found here: [[File:Bpertussis CompiledRawData MS2015-3.gex]]
*links to files created:
** [[File:Bpertussis CompiledRawData MS2015-3.EX.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.xlsx]]
** [[File:Bpertussis CompiledRawData MS2015-3.txt]]
** [[File:Bpertussis CompiledRawData MS2015-3.gex]]
** [[Media:MAPPFinder results for geneontologyresultsCriterion1-GOtxt.png]]
** [[Media:Gene ontology results.png]]
** [[Media:Errors in GenMAPP.png]]
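
The #DIV/0! clean-up described above was done by hand in the spreadsheet. For a tab-delimited export, the same replacement could be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name <code>blank_div_errors</code> and the sample text are hypothetical, and the sketch assumes the error appears as the literal cell text "#DIV/0!".

```python
def blank_div_errors(text, marker="#DIV/0!"):
    """Blank out every cell equal to marker in tab-delimited text.

    Returns the cleaned text and the number of cells that were replaced.
    """
    cleaned_lines = []
    replaced = 0
    for line in text.split("\n"):
        cells = line.split("\t")
        for i, cell in enumerate(cells):
            if cell.strip() == marker:
                cells[i] = ""
                replaced += 1
        cleaned_lines.append("\t".join(cells))
    return "\n".join(cleaned_lines), replaced


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = "ID\tPvalue\nbp0001\t#DIV/0!\nbp0002\t0.01"
    cleaned, count = blank_div_errors(sample)
    print(count, "cell(s) replaced")
    print(cleaned)
```

The returned count can be compared against the number of replacements made by hand.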

==12/13/15==

* The above steps were repeated due to the creation of a new .gdb file by the Coder. Once downloading the ________ file, I ran it through the
==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these color sets, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This effectively updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We were able to create a mapp of the ribosome pathway by using the genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Ribosome" that was under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page of the ribosome pathway with the gene IDs that pertained to our specific organism. We were then able to create a mapp using these genes in GenMAPP.
** Each of the highlighted genes on the ribosome pathway was entered into the GenMAPP mapp by entering its gene ID and the name given in the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
**Here is the screenshot of the final mapp for the ribosome pathway created:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this mapp were colored green, indicating a decrease in expression; the remaining genes were grey, meaning that they were not significantly changed in this experiment. Since every significantly changed gene in the ribosome pathway was green, the significant changes in expression of the ribosome genes during the microarray experiment were all decreases. Ribosomes play a key role in translation, and without them genes cannot be expressed as functional proteins. The microarray experiment analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome genes links the translation process to this down-regulation: because the mutant lacked a necessary protein, less translation of the down-regulated genes took place and fewer ribosomes were involved, which would account for the decreased expression of the genes mapped in the ribosome pathway.

====Nitrogen Cycle Kegg Pathway====
* We were also able to create another mapp using the nitrogen cycle pathway genes provided from the http://www.genome.jp/kegg/ website.
** Once accessing the website, we selected KEGG PATHWAY from the main page.
** Next, we scrolled down to "Nitrogen Metabolism" that was under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page of the nitrogen metabolism pathway with the gene IDs that pertained to our specific organism. We were then able to create a mapp using these genes in GenMAPP.
** Each of the highlighted genes on the nitrogen metabolism pathway was entered into the GenMAPP mapp by entering its gene ID and the name given in the KEGG pathway, and then the expression dataset "bpertussis_expressiondataset_cw20151213" was applied to the genes to color code them.
** Here is the screenshot of the final mapp for the nitrogen cycle pathway created:
* [[File:NitrogencycleGenMAPP.png]]
** This mapp displayed both red and green genes, with green indicating a decrease and red indicating an increase in expression, as well as a couple of gray genes that did not meet either criterion. We created the nitrogen metabolism mapp because metabolic processes, nitrogen metabolism among them, are essential for keeping cells alive and reproducing. The genes that displayed red in this mapp had increased expression during the microarray experiment, and according to the KEGG pathway for nitrogen metabolism, these genes specifically aid in the metabolism of glutamate. Glutamate is important to cells because it plays a role in providing the energy that cells need to operate correctly. Since the glutamate-related genes that we mapped were increased, glutamate may play a role in supplying the underlying energy that allows the ''Bordetella pertussis'' strains to produce the polysaccharide capsule transport proteins studied in the microarray experiment.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

Gene Database Testing Report- cw20151210

2015-12-14T21:36:05Z

Lenaolufson: /* Nitrogen Cycle Kegg Pathway */ added the explanation for the nitrogen cycle pathway

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that recorded when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB Match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"
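
As a rough cross-check of the pattern above, the same regular expression can be applied outside of XMLPipeDB Match. The following is a minimal Python sketch, not the tool's actual implementation. The function name <code>count_unique_ids</code> and the sample text are hypothetical, and the sketch assumes the XML content has already been read into a string.

```python
import re

# Same pattern as in the XMLPipeDB Match command: "BP" plus four digits,
# optionally followed by "A", "B", or ".1".
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def count_unique_ids(text):
    """Return the set of distinct gene IDs found in the given text."""
    return set(GENE_ID_PATTERN.findall(text))


# Hypothetical sample text; a repeated ID is only counted once.
sample = "BP0001 BP0001 BP0910A BP2029B BP3167.1 XBP12"
ids = count_unique_ids(sample)
print(len(ids), sorted(ids))
```

Collecting the matches into a set mirrors the "unique matches" count reported by the tool: repeated occurrences of the same ID do not inflate the total.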

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
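
The deduplicating behavior of "union" in the query above can be illustrated on toy data. The following is a minimal Python sketch using an in-memory SQLite database rather than PostgreSQL. The two simplified tables (<code>locus</code> and <code>ensembl</code>) and the function name are hypothetical stand-ins for the real XMLPipeDB tables, and the <code>REGEXP</code> function is registered by hand because SQLite has no built-in equivalent of PostgreSQL's <code>~</code> operator.

```python
import re
import sqlite3


def count_union_ids(locus_names, ensembl_gene_ids):
    """Count distinct IDs across both sources, as the union query does."""
    conn = sqlite3.connect(":memory:")
    # In SQLite, "X REGEXP Y" calls the user function regexp(Y, X).
    conn.create_function(
        "REGEXP", 2,
        lambda pattern, value: 1 if re.search(pattern, value) else 0,
    )
    # Hypothetical, simplified tables standing in for the real schema.
    conn.execute("create table locus (value text)")
    conn.execute("create table ensembl (value text)")
    conn.executemany("insert into locus values (?)",
                     [(v,) for v in locus_names])
    conn.executemany("insert into ensembl values (?)",
                     [(v,) for v in ensembl_gene_ids])
    (count,) = conn.execute(
        "select count(value) from ("
        " select value from locus"
        " union"
        " select value from ensembl"
        " where value REGEXP 'BP[0-9][0-9][0-9][0-9](A|B)'"
        ") as combined"
    ).fetchone()
    conn.close()
    return count


# IDs present in both sources are counted once; IDs from the second
# source that do not match the pattern are ignored.
print(count_union_ids(["BP0001", "BP0002", "BP0910A"],
                      ["BP0001", "BP0910A", "BP3167A", "BP0003"]))
```

Because "union" (as opposed to "union all") removes duplicate rows, an ID that appears in both sources contributes only once to the final count.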

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID is the form under which UniProt lists "BP3167A".
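
The pattern breakdown of the OrderedLocusNames IDs was done in Excel, but the same tally can be sketched in code. The following is a minimal Python sketch under the assumption that the IDs are available as a list of strings; the function name <code>tally_patterns</code> and the sample IDs are hypothetical. For the counts reported above, the four categories sum to 3434 + 11 + 1 + 1 = 3447.

```python
import re
from collections import Counter

# The four ID patterns described in the visual inspection above.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def tally_patterns(ids):
    """Count how many IDs fully match each pattern; the rest go to 'other'."""
    counts = Counter()
    for gene_id in ids:
        for label, pattern in PATTERNS:
            if pattern.fullmatch(gene_id):
                counts[label] += 1
                break
        else:
            # No pattern matched; flag the ID for manual inspection.
            counts["other"] += 1
    return counts


# Hypothetical sample IDs, one of which matches no expected pattern.
print(tally_patterns(["BP0001", "BP0002", "BP0910A", "BP2029B", "BP3167.1", "bp1"]))
```

Using a full match rather than a search keeps "BP####A" from also being counted under "BP####", and any ID landing in "other" would indicate an unexpected naming pattern.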

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
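
The comparison between the dataset IDs and the gene database IDs was performed in Excel. As an illustration only, the same check can be sketched in Python; the function name <code>split_by_presence</code> and the sample lists are hypothetical. With the numbers reported above, 3552 dataset IDs minus 3211 imported IDs leaves the 341 exceptions.

```python
def split_by_presence(dataset_ids, gdb_ids):
    """Split dataset IDs into those present in and absent from the gene database."""
    known = set(gdb_ids)
    found = [gene_id for gene_id in dataset_ids if gene_id in known]
    missing = [gene_id for gene_id in dataset_ids if gene_id not in known]
    return found, missing


# Hypothetical sample: two of the four dataset IDs are not in the database.
found, missing = split_by_presence(
    ["BP0001", "BP0101", "BP0002", "BP0910A"],
    ["BP0001", "BP0002", "BP0003"],
)
print(len(found), len(missing), missing)
```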

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# After entering both criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
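
The logic of the two criteria can be restated in code. The following is a minimal Python sketch of the thresholds given above, not GenMAPP's own evaluation code; the function name <code>classify_gene</code> and the label "not significant" are hypothetical.

```python
def classify_gene(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Apply the "Increased" and "Decreased" criteria to one gene."""
    if avg_log_fc > fc_cutoff and p_value < p_cutoff:
        return "Increased"   # colored red in the color set
    if avg_log_fc < -fc_cutoff and p_value < p_cutoff:
        return "Decreased"   # colored green in the color set
    return "not significant"  # hypothetical label for genes meeting neither criterion


# Hypothetical example values.
print(classify_gene(0.8, 0.01))
print(classify_gene(-0.6, 0.03))
print(classify_gene(0.8, 0.20))
```

Both conditions must hold for a gene to be colored, so a large fold change with a p value of 0.05 or higher is left uncolored.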

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the KEGG home page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs that pertain to our organism. We then created a MAPP using these genes in GenMAPP.
** Each of the highlighted genes on the ribosome pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color the genes.
**Here is a screenshot of the final MAPP for the ribosome pathway:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating a significant decrease in expression; the remaining genes were gray, meaning that their expression did not change significantly in this experiment. No ribosome genes showed a significant increase. Ribosomes carry out translation, so decreased expression of ribosomal genes suggests a reduced capacity for protein synthesis. The microarray analysis reported that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the ribosome pathway genes is consistent with that global down-regulation, although this MAPP alone does not show whether the reduced ribosomal gene expression is a cause or a consequence of it.

====Nitrogen Cycle Kegg Pathway====
* We created another MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the KEGG home page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we searched for our organism in the drop-down menu at the top of the page, selected ''Bordetella pertussis'' Tohama I, and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs that pertain to our organism. We then created a MAPP using these genes in GenMAPP.
** Each of the highlighted genes on the nitrogen metabolism pathway was added to the MAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color the genes.
** Here is a screenshot of the final MAPP for the nitrogen metabolism pathway:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes, with green indicating a significant decrease and red a significant increase in expression, as well as a couple of gray genes that did not meet either criterion. We chose the nitrogen metabolism pathway because of the important metabolic processes it supports in keeping cells alive and reproducing. According to the KEGG pathway for nitrogen metabolism, the genes that were red on this MAPP (increased expression in the microarray experiment) are involved in the metabolism of glutamate. Glutamate metabolism contributes to the cell's energy supply, so one possible interpretation is that the increased expression of the glutamate-related genes is connected to the energy that ''Bordetella pertussis'' needs to produce the polysaccharide capsule transport proteins studied in the microarray experiment. This MAPP alone does not establish that connection.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Gene Database Testing Report- cw20151210

2015-12-14T21:23:03Z

Lenaolufson: /* Ribosome Kegg Pathway */ added explanation of the ribosome pathway mapp

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run these commands.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.
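
The table count was confirmed by viewing the schema in pgAdmin III. As an alternative, the count could be queried programmatically. The following is a minimal Python sketch under several assumptions: that the tables created by ''gmbuilder.sql'' live in the default <code>public</code> schema, that a DB-API cursor connected to the database is available (for example from the third-party psycopg2 driver, which uses the <code>%s</code> placeholder style shown here), and that the function name <code>count_tables</code> is hypothetical.

```python
def count_tables(cursor, schema="public"):
    """Return the number of base tables in the given schema.

    `cursor` is assumed to be a DB-API cursor for a PostgreSQL connection
    whose driver accepts the %s placeholder style (an assumption).
    """
    cursor.execute(
        "select count(*) from information_schema.tables "
        "where table_schema = %s and table_type = 'BASE TABLE'",
        (schema,),
    )
    return cursor.fetchone()[0]
```

With a cursor connected to ''bpertussis_cw20151210_gmb3build5'', the returned value would be compared against the 167 tables reported above.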

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that recorded when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB Match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
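
The pattern tally above was done by hand in Excel. As a cross-check, the same classification could be sketched in Python as shown below. This is a minimal sketch, not the procedure we used: the function names and the sample list are assumptions made for illustration, and the IDs "BP9999B" and "BP9999.1" are made-up placeholders.

```python
import re
from collections import Counter

# The four ID patterns named in the report; the labels are just display names.
PATTERNS = [
    ("BP####", re.compile(r"BP[0-9]{4}")),
    ("BP####A", re.compile(r"BP[0-9]{4}A")),
    ("BP####B", re.compile(r"BP[0-9]{4}B")),
    ("BP####.1", re.compile(r"BP[0-9]{4}\.1")),
]


def classify(gene_id):
    """Return the label of the pattern that matches the whole ID (case-sensitive)."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "unexpected"


def tally(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in gene_ids)


# Illustrative sample only; the real table has 3447 rows.
sample = ["BP0101", "BP1677", "BP0910A", "BP3167A", "BP9999B", "BP9999.1", "bp1123"]

if __name__ == "__main__":
    for label, count in sorted(tally(sample).items()):
        print(label, count)
```

Any ID reported as "unexpected" would deserve a closer look. The reported counts also add up as expected: 3434 + 11 + 1 + 1 = 3447.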

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
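
The Excel comparison between the exception IDs and the gene database IDs amounts to a set difference. A minimal Python sketch of that comparison follows. The function name is hypothetical, the case-insensitive matching is an assumption, and the two lists are tiny stand-ins for the 341 exception IDs and the 3447-row "OrderedLocusNames" table.

```python
def missing_ids(exception_ids, gdb_ids):
    """Return the exception IDs that do not occur in the gene database.

    IDs are compared after trimming whitespace and upper-casing
    (an assumption; adjust if case matters for your data).
    """
    known = {gene_id.strip().upper() for gene_id in gdb_ids}
    candidates = {gene_id.strip().upper() for gene_id in exception_ids}
    return sorted(candidates - known)


# Example exception IDs quoted in the report, plus a stand-in database list.
exceptions = ["BP0101", "BP1677", "BP0910A", "BP2029A"]
gdb_sample = ["BP1123", "BP3167A"]

if __name__ == "__main__":
    absent = missing_ids(exceptions, gdb_sample)
    print(len(absent), "of", len(exceptions), "exception IDs are absent from the sample")
```

If every exception ID comes back as missing, as it did in our manual check, the errors point to IDs that were never in the source data rather than to a fault in the export.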

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# After entering both criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
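
The two criteria in the "LogFoldChange" color set can be restated as a small Python function, which makes the thresholds easy to test on individual genes. This is only a sketch of the logic, not how GenMAPP evaluates criteria internally. The function name and the "gray" fallback for genes meeting neither criterion are assumptions.

```python
def color_for(avg_log_fc, p_value, fc_cutoff=0.25, p_cutoff=0.05):
    """Mirror the "Increased" and "Decreased" criteria from the color set.

    Increased: [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05  -> red
    Decreased: [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05 -> green
    Anything else is treated here as not significantly changed.
    """
    if p_value < p_cutoff:
        if avg_log_fc > fc_cutoff:
            return "red"
        if avg_log_fc < -fc_cutoff:
            return "green"
    return "gray"


if __name__ == "__main__":
    # Made-up values chosen to exercise each branch.
    for log_fc, p in [(0.80, 0.01), (-0.60, 0.03), (0.80, 0.20), (0.10, 0.01)]:
        print(log_fc, p, color_for(log_fc, p))
```

Both comparisons are strict, so a gene with a log fold change of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.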

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the main page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we chose ''Bordetella pertussis'' Tohama I from the organism drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs specific to our organism.
** Each highlighted gene on the ribosome pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
**A screenshot of the final ribosome pathway MAPP is shown here:
* [[File:RibosomeGenMAPP.png]]
** Most of the ribosome genes on this MAPP were colored green, indicating decreased expression; the remaining genes were grey, indicating no significant change in this experiment. Ribosomes carry out translation, so reduced expression of ribosomal genes suggests a reduced capacity for protein synthesis. The microarray experiment analysis revealed that the absence of a membrane-associated protein named KpsT in ''B. pertussis'' resulted in global down-regulation of gene expression, including key virulence genes. The decreased expression of the genes mapped in the ribosome pathway is consistent with this global down-regulation and links the translation machinery to the down-regulated genes observed in the experiment.

====Nitrogen Cycle Kegg Pathway====
* We created a second MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the main page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we chose ''Bordetella pertussis'' Tohama I from the organism drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs specific to our organism.
** Each highlighted gene on the nitrogen metabolism pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** A screenshot of the final nitrogen cycle pathway MAPP is shown here:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Gene Database Testing Report- cw20151210

2015-12-14T08:35:27Z

Lenaolufson: /* Creating a Pathway-Based MAPP Using Colored Genes */ added protocol for the colored gene genmapps

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*We downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on our computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* We went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, we navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** We clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**We extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, we navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by the browser. Therefore, we had to manually download the file.

====GO OBO-XML====

* We downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
* We extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# We downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# We extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* We launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, we created a new database: ''bpertussis_cw20151210_gmb3build5''.
** We opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** We clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** We clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, we confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, we launched gmbuilder.bat.
* We selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* We selected File > Export to GenMAPP Gene Database... to begin the export process.
* We typed in our coder's name in the owner field (Brandon Klein).
* We selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* We checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, we clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that recorded when creating the previous ''Bordetella pertussis'' gene database, bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* We ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*We entered the project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*We used XMLPipeDB match to identify gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
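
To show what the regular expression in the command above accepts, the sketch below applies the same pattern with Python's <code>re</code> module and collects the unique matches. It is not the XMLPipeDB Match implementation; the function name and the XML-like snippet are made up for the example.

```python
import re

# Same pattern as in the command above. The trailing empty alternative
# lets a bare "BP####" match when no suffix follows.
LOCUS_RE = re.compile(r"BP[0-9][0-9][0-9][0-9](?:A|B|\.1|)")


def unique_matches(text):
    """Return the set of distinct substrings that match the locus pattern."""
    return {match.group(0) for match in LOCUS_RE.finditer(text)}


# Hypothetical snippet, not taken from the real UniProt XML file.
snippet = (
    '<name type="ordered locus">BP0001</name>'
    '<name type="ORF">BP0910A</name>'
    '<property type="gene ID" value="BP9999.1"/>'
    '<name type="ordered locus">BP0001</name>'
)

if __name__ == "__main__":
    print(sorted(unique_matches(snippet)))
```

Because the suffix alternatives are tried before the empty one, an ID such as "BP0910A" is captured whole rather than as "BP0910". Repeated occurrences of the same ID are counted once.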

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Note: This query was crafted by [[User:Dondi|Dr. Dionisio]].

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
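
The query relies on <code>union</code> removing duplicate values, so an ID that appears both as an ordered locus name and as an EnsemblBacteria gene ID is counted once. The sketch below illustrates this behavior with Python's built-in <code>sqlite3</code> module. It is a toy under stated assumptions: the two tables are simplified stand-ins for the real XMLPipeDB schema, the rows are made up, and SQLite's <code>GLOB</code> replaces the PostgreSQL <code>~</code> operator (<code>GLOB</code> must match the whole value, whereas <code>~</code> matches anywhere in it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for genenametype rows where type = 'ordered locus'.
    CREATE TABLE locus_names (value TEXT);
    -- Stand-in for EnsemblBacteria 'gene ID' property values.
    CREATE TABLE ensembl_gene_ids (value TEXT);
    INSERT INTO locus_names VALUES ('BP0001'), ('BP0002'), ('BP0003A');
    INSERT INTO ensembl_gene_ids VALUES ('BP0003A'), ('BP3167A'), ('BP0002');
    """
)

QUERY = """
    SELECT count(value) FROM (
        SELECT value FROM locus_names
        {operator}
        SELECT value FROM ensembl_gene_ids
        WHERE value GLOB 'BP[0-9][0-9][0-9][0-9][AB]'
    ) AS combined
"""

# UNION drops the duplicate 'BP0003A'; UNION ALL keeps it.
union_count = conn.execute(QUERY.format(operator="UNION")).fetchone()[0]
union_all_count = conn.execute(QUERY.format(operator="UNION ALL")).fetchone()[0]

if __name__ == "__main__":
    print("UNION:", union_count, "UNION ALL:", union_all_count)
```

In this toy data the two counts differ by one because "BP0003A" appears in both tables. Using <code>union</code> rather than <code>union all</code> is what keeps the count comparable to the number of unique matches from XMLPipeDB Match.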

===OriginalRowCounts Comparison===

We opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, we copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
We visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6 character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
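
The RefSeq prefix counts reported above can be reproduced with a short tally once the ID column has been exported from Access. The following Python sketch assumes the IDs are available as a list of strings. The function names are hypothetical, and apart from "NP_881255" (used on the sample MAPP) the IDs in the sample are made-up placeholders.

```python
from collections import Counter

# Prefixes the report expects to see in the RefSeq table.
EXPECTED_PREFIXES = ("NP_", "YP_", "WP_")


def prefix_of(refseq_id):
    """Return the expected prefix an ID starts with, or 'other'."""
    for prefix in EXPECTED_PREFIXES:
        if refseq_id.startswith(prefix):
            return prefix
    return "other"


def prefix_counts(refseq_ids):
    """Count IDs per prefix; a nonzero 'other' count flags unexpected values."""
    return Counter(prefix_of(refseq_id) for refseq_id in refseq_ids)


# Illustrative sample, not the real 6627-row table.
sample_ids = ["NP_881255", "NP_000001", "YP_000001", "WP_000000001", "XX_000001"]

if __name__ == "__main__":
    print(dict(prefix_counts(sample_ids)))
    # The reported counts sum to the reported table size.
    print(3410 + 7 + 3210)
```

The reported counts are internally consistent: 3410 "NP_" IDs, 7 "YP_" IDs, and 3210 "WP_" IDs sum to the 6627 entries in the table.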

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

We made a sample MAPP in which gene IDs conforming to the naming conventions of the 5 major gene databases containing ''Bordetella pertussis'' genome data were added. A screenshot of the resulting MAPP is provided below:
[[File:Samplegenemapp.png]]
*Gene IDs:
** '''bp1123''' refers to the OrderedLocusNames gene ID system.
** '''CAE43716''' refers to the EnsemblBacteria gene ID system.
** '''Q7VWE5''' refers to the UniProt gene ID system.
** '''2665491''' refers to the GeneID system.
** '''NP_881255''' refers to the RefSeq gene ID system.

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[File:Bpertussis compiledrawdata cw20151208.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: BP0101, BP1677, BP0910A, and BP2029A.
****Searching for any of these gene IDs in UniProt returns the message "Sorry, no results found for your search term.":
*****[[File:ErroneousID Uniprot cw20151210.PNG]]
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
We customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# We created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*We selected the color for this criterion as red using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, we created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*We specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*We activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*We selected the color for this criterion as green using the color box.
#*We stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# After entering both criteria, we saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated our .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
====Ribosome Kegg Pathway====
* We created a MAPP of the ribosome pathway using the genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the main page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Ribosome" under section 2.2 Translation and selected it.
** Then, we chose ''Bordetella pertussis'' Tohama I from the organism drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the ribosome pathway with the gene IDs specific to our organism.
** Each highlighted gene on the ribosome pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
**A screenshot of the final ribosome pathway MAPP is shown here:
* [[File:RibosomeGenMAPP.png]]
** All of the ribosome genes on this MAPP were colored green, indicating decreased expression.
====Nitrogen Cycle Kegg Pathway====
* We created a second MAPP using the nitrogen metabolism pathway genes listed on the KEGG website (http://www.genome.jp/kegg/).
** On the main page, we selected KEGG PATHWAY.
** Next, we scrolled down to "Nitrogen Metabolism" under section 1.2 Energy Metabolism and selected it.
** Then, we chose ''Bordetella pertussis'' Tohama I from the organism drop-down menu at the top of the page and clicked "Go".
** This led us to a page showing the nitrogen metabolism pathway with the gene IDs specific to our organism.
** Each highlighted gene on the nitrogen metabolism pathway was added to the MAPP in GenMAPP by entering its gene ID and the name given in the KEGG pathway. The expression dataset "bpertussis_expressiondataset_cw20151213" was then applied to color-code the genes.
** A screenshot of the final nitrogen cycle pathway MAPP is shown here:
* [[File:NitrogencycleGenMAPP.png]]
** This MAPP displayed both red and green genes: green indicates decreased expression and red indicates increased expression.

===Running MAPPFinder===
*MAPPFinder Procedure
** We launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** We clicked on the button "Calculate New Results" followed by "Find File", at which point we specified the .gex file updated during the creation of the "LogFoldChange" color set.
** We chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** We checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**We selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]
***The majority of the most significant gene ontology terms pertained to ribosome biosynthesis and translation.

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Bordetella Pertussis GenMAPP Analysis Deliverables

2015-12-14T08:11:12Z

Lenaolufson: /* Group Files and Datasets */ added the filtered results for criterion0 from genmapp

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species (''.gdb''): [[File:Bpertussis-std cw20151210.zip]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx''): [[File:Bpertussis compiledrawdata cw20151208.xlsx]]
* Data file used for import into GenMAPP (''.txt'' or ''.csv''): [[File:Bpertussis compiledrawdata cw20151208.txt]]
* GenMAPP Expression Dataset file (''.gex''): [[File:Bpertussis expressiondataset cw20151213.gex]]
* Exceptions file of data imported into GenMAPP (''.EX.txt''): [[File:Bpertussis expressiondataset exceptions cw20151213.EX.txt]]
* Raw MAPPFinder results files (''-GO.txt''):
** [[File:Bpertussis mappfinderresults cw20151213-criterion0-GO.txt]]
** [[File:Bpertussis mappfinderresults cw20151213-criterion1-GO.txt]]
* ''.gmf'' file: [[File:Bpertussis compiledrawdata cw20151213.gmf]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx''):
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion1-GO.xlsx]]
** [[File:Bpertussis mappfinderresults filtered cw20151213-Criterion0-GO.xlsx]]
* Sample MAPP file of a relevant biological pathway for your species (''.mapp''): [[File:Bpertussis ribosomepathway cw20151215.mapp]]
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

File:Bpertussis mappfinderresults filtered cw20151213-Criterion0-GO.xlsx

2015-12-14T08:10:25Z

Lenaolufson: mappfinder filtered results for criterion0

mappfinder filtered results for criterion0

Gene Database Testing Report- cw20151210

2015-12-14T07:53:06Z

Lenaolufson: /* Putting a Gene on the MAPP Using the GeneFinder Window */ added the gene Id systems relating to the specific genes on the mapp

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip], a tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*I downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on my computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* I went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, I navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** I clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**I extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, I navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.

====GO OBO-XML====

* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
*I extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# I downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* I launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, I created a new database: ''bpertussis_cw20151210_gmb3build5''.
** I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** I clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** I clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, I launched gmbuilder.bat.
* I selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* I selected File > Export to GenMAPP Gene Database... to begin the export process.
* I typed my name in the owner field (Brandon Klein).
* I selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, I clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* I ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*I entered my project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*I used XMLPipeDB match to identify matches of gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".
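
The behavior of the pattern itself can be checked with a short Python sketch. This is an illustration only: the <code>count_unique_matches</code> helper and the sample text are hypothetical, and the sketch does not claim to reproduce how XMLPipeDB Match is implemented.

```python
import re

# Same pattern as the one passed to XMLPipeDB Match above:
# "BP", four digits, then an optional "A", "B", or ".1" suffix.
GENE_ID_PATTERN = re.compile(r"BP[0-9][0-9][0-9][0-9](A|B|\.1|)")


def count_unique_matches(text):
    """Return the set of distinct substrings of text that match the pattern."""
    return {match.group(0) for match in GENE_ID_PATTERN.finditer(text)}


# Hypothetical snippet standing in for the contents of the UniProt XML file.
sample = (
    "<name>BP0001</name><name>BP0001</name>"
    "<name>BP3167A</name><name>BP1234.1</name>"
)
unique_ids = count_unique_matches(sample)
print(len(unique_ids), sorted(unique_ids))
```

Because the result is a set, an ID that appears several times in the file is counted only once, which is the sense in which the count above is a count of unique matches.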

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.
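
The role of "union" in this query can be sketched on toy data. The example below is an assumption-laden illustration: it uses Python's built-in <code>sqlite3</code> module instead of PostgreSQL, two simplified hypothetical tables (<code>ordered_locus</code> and <code>ensembl_gene_id</code>) instead of the real XMLPipeDB schema, and a user-defined <code>regexp</code> function in place of the PostgreSQL "~" operator.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back the SQL REGEXP operator, which SQLite does not implement itself."""
    return value is not None and re.search(pattern, value) is not None


def build_toy_database():
    """Create an in-memory database with two small, made-up ID tables."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("regexp", 2, regexp)
    conn.executescript("""
        create table ordered_locus (value text);
        create table ensembl_gene_id (value text);
        insert into ordered_locus values ('BP0001'), ('BP0002'), ('BP0840A');
        insert into ensembl_gene_id values ('BP0001'), ('BP0840A'), ('BP3167A');
    """)
    return conn


def count_combined(conn):
    """Count distinct IDs across both tables; UNION drops duplicate rows."""
    query = """
        select count(value) from (
            select value from ordered_locus
            union
            select value from ensembl_gene_id
            where value regexp 'BP[0-9][0-9][0-9][0-9](A|B)'
        ) as combined
    """
    return conn.execute(query).fetchone()[0]


print(count_combined(build_toy_database()))
```

In this toy data, "BP0840A" appears in both tables but is counted once, so the combined count is 4. This de-duplication is why "union" (rather than "union all") is the appropriate operation when counting unique gene IDs.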

===OriginalRowCounts Comparison===

I opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
I visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conform to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****These prefixes refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****This prefix refers to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****This included 10 ORF gene IDs & "BP3167A" (reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the manner in which UniProt classified "BP3167A".
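
The Excel-based breakdown above could also be reproduced programmatically. The following Python sketch is one possible approach, not the method used in this report: the <code>classify</code> and <code>tally</code> helpers are hypothetical, and every example ID other than "BP3167A" is made up for illustration.

```python
import re
from collections import Counter

# One regular expression per pattern named in the breakdown above.
PATTERNS = [
    ("BP####", re.compile(r"BP\d{4}")),
    ("BP####A", re.compile(r"BP\d{4}A")),
    ("BP####B", re.compile(r"BP\d{4}B")),
    ("BP####.1", re.compile(r"BP\d{4}\.1")),
]


def classify(gene_id):
    """Return the label of the pattern that the whole ID matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "unexpected"


def tally(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in gene_ids)


# Hypothetical IDs, except "BP3167A", which is discussed above.
example_ids = ["BP0001", "BP0002", "BP3167A", "BP0840B", "BP1234.1", "XX0001"]
print(tally(example_ids))

# The reported category counts add up to the reported table size.
assert 3434 + 11 + 1 + 1 == 3447
```

Any ID labeled "unexpected" by such a tally would warrant a closer look; none were reported in the inspection above.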

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

*Screenshot of all of the sample IDs on a MAPP:
* [[File:Samplegenemapp.png]]
** Here, each of the different gene ID systems was used to place a gene on the sample MAPP.
** bp1123 refers to the OrderedLocusNames gene ID system
** CAE43716 refers to the EnsemblBacteria gene ID system
** Q7VWE5 refers to the UniProt gene ID system
** 2665491 refers to the GeneID system
** NP_881255 refers to the RefSeq gene ID system

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that no unique gene ID patterns were the cause of these errors.
***Example gene IDs that triggered this error are the following: ______, _______A, etc.
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
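
The comparison between the exception IDs and the gene database IDs amounts to a set difference. A minimal Python sketch of that idea follows; the <code>find_missing</code> helper and the example IDs are hypothetical, and the actual comparison in this report was done in Excel.

```python
def find_missing(dataset_ids, database_ids):
    """Return the dataset IDs (sorted) that do not appear in the gene database."""
    known = set(database_ids)
    return sorted(gene_id for gene_id in set(dataset_ids) if gene_id not in known)


# Hypothetical IDs standing in for the microarray dataset and the .gdb file.
dataset = ["BP0001", "BP9999", "BP0002", "BP9998A"]
database = ["BP0001", "BP0002"]
print(find_missing(dataset, database))

# The reported totals are consistent: imported IDs plus exceptions
# equals the size of the microarray dataset.
assert 3552 - 3211 == 341
```

This sketch uses exact, case-sensitive matching; whether GenMAPP itself compares IDs that way is not established here.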

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
I customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
# I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*I selected the color for this criterion as red using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*I selected the color for this criterion as green using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
# Upon entering these criteria, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated my .gex file with the new Color Set.
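
The two criteria can be restated as a small Python function to make the thresholds explicit. This is a sketch only: the <code>classify_gene</code> helper and the "No criteria met" label are hypothetical names, and GenMAPP evaluates the criteria internally from the expressions entered in the Criteria Builder.

```python
def classify_gene(avg_logfc_all, pvalue):
    """Apply the "Increased" and "Decreased" criteria to one gene's values."""
    # "Increased": [Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05
    if avg_logfc_all > 0.25 and pvalue < 0.05:
        return "Increased"
    # "Decreased": [Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05
    if avg_logfc_all < -0.25 and pvalue < 0.05:
        return "Decreased"
    # Label chosen for this sketch only.
    return "No criteria met"


# Made-up values for illustration.
for logfc, p in [(0.5, 0.01), (-0.5, 0.01), (0.5, 0.2), (0.1, 0.01)]:
    print(logfc, p, classify_gene(logfc, p))
```

A gene with a large fold change but a p value of 0.05 or higher meets neither criterion, as does a gene with a significant p value but a log fold change between -0.25 and 0.25.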

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.

====Creating a Pathway-Based MAPP Using Colored Genes====
* [[File:RibosomeGenMAPP.png]]
* [[File:NitrogencycleGenMAPP.png]]

===Running MAPPFinder===
*MAPPFinder Procedure
** I launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** I clicked on the button "Calculate New Results" followed by "Find File", at which point I specified the .gex file updated during the creation of the "LogFoldChange" color set.
** I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**I selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

=== Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]

Gene Database Testing Report- cw20151210

2015-12-14T07:46:22Z

Lenaolufson: /* Putting a Gene on the MAPP Using the GeneFinder Window */ edited the screenshot for the sample Id's to the correct mapp

==Files Asked for in the Gene Database Testing Report==

For convenience, all of the files explicitly asked for in the sections below were compressed together in this file: [[File:Testingreport cw20151210.zip]]

==Pre-requisites==

The following set of software was used in the creation and testing of the ''Bordetella pertussis'' gene database:
# [http://www.7-zip.org/ 7-zip], a tool for unpacking .gz and .zip files
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
# Java JDK 1.8 64-bit
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
# Microsoft Access for reading .mdb files

==Gene Database Creation==
===Downloading Data Source Files and GenMAPP Builder===

*I downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
**All files were saved to the folder ''Bklein7_CW\bpertussis_cw20151210'' on my computer's ThawSpace.
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.

====UniProt XML====

* I went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
**From there, I navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
** I clicked on the "Download" button at the top of the page above and selected the following options:
***"Download all"
***"XML" from the "Format" drop-down menu
***"Compressed" format
**I extracted the file using [http://www.7-zip.org/ 7-zip].

====GOA====

* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
*Within the above site, I navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.

====GO OBO-XML====

* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
*I extracted the file using [http://www.7-zip.org/ 7-zip].

====Downloaded GenMAPP Builder====

# I downloaded the custom version of GenMAPP Builder including the most recent version of the ''Bordetella pertussis'' custom class (Version 3.0.0 Build 5 - cw20151210): [[File:Dist cw20151210.zip]].
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].

===Creating the New Database in PostgreSQL===

* I launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
** On this server, I created a new database: ''bpertussis_cw20151210_gmb3build5''.
** I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
*** I clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
*** I clicked on the Execute Query icon to run this command.
***In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.

===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===

* To begin, I launched gmbuilder.bat.
* I selected the "Configure Database" option and entered the following information into the fields below:
** Host or address: localhost
** Port number: 5432
** Database name: bpertussis_cw20151210_gmb3build5
** Username: postgres
** Password: Welcome1

===Importing Data into the PostgreSQL Database===

*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
** Selected File > Import UniProt XML...
** Selected File > Import GO OBO-XML...
** Clicked OK to the message asking to process the GO data.
** Selected File > Import GOA...

===Exporting a GenMAPP Gene Database (.gdb)===

* I selected File > Export to GenMAPP Gene Database... to begin the export process.
* I typed my name in the owner field (Brandon Klein).
* I selected the custom profile "Bordetella pertussis, Taxon ID 257313" as the gene database species and then clicked ''Next''.
* The database was saved as ''bpertussis-std_cw20151210''.
* I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
* Finally, I clicked the "Next" button to begin the export process.

==Gene Database Testing Report==
===Export Information===

Version of GenMAPP Builder: Version 3.0.0 Build 5 - cw20151210

Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room

Postgres Database name: bpertussis_cw20151210_gmb3build5

UniProt XML filename: [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
* UniProt XML version (The version information was found at [http://uniprot.org/news the UniProt News Page]): 2015_12
* UniProt XML download link: [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)]
* Time taken to import: 2.88 minutes
** Note: The import time was similar to that when creating the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (2.59 minutes). No interruptions occurred during this process.

GO OBO-XML filename: [[File:Go daily-termdb cw20151210.zip]]
* GO OBO-XML version (The version information was found in the file properties): Last Modified- December 10, 2015 (TIME?)
* GO OBO-XML download link: [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page]
* Time taken to import: 6.97 minutes
* Time taken to process: 4.52 minutes
** Note: The import and processing times were similar to those for the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (7.08 minutes and 4.42 minutes respectively). No interruptions occurred during these processes.

GOA filename: [[File:145.B pertussis ATCC BAA-589 cw20151210.zip]]
* GOA version (found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): Last Modified- 08-Dec-2015 02:45
* GOA download link: [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I]
* Time taken to import: 0.03 minutes
** Note: The import time was very similar to that of the previous "Bordetella pertussis" gene database: bpertussis-std_cw20151203.gdb (0.04 minutes). No interruptions occurred during this process.

Name of .gdb file: [[File:Bpertussis-std cw20151210.zip]]
* Time taken to export:
** Start time: 1:19 AM
** End time: 2:11 AM
** Elapsed time: 52 minutes
Note: No interruptions occurred during the export process.

===TallyEngine===
* I ran the TallyEngine in GenMAPP Builder and specified the following files:
**XML- [[File:Uniprot-proteome-UP000002676 cw20151210.zip]]
**GO- [[File:Go daily-termdb cw20151210.zip]]
*Results:
**[[File:TallyEngineResults cw20151210.png]]
***All TallyEngine results were consistent across both files.
***The TallyEngine was not customized to reflect the coding changes made to GenMAPP Builder Version 3.0.0 Build 5 - cw20151210.
****Therefore, the total count for "Ordered Locus Names" and "ORF" gene IDs remained 3446. The extra ID that was imported in this build, "BP3167A", was not listed in either of these categories.
****'''Further TallyEngine customization is necessary to raise the count to 3447 gene IDs.'''

===Using XMLPipeDB Match to Validate the XML Results from the TallyEngine===
The following functions were performed using the Windows command line (cmd).
*I entered my project folder using the following command:
cd /d T:\Bklein7_CW\bpertussis_cw20151210
*I used XMLPipeDB match to identify matches of gene IDs in the UniProt XML file that conformed to the following patterns: "BP####", "BP####.1", "BP####A", and "BP####B". The command used was as follows:
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9](A|B|\.1|)" < "uniprot-proteome%3AUP000002676_cw20151201.xml"

Match Results:
*[[File:Xmlpipedbmatch cw20151203.png]]
**The number of unique matches generated by XMLPipeDB Match, 3447, matched with our expectation. The count includes the total number of ordered locus (3435) and ORF (11) gene IDs along with the unique EnsemblBacteria reference ID "BP3167A".

===Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
We used the SQL "union" operation to count the number of "ordered locus" gene IDs, which conform to the pattern "BP####", in addition to all gene IDs that matched the patterns "BP####A" & "BP####B" (including 11 "ORF" gene IDs and 1 EnsemblBacteria reference ID):

select count(value) from (select value from genenametype where type =
'ordered locus' union select value from propertytype inner join dbreferencetype
on (propertytype.dbreferencetype_property_hjid = dbreferencetype.hjid)
where dbreferencetype.type = 'EnsemblBacteria' and propertytype.type =
'gene ID' and propertytype.value ~ 'BP[0-9][0-9][0-9][0-9](A|B)') as combined;

Results:
*[[File:PostgreSQL Count cw20151210.png]]
* The number of unique matches yielded by this SQL query, 3447, matched the count generated by XMLPipeDB Match. Thus, the locations of all 3447 gene IDs in the PostgreSQL relational database were accounted for here.

===OriginalRowCounts Comparison===

I opened the gene database file [[File:Bpertussis-std_cw20151210.zip]] in Microsoft Access and assessed the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.

Benchmark .gdb file: [[File:Vc-Std 20151027 TR.gdb]]

"OriginalRowCounts" table from the benchmark and new gdb:
*[[File:ComparisonToBenchmark cw20151210.PNG]]
**All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' gene database, ''bpertussis-std_cw20151210''. This confirmed that all expected tables were successfully created.
**The "OrderedLocusNames" table count is listed as 3447. '''This count demonstrates that the missing ID, "BP3167A", was successfully added to the export (confirmed below).'''
***[[File:BP3167A Confirmed cw20151210.PNG]]

Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row count for ''bpertussis-std_cw20151210'' is highlighted in yellow.

===Visual Inspection===
I visually inspected individual tables within [[File:Bpertussis-std_cw20151210.zip]] using Microsoft Access.

*Systems Table
**35 gene ID systems were listed, 11 of which were used in the creation of this .gdb file and listed the appropriate import date (12/10/2015).
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
**The "OrderedLocusNames" listing properly displayed customizations to the ''Bordetella pertussis'' species profile.
***In this row, the species was listed correctly as "Bordetella pertussis".
***In this row, the link corresponded to the ''Bordetella pertussis'' database at GeneDB. The link was as follows: http://www.genedb.org/gene/~;jsessionid=A06A0EFE93C64E476380393D4CBEFA69?actionName=%2FQuery%2FquickSearch&resultsSize=1&taxonNodeName=Bpertussis.
*UniProt Table
**This table contained 3258 entries with 6-character IDs.
**All IDs in the UniProt table conformed to the following pattern:
*** [[File:UniProt Ascension Number info.PNG]]
*RefSeq Table
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes are given in the [http://www.ncbi.nlm.nih.gov/books/NBK50679/ RefSeq documentation].
***"NP_" and "YP_" Prefixes
****Refer to proteins. There were 3410 "NP_" IDs and 7 "YP_" IDs.
***"WP_" Prefixes
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefix.
***Overall, every entry in the ID column was an expected value.
*OrderedLocusNames Table
**This table contained 3447 entries (consistent with the XMLPipeDB Match result).
**The IDs were copied into an Excel document for analysis:
***3434 IDs conformed to the pattern "BP####".
***11 IDs conformed to the pattern "BP####A".
****These included 10 ORF gene IDs and "BP3167A" (a reference to an EnsemblBacteria ID).
***1 ID exhibited the pattern "BP####B".
****This corresponded to an ORF gene ID.
***1 ID exhibited the pattern "BP####.1".
****This ID was the form in which UniProt recorded "BP3167A".
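
The pattern tally for the OrderedLocusNames IDs was done in Excel. As an illustration only, the same classification could be sketched with regular expressions. The four patterns come from the list above; the function names and every sample ID other than "BP3167A" are made up for this example.

```python
import re
from collections import Counter

# The four ID shapes reported for the OrderedLocusNames table.
PATTERNS = [
    ("BP####", re.compile(r"BP\d{4}")),
    ("BP####A", re.compile(r"BP\d{4}A")),
    ("BP####B", re.compile(r"BP\d{4}B")),
    ("BP####.1", re.compile(r"BP\d{4}\.1")),
]


def classify(gene_id):
    """Return the label of the first pattern the whole ID matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(gene_id):
            return label
    return "unexpected"


def tally(gene_ids):
    """Count how many IDs fall under each pattern label."""
    return Counter(classify(gene_id) for gene_id in gene_ids)


# "BP3167A" appears in this report; the other IDs are hypothetical.
sample_ids = ["BP0001", "BP0002", "BP3167A", "BP0003B", "BP0004.1", "XY0005"]
for label, count in sorted(tally(sample_ids).items()):
    print(label, count)
```

Any ID labeled "unexpected" would warrant a closer look. In the actual table the four pattern counts (3434, 11, 1, and 1) add up to the 3447 entries reported.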

==bpertussis-std_cw20151210.gdb Use in GenMAPP==

The following analysis was conducted in GenMAPP Version 2.1. Within GenMAPP, the ''Bordetella pertussis'' gene database was loaded by selecting Data > Choose Gene Database and then selecting the file ''bpertussis-std_cw20151210.gdb''.

===Putting a Gene on the MAPP Using the GeneFinder Window===

*Screenshot of all of the sample IDs on a MAPP:
* [[File:Samplegenemapp.png]]

Note: Gene IDs tested from the above gene ID systems all had complete Backpages and were successfully placed on the MAPP.

===Creating an Expression Dataset in the Expression Dataset Manager===
The file [[.txt]] was used to create an expression dataset in GenMAPP.

*Total Number of Gene IDs Imported
** 3211 of the 3552 gene IDs from the microarray dataset were imported into the expression dataset.
**There were 341 exceptions during the creation of the expression dataset. A screenshot of the error message is shown here:
***[[File:Errors in genmapp.png]]
*Investigating Errors in the Exceptions File (EX.txt)
**All 341 exceptions triggered the following error message: "Gene not found in OrderedLocusNames or any related system."
**Gene IDs that triggered this error message conformed to the patterns "BP####" and "BP####A", indicating that the errors were not caused by unusual gene ID patterns.
***Example gene IDs that triggered this error are the following: ______, _______A, etc.
***The 341 gene IDs were copied into a new Excel file and compared to the gene IDs present in the file [[File:Bpertussis-std_cw20151210.zip]] (adapted from the "OrderedLocusNames" table in Microsoft Access).
****None of the 341 gene IDs were present in the .gdb file.
***The 341 gene IDs were each individually searched for in UniProt.
****None of the 341 gene IDs retrieved results in UniProt.
**'''Conclusion: None of the gene IDs that triggered errors were present in the original UniProt XML file.'''
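
The check of the exception IDs against the gene database was done in Excel. A rough scripted equivalent is sketched below, assuming both ID lists have been exported as plain lists of strings. The function name <code>split_by_presence</code> and all sample IDs are hypothetical placeholders, not the real exceptions.

```python
def split_by_presence(exception_ids, database_ids):
    """Split exception IDs into those found in the database and those absent."""
    known = {gene_id.strip() for gene_id in database_ids}
    found = [g for g in exception_ids if g.strip() in known]
    absent = [g for g in exception_ids if g.strip() not in known]
    return found, absent


# Hypothetical IDs for illustration only.
database_ids = ["BP0001", "BP0002", "BP3167A"]
exception_ids = ["BP9001", "BP9002A"]

found, absent = split_by_presence(exception_ids, database_ids)
print("Present in database:", found)
print("Absent from database:", absent)

# Sanity check on the import totals reported above.
total_ids, imported_ids = 3552, 3211
print("Expected number of exceptions:", total_ids - imported_ids)
```

An empty "found" list matches the result reported above, where none of the 341 exception IDs appeared in the .gdb file.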

===Coloring a MAPP with Expression Data===

====Creating a New Color Set====
I customized the new Expression Dataset by creating a new color set entitled "LogFoldChange".
#First, I created a criterion for this color set to label genes that demonstrated a significant ''increase'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Increased".
#*I selected the color for this criterion as red using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] > 0.25 AND [Pvalue] < 0.05</code>.
#Second, I created a criterion for this color set to label genes that demonstrated a significant ''decrease'' in their expression.
#*I specified the gene value as "Avg_ABC_Samples" for the ''Bordetella pertussis'' microarray dataset.
#*I activated the ''Criteria Builder'' by clicking the ''New'' button and named the criterion "Decreased".
#*I selected the color for this criterion as green using the color box.
#*I stated the criterion as follows and added it to the Criteria List: <code>[Avg_LogFC_all] < -0.25 AND [Pvalue] < 0.05</code>
#After entering both criteria, I saved the entire Expression Dataset by selecting Save from the Expression Dataset menu. This updated my .gex file with the new Color Set.

Screenshot of Color Set criteria:
*[[File:Expressioncolorset.png]]

Note: No errors were encountered in the creation of the Color Set.
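
To make the logic of the two criteria explicit, the sketch below applies the same thresholds to a single gene outside of GenMAPP. It is not GenMAPP code. The function name, the "No criteria met" label, and the example values are assumptions for illustration, while the cutoffs (0.25 and 0.05) are the ones entered in the Criteria Builder above.

```python
def classify_gene(avg_logfc_all, pvalue, fc_cutoff=0.25, p_cutoff=0.05):
    """Apply the "Increased" and "Decreased" criteria to one gene.

    Both comparisons are strict, mirroring the > and < operators
    used in the criteria.
    """
    if avg_logfc_all > fc_cutoff and pvalue < p_cutoff:
        return "Increased"  # colored red in the color set
    if avg_logfc_all < -fc_cutoff and pvalue < p_cutoff:
        return "Decreased"  # colored green in the color set
    return "No criteria met"


# Hypothetical (Avg_LogFC_all, Pvalue) pairs for illustration only.
examples = [(0.80, 0.01), (-0.60, 0.03), (0.80, 0.20), (0.10, 0.01)]
for logfc, p in examples:
    print(logfc, p, classify_gene(logfc, p))
```

Because the comparisons are strict, a gene with a log fold change of exactly 0.25 or a p value of exactly 0.05 meets neither criterion.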

====Creating a Pathway-Based MAPP Using Colored Genes====
* [[File:RibosomeGenMAPP.png]]
* [[File:NitrogencycleGenMAPP.png]]

===Running MAPPFinder===
*MAPPFinder Procedure
** I launched the MAPPFinder program from within GenMAPP and ensured that the ''bpertussis-std_cw20151210.gdb'' gene database was still loaded into GenMAPP.
** I clicked on the button "Calculate New Results" followed by "Find File", at which point I specified the .gex file updated during the creation of the "LogFoldChange" color set.
** I chose to apply both the "Increased" and "Decreased" criteria present within the LogFoldChange color set to the data.
** I checked the boxes next to "Gene Ontology" and "p value", specified the results file, and then clicked "Run MAPPFinder".
***This analysis took several minutes to complete.
*MAPPFinder Analysis Results
**I selected "Show Ranked List" to see a list of the most significant Gene Ontology terms. A screenshot of this output is shown below:
**[[File:Mappfinderrankedlist.png]]

Note: The MAPPFinder analysis took approximately 8 minutes to complete. No errors were encountered in the process. MAPPFinder thus was confirmed to work with the ''Bordetella pertussis'' gene database.

===Compare Gene Database to Outside Resource===

[[Category: Class Whoopers]]