Difference between revisions of "Bklein7 Week 12"

From LMU BioDB 2015
(expanded gene database testing results)
(added presentation pdf)
 
(10 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==TO DO==
+
==Genome Sequencing Paper PowerPoint Presentation==
*Check Export!
+
Presentation File (PDF): [[File:Genomepaper cw20151116.pdf]]
*ZIP and upload files used in creating the GDB
+
==Quality Assurance Work==
*Create a new page for the gene testing report (_cw20151119) and link to it from this page
+
I created and tested our first ''Bordetella pertussis'' gene database, tagged '''cw20151119'''.
*Follow instructions on the Coder page and update electronic lab notebook
+
*Gene Database v1: [[File:Bpertussis-std cw20151119.zip]]
**Consider dividing this page into coder and QA sections
+
*[[Gene Database Testing Report- cw20151119]] (I authored sections 1-4.6. Lena authored sections 4.7 & 4.8)
 +
Work Log:
 +
*Thursday, November 19: I followed the import-export process for the creation of our first ''Bordetella pertussis'' gene database. My protocol was documented on the Gene Database Testing Report page for this database.
 +
*Monday, November 23: I accessed the exported database and went through counting protocols to evaluate its content. Results were posted on the Gene Database Testing Report page for this database.
  
==Files Asked for in the Gene Database Testing Report==
+
==Coder Work==
For convenience, all of the files explicitly asked for in the "Gene Database Testing Report" section were compressed together in this file:
+
Before proceeding, I designated my personal laptop as my development computer.
 +
=== GitHub Repository Clone Setup ===
 +
GitHub Information:
 +
*My GitHub account: [https://github.com/bklein7 bklein7]
 +
*Projects in which I am listed as a developer: [https://github.com/lmu-bioinformatics/xmlpipedb LMU Bioinformatics XMLPipeDB Project]
 +
**My Team: [https://github.com/orgs/lmu-bioinformatics/teams/the-class-whoopers The Class Whoopers]
 +
**My Branch: [https://github.com/lmu-bioinformatics/xmlpipedb/tree/b-pertussis b-pertussis]
 +
GitHub Clone Setup:
 +
#I designated a folder on my Desktop entitled "B. pertussis Project" as the location for my local copy of the [https://github.com/lmu-bioinformatics/xmlpipedb XMLPipeDB GitHub repository]. To enter this location, I opened Terminal and used the following command: <pre>cd /Users/brandonklein/Desktop/B.\ pertussis\ Project</pre>
 +
#I cloned the repository:<pre>git clone https://github.com/lmu-bioinformatics/xmlpipedb.git</pre>
 +
#I entered the clone folder: <pre>cd xmlpipedb</pre>
 +
#I switched to my branch:<pre>git checkout b-pertussis</pre>
  
==Pre-requisites==
+
=== “Developer Rig” Setup and Initial As-Is Build ===
The following set of software was used in the creation and testing of the ''Vibrio cholerae'' gene database:
+
Necessary software was downloaded:
 +
* Java developer tools: [http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html JDK 8] (which, at this writing, is ''JDK 8u65'')
 +
* Any tool that can unpack .gz and .zip files: [http://www.kekaosx.com/en/ Keka] (listed as an equivalent software to [http://www.7-zip.org/ 7-zip] for Mac OS X)
 +
* ''XMLPipeDB Match'' utility
 +
* Development environment: [http://www.eclipse.org Eclipse IDE for Java EE Developers]
 +
==== Eclipse Workspace Setup ====
 +
# I ran Eclipse.
 +
# When prompted to specify a Workspace, I selected my "xmlpipedb" repository clone folder and clicked "Open".
 +
# I verified that my repository clone folder was listed as the ''Workspace:'' and clicked ''OK''.
 +
# When presented with the introductory display, I clicked the ''Workbench'' button.
 +
#* This took me to an empty developer area:
 +
#* [[File:Screen Shot 2015-11-23 at 5.10.28 PM.png]]
 +
==== Java Project Setup ====
 +
# I right-clicked within the empty ''Project Explorer'' tab and chose '''New > Project…''' from the menu that appeared.
 +
# I chose ''Java Project'' from the list of “wizards” and clicked on the ''Next >'' button.
 +
# On the next panel, I entered <code>gmbuilder</code> as the ''Project name:''.
 +
#* The ''JRE'' section showed Java 1.8, confirming that my version of Java was up to date.
 +
# I clicked the ''Finish'' button.
 +
#* When asked if I wanted to open the “Java perspective,” I responded with ''Yes''.
 +
# The ''gmbuilder'' project folder was now visible in the ''Project Explorer'' tab.
 +
#* The ''gmbuilder'' project folder did not show a red ''x'' icon.
  
# [http://www.7-zip.org/ 7-zip] tool for unpacking .gz and .zip files
+
=== Initial Build ===
# [http://www.postgresql.org PostgreSQL] on Windows (version 9.4.x)
+
# I opened the ''gmbuilder'' project by clicking on the gray triangle to the left of its name.
# [https://sourceforge.net/projects/xmlpipedb/files/ GenMAPP Builder]
+
# I right-clicked on the ''build.xml'' listing and chose '''Run As > Ant Build...''' from the menu that appeared.
# Java JDK 1.8 64-bit
+
# In the ''Edit Configuration'' dialog that appeared, I opened the ''Targets'' tab and checked the ''clean'' and ''dist'' items. The ''Target execution order'' section near the bottom of the dialog displayed ''clean, dist''.
# [https://github.com/GenMAPPCS/genmapp GenMAPP 2]
+
# I clicked the ''Run'' button. After 3 seconds of processing, the build was successful.
# [https://sourceforge.net/projects/xmlpipedb/files/ XMLPipeDB match utility] for counting IDs in XML files
+
# When this was done, I right-clicked on the ''gmbuilder'' project folder and chose '''Refresh''' from the menu that appeared.
# Microsoft Access for reading .mdb files
+
# A ''dist'' folder was now present inside the ''gmbuilder'' project folder. This was my personally-built copy of ''GenMAPP Builder''. Its contents correspond to the extracted contents of the ''gmbuilder-3.0.0-build-5.zip'' file that was downloaded in class.
 
+
#*Screenshot of my copy of gmbuilder within the Eclipse working environment:
==Gene Database Creation==
+
#*[[File:Eclipse gmbuilder Ant Build.png]]
===Downloading Data Source Files and GenMAPP Builder===
+
 
+
*I downloaded the UniProt XML, GOA, and GO OBO-XML files for ''Bordetella pertussis'' along with the GenMAPP Builder program.
+
**All files were saved to the folder ''Bklein7_CW'' on my computer's ThawSpace.
+
**Files that required extraction were unzipped using [http://www.7-zip.org/ 7-zip].
+
**Data files that remained in a folder after unzipping were removed from their folders to facilitate organization and command line processing.
+
 
+
====UniProt XML====
+
 
+
* I went to the [http://www.uniprot.org/taxonomy/complete-proteomes UniProt Complete Proteomes] page.
+
**From there, I navigated to the complete proteome download page for [http://www.uniprot.org/proteomes/UP000002676 Bordetella pertussis (strain Tohama I / ATCC BAA-589 / NCTC 13251)].
+
** I clicked on the "Download" button at the top of the page above and selected the following options:
+
***"Download all"
+
***"XML" from the "Format" drop-down menu
+
***"Compressed" format
+
**I extracted the file using [http://www.7-zip.org/ 7-zip].
+
 
+
====GOA====
+
 
+
* UniProt-GOA files can be downloaded from the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/ UniProt-GOA ftp site].
+
*Within the above site, I navigated to the [http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/145.B_pertussis_ATCC_BAA-589.goa GOA file for ''Bordetella pertussis'' strain Tohama I].
+
**This text file was automatically opened by my browser. Therefore, I had to manually download the file.
+
 
+
====GO OBO-XML====
+
 
+
* I downloaded the GO OBO-XML formatted file from the [http://geneontology.org/page/download-ontology#Legacy_Downloads Gene Ontology legacy download page].
+
*I extracted the file using [http://www.7-zip.org/ 7-zip].
+
 
+
====Downloaded GenMAPP Builder====
+
 
+
# I downloaded the GenMAPP Builder zip folder: [https://github.com/lmu-bioinformatics/xmlpipedb/releases/download/untagged-bd04fffc4da853fedf30/gmbuilder-3.0.0-build-5.zip Download gmbuilder-3.0.0-build-5.zip].
+
# I extracted the GenMAPP Builder folder using [http://www.7-zip.org/ 7-zip].
+
 
+
===Creating the New Database in PostgreSQL===
+
 
+
* I launched ''pgAdmin III'' and connected to the PostgreSQL 9.4 server (localhost:5432).
+
** On this server, I created a new database: ''bpertussis_cw20151119_gmb3build5''.
+
** I opened the SQL Editor tab to use an XMLPipeDB query to create the tables in the database.
+
*** I clicked on the Open File icon and selected the file ''gmbuilder.sql''. This imported a series of SQL commands into the editor tab.
+
*** I clicked on the Execute Query icon to run this command.
+
***In viewing the schema for this database, I confirmed that there were 167 tables after running the above command.
+
 
+
===Configuring GenMAPP Builder to Connect to the PostgreSQL Database===
+
 
+
* To begin, I launched gmbuilder.bat.
+
* I selected the "Configure Database" option and entered the following information into the fields below:
+
** Host or address: localhost
+
** Port number: 5432
+
** Database name: bpertussis_cw20151119_gmb3build5
+
** Username: postgres
+
** Password: Welcome1
+
 
+
===Importing Data into the PostgreSQL Database===
+
 
+
*The downloaded data files for ''Bordetella pertussis'' were specified and imported into the database by clicking on the following buttons:
+
** Selected File > Import UniProt XML...
+
** Selected File > Import GO OBO-XML...
+
** Clicked OK to the message asking to process the GO data.
+
** Selected File > Import GOA...
+
 
+
===Exporting a GenMAPP Gene Database (.gdb)===
+
 
+
* I selected File > Export to GenMAPP Gene Database... to begin the export process.
+
* I typed my name in the owner field (Brandon Klein).
+
* I selected "Bordetella pertussis (strain Tohama I/ATCC BAA-589/NCTC 13251), Taxon ID 257313" as the gene database species and then clicked ''Next''.
+
* The database was saved as ''bpertussis-std_cw20151119''.
+
* I checked the boxes for exporting all Molecular Function, Cellular Component, and Biological Process Gene Ontology Terms.
+
* Finally, I clicked the "Next" button to begin the export process.
+
 
+
==Gene Database Testing Report==
+
===Export Information===
+
 
+
Version of GenMAPP Builder: Version 3.0.0 Build 5
+
 
+
Computer on which export was run: Seaver 120- Last computer on the right in the row farthest from the front of the room
+
 
+
Postgres Database name: bpertussis_cw20151119_gmb3build5
+
 
+
UniProt XML filename (give filename and upload and link to compressed file):
+
* UniProt XML version (The version information can be found at [http://uniprot.org/news the UniProt News Page]):
+
* UniProt XML download link:
+
* Time taken to import: 2.60 minutes
+
** Note: The import time was similar to that of ''V. cholerae'' in Week 9 (2.92 minutes). No interruptions occurred during this process.
+
 
+
GO OBO-XML filename (give filename and upload and link to compressed file):
+
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
+
* GO OBO-XML download link:
+
* Time taken to import: 6.99 minutes
+
* Time taken to process: 4.48 minutes
+
** Note: The import and processing times were similar to those for ''V. cholerae'' in Week 9 (6.88 minutes and 4.49 minutes respectively). No interruptions occurred during these processes.
+
 
+
GOA filename (give filename and upload and link to compressed file):
+
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]):
+
* GOA download link:
+
* Time taken to import: 0.04 minutes
+
** Note: The import time was similar to that of ''V. cholerae'' in Week 9 (0.06 minutes). No interruptions occurred during this process.
+
 
+
Name of .gdb file (give filename and upload and link to compressed file):
+
* Time taken to export:
+
** Start time: 4:06 PM
+
** End time: 4:46 PM
+
** Elapsed time: 40 minutes
+
Note: All export windows remained open when I returned to check the export status. No interruptions occurred during the export process.
+
 
+
===TallyEngine===
+
 
+
* I ran the TallyEngine in GenMAPP Builder and specified the following files:
+
**XML- uniprot-proteome%3AUP000002676_cw20151119.xml
+
**GO- go_daily-termdb_cw20151119.obo-xml
+
*Results:
+
**[[File:TallyEngine cw20151119.png]]
+
**All tally results were consistent across both files.
+
=== Using XMLPipeDB match to Validate the XML Results from the TallyEngine===
+
The following functions were performed using the Windows command line (cmd).
+
*I entered my project folder using the following command:
+
cd /d T:\Bklein7_CW
+
*I used XMLPipeDB match to identify matches of any ordered locus name following the pattern "BP####" in the UniProt XML file. The command used was as follows:
+
java -jar xmlpipedb-match-1.1.1.jar "BP[0-9][0-9][0-9][0-9]" < "uniprot-proteome%3AUP000002676_cw20151119.xml"
+
*Match Results:
+
**[[File:XMLPipeDBMatch cw20151119.png]]
+
**The total number of unique matches listed above, 3438, differs from the OrderedLocusNames count of 3435 produced by the TallyEngine. Thus, 3 gene IDs present in the original XML file were not imported into ''bpertussis-std_cw20151119.gdb''.
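
The unique-match count above can be illustrated with a short Python sketch. This is not the XMLPipeDB Match implementation; it only shows the general idea of collecting every match of the "BP####" pattern in a block of text and counting the distinct values. The sample text and the function name <code>count_unique_matches</code> are hypothetical and are not taken from the real UniProt XML file.

```python
import re
from collections import Counter


def count_unique_matches(text, pattern):
    """Return a Counter mapping each distinct match of `pattern` to how often it occurs."""
    return Counter(re.findall(pattern, text))


# Hypothetical miniature stand-in for a UniProt XML file (not real data).
sample_xml = """
<gene><name type="ordered locus">BP0001</name></gene>
<gene><name type="ordered locus">BP0002</name></gene>
<dbReference type="EnsemblBacteria" id="BP0001"/>
"""

counts = count_unique_matches(sample_xml, r"BP[0-9][0-9][0-9][0-9]")
print("unique matches:", len(counts))
print("total matches:", sum(counts.values()))
```

Because the pattern is applied to the whole text, an ID that appears in several places is counted once as a unique match but several times in the total.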
+
 
+
=== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine===
+
I ran a SQL query designed to match the pattern BP####:
+
 
+
select count (*) from genenametype where type = 'ordered locus' and value ~ 'BP[0-9][0-9][0-9][0-9]';
+
 
+
Results:
+
*[[File:SQLQuery cw20151119.png]]
+
* The number of unique matches yielded by this SQL query, 3435, matched the count produced by the TallyEngine. However, this count was also 3 fewer than the count yielded by XMLPipeDB Match, as reported above. This further indicates that not all gene IDs in the original XML file were imported.
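
One way to pin down which IDs account for a discrepancy like this is to compare the two ID lists as sets. The following is a minimal sketch under the assumption that both lists have already been exported to plain Python lists; the IDs shown and the function name <code>missing_ids</code> are hypothetical and do not come from the actual files.

```python
def missing_ids(source_ids, database_ids):
    """Return the sorted IDs that occur in source_ids but not in database_ids."""
    return sorted(set(source_ids) - set(database_ids))


# Hypothetical example lists (not the real B. pertussis IDs).
xml_ids = ["BP0001", "BP0002", "BP0003", "BP0004"]
db_ids = ["BP0001", "BP0003"]

print(missing_ids(xml_ids, db_ids))
```

Running the comparison in both directions would also reveal any IDs that are present in the database but absent from the source file.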
+
 
+
===OriginalRowCounts Comparison===
+
 
+
I opened the gene database file ''bpertussis-std_cw20151119'' in Microsoft Access and examined the "OriginalRowCounts" table to see if the expected tables were listed with the expected number of records. The contents of this table were compared to the ''OriginalRowCounts'' table of an existing .gdb file created during Week 9.
+
+
Benchmark .gdb file: ''Vc-Std_20151027_TR''
+
 
+
"OriginalRowCounts" table from the benchmark and new gdb:
+
*[[File:OriginalRowCountsComparison cw20151119.PNG]]
+
*All 52 tables present in the 2015 ''Vibrio cholerae'' database were also present in the ''B. pertussis'' database tagged _cw20151119. This confirmed that all expected tables were successfully created.
+
*Further, the "OrderedLocusNames" count of 3435 generated by the TallyEngine was confirmed to reflect the actual contents of the database.
+
 
+
Note: The "OriginalRowCounts" tables were too large to screenshot. To circumvent this problem and facilitate the comparison, I copied the "OriginalRowCounts" tables from both gene databases into an Excel file and zoomed out. The above screenshot was taken from this Excel file. The "OrderedLocusNames" row counts are highlighted in yellow.
+
 
+
===Visual Inspection===
+
I visually inspected individual tables within ''bpertussis-std_cw20151119'' using Microsoft Access.
+
*Systems Table
+
**35 gene ID systems were listed, 11 of which listed the appropriate import date (11/19/2015)
+
***All gene ID systems relevant to ''B. pertussis'' were listed. This includes: EMBL, EnsemblBacteria, GeneID, GeneOntology, InterPro, OrderedLocusNames, Pfam, RefSeq, and UniProt.
+
***This result corresponded with that of the benchmark .gdb file listed in the "OriginalRowCounts Comparison" section.
+
*UniProt Table
+
**This table contained 3258 entries with 6 character IDs.
+
**All IDs in the UniProt table conform to the following pattern: [[File:UniProt Ascension Number info.PNG]]
+
**There are no apparent issues with the 3258 entries. However, it is curious that only 3258 out of 3438 IDs identified through XMLPipeDB Match made it into the database. This suggests the need for gmbuilder coding changes.
+
*RefSeq Table
+
**This table contained 6627 entries. All IDs began with one of three prefixes: "NP_", "YP_", or "WP_". The meanings of these prefixes can be found in the RefSeq documentation found [http://www.ncbi.nlm.nih.gov/books/NBK50679/ here].
+
***"NP_" and "YP_" Prefixes
+
****Refer to proteins. There are 3410 "NP_" IDs and 7 "YP_" IDs.
+
***"WP_" Prefixes
+
****Refer to "autonomous non-redundant proteins that are not yet directly annotated on a genome". There were 3210 IDs with the "WP_" prefixes.
+
***Overall, every entry in the ID column was an expected value.
+
*OrderedLocusNames Table
+
**This table contained 3435 entries (consistent with Tally Engine & SQL counts).
+
**The IDs were copied into an Excel document for analysis:
+
***3434 IDs conformed to the pattern "BP####"
+
***1 ID was unique: "BP3167.1"
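
The checks in this section were done by hand in Access and Excel. As a rough sketch of how they could be scripted, the Python below tallies RefSeq IDs by prefix and separates ordered locus names that fully match "BP####" from those that do not. The function names are hypothetical, and every ID in the sample lists is made up except "BP3167.1", which is the exception reported above. The reported prefix counts do sum to the table total (3410 + 7 + 3210 = 6627).

```python
import re
from collections import Counter

# Full-string pattern for ordered locus names of the form BP####.
LOCUS_PATTERN = re.compile(r"BP[0-9]{4}")


def tally_prefixes(ids):
    """Count IDs by their first three characters (e.g. 'NP_', 'YP_', 'WP_')."""
    return Counter(identifier[:3] for identifier in ids)


def split_by_pattern(ids):
    """Split IDs into those that fully match BP#### and those that do not."""
    conforming = [i for i in ids if LOCUS_PATTERN.fullmatch(i)]
    nonconforming = [i for i in ids if not LOCUS_PATTERN.fullmatch(i)]
    return conforming, nonconforming


# Hypothetical sample IDs; only "BP3167.1" is taken from the report above.
refseq_ids = ["NP_000001.1", "YP_000002.1", "WP_000003.1", "WP_000004.1"]
locus_ids = ["BP0001", "BP0002", "BP3167.1"]

print(tally_prefixes(refseq_ids))
conforming, nonconforming = split_by_pattern(locus_ids)
print(len(conforming), "conforming;", "exceptions:", nonconforming)
```

Using <code>fullmatch</code> rather than a substring search matters here: a substring search for "BP####" would also accept "BP3167.1", since that ID contains "BP3167".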
+
 
+
Note:
+
 
+
===.gdb Use in GenMAPP===
+
 
+
<!--Need to add more instructions here.-->
+
 
+
While the above sections perform quality assurance on the exported Gene Database by verifying ID counts, the "proof in the pudding" is to actually use the Gene Database in GenMAPP.  You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP.  In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.
+
 
+
For assistance with using the GenMAPP program, the GenMAPP Help is very extensive.  To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.
+
 
+
Note:
+
 
+
====Putting a gene on the MAPP using the GeneFinder window====
+
 
+
* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window.  Click on the Drafting Board to place the Gene on the MAPP.  Now, right-click on the gene to access the GeneFinder window.  Type or paste a gene ID into the Gene ID field.  Select the appropriate Gene ID system from the drop-down menu and click the Search button.  For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID.  Once the ID has been found, click the OK button to return to the Drafting Board window.
+
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
+
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.
+
 
+
Note:
+
 
+
====Creating an Expression Dataset in the Expression Dataset Manager====
+
 
+
* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset.  Do these represent IDs that were present in the UniProt XML, but were somehow not imported? or were they not present in the UniProt XML?
+
 
+
Note:
+
 
+
====Coloring a MAPP with expression data====
+
 
+
Note:
+
 
+
====Running MAPPFinder====
+
 
+
Note:
+
 
+
=== Compare Gene Database to Outside Resource===
+
 
+
'''''Note:''''' This section applies to the Group Final Project and does not need to be completed for the [[Week 9]] assignment.  ''&mdash; [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:46, 2 November 2015 (PST)''
+
 
+
The OrderedLocusNames IDs in the exported Gene Database are derived from the UniProt XML. It is a good idea to check your list of OrderedLocusNames IDs to see how complete it is using the original source of the data (the sequencing organization, the MOD, etc.)  Because UniProt is a protein database, it does not reference any non-protein genome features such as genes that code for functional RNAs, centromeres, telomeres, etc.
+
 
+
Note:
+
 
+
[[Category:Group Projects]]
+
  
 
==Links==
 

Latest revision as of 08:02, 24 November 2015

Genome Sequencing Paper PowerPoint Presentation

Presentation File (PDF): File:Genomepaper cw20151116.pdf

Quality Assurance Work

I created and tested our first Bordetella pertussis gene database, tagged cw20151119.

Work Log:

  • Thursday, November 19: I followed the import-export process for the creation of our first Bordetella pertussis gene database. My protocol was documented on the Gene Database Testing Report page for this database.
  • Monday, November 23: I accessed the exported database and went through counting protocols to evaluate its content. Results were posted on the Gene Database Testing Report page for this database.

Coder Work

Before proceeding, I designated my personal laptop as my development computer.

GitHub Repository Clone Setup

GitHub Information:

GitHub Clone Setup:

  1. I designated a folder on my Desktop entitled "B. pertussis Project" as the location for my local copy of the XMLPipeDB GitHub repository. To enter this location, I opened Terminal and used the following command:
    cd /Users/brandonklein/Desktop/B.\ pertussis\ Project
  2. I cloned the repository:
    git clone https://github.com/lmu-bioinformatics/xmlpipedb.git
  3. I entered the clone folder:
    cd xmlpipedb
  4. I switched to my branch:
    git checkout b-pertussis

“Developer Rig” Setup and Initial As-Is Build

Necessary software was downloaded:

  • Java developer tools: JDK 8 (which, at this writing, is JDK 8u65)
  • Any tool that can unpack .gz and .zip files: Keka (listed as an equivalent software to 7-zip for Mac OS X)
  • XMLPipeDB Match utility
  • Development environment: Eclipse IDE for Java EE Developers

Eclipse Workspace Setup

  1. I ran Eclipse.
  2. When prompted to specify a Workspace, I selected my "xmlpipedb" repository clone folder and clicked "Open".
  3. I verified that my repository clone folder was listed as the Workspace: and clicked OK.
  4. When presented with the introductory display, I clicked the Workbench button.
    • This took me to an empty developer area:
    • Screen Shot 2015-11-23 at 5.10.28 PM.png

Java Project Setup

  1. I right-clicked within the empty Project Explorer tab and chose New > Project… from the menu that appeared.
  2. I chose Java Project from the list of “wizards” and clicked on the Next > button.
  3. On the next panel, I entered gmbuilder as the Project name:.
    • The JRE section showed Java 1.8, confirming that my version of Java was up to date.
  4. I clicked the Finish button.
    • When asked if I wanted to open the “Java perspective,” I responded with Yes.
  5. The gmbuilder project folder was now visible in the Project Explorer tab.
    • The gmbuilder project folder did not show a red x icon.

Initial Build

  1. I opened the gmbuilder project by clicking on the gray triangle to the left of its name.
  2. I right-clicked on the build.xml listing and chose Run As > Ant Build... from the menu that appeared.
  3. In the Edit Configuration dialog that appeared, I opened the Targets tab and checked the clean and dist items. The Target execution order section near the bottom of the dialog displayed clean, dist.
  4. I clicked the Run button. After 3 seconds of processing, the build was successful.
  5. When this was done, I right-clicked on the gmbuilder project folder and chose Refresh from the menu that appeared.
  6. A dist folder was now present inside the gmbuilder project folder. This was my personally-built copy of GenMAPP Builder. Its contents correspond to the extracted contents of the gmbuilder-3.0.0-build-5.zip file that was downloaded in class.
    • Screenshot of my copy of gmbuilder within the Eclipse working environment:
    • Eclipse gmbuilder Ant Build.png

Links

Assignments Pages

Individual Journal Entries

Shared Journal Entries