LMU BioDB 2015 - User contributions [en]

Jkuroda Week 15

2015-12-18T21:50:12Z

Jkuroda: /* Individual Assessment & Reflection */ finishing reflection

==Log==
* Used some incomplete statistical analysis data from the GenMAPP users to create a new Expression Dataset. This generated an exceptions file, which revealed some issues with our database.
* The first time we ran it, there were exceptions for every single gene, because we did not compensate for the underscore in the ID. After inserting the underscore after the 'SO', we were able to find the actual errors.
* First of all, there are 5408 genes listed in their data, compared to the 4196 genes we have in our database.
* There are 760 gene IDs that are in the form SO_####F, which are genes that don't exist in our database.
* There are 681 gene IDs that are in a 'normal' form (either SO_#### or SO_A####) but do not exist in our database.
* For some of the gene IDs that have 'F's, there are multiple genes of the same ID.
* We attempted a batch search on UniProt for all 1441 missing IDs and got zero results. Furthermore, we did a spot check by searching roughly every 100th ID in UniProtKB and found that none of the IDs we searched exist.
* We also searched for the 'F' IDs in our MOD and none of them exist in that either.
* After this analysis, we have come to the conclusion that these 1441 IDs can be safely ignored, since they do not exist in UniProt. We will simply need to modify our code to account for the absence of an underscore, much like we did with ''Vibrio cholerae''.
* In class on 12/10/15, we worked on figuring out the corrections for our GenMAPP code and made a new dataset using the finished data from the GenMAPP users.
* Then we ran MAPPFinder and ran into an issue because we were using the wrong column.
* Our group met up on 12/12/15 and continued working.
* We were able to run MAPPFinder successfully and generated all of the necessary files for our deliverables.
* As of now, we are waiting to find out how to get a sample MAPP file of a relevant biological pathway.
* We are going to individually work on our presentation slides and go over the entire thing on Monday night.
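
The ID checks described in this log can be sketched roughly as follows. This is a minimal illustration in Python, not the code our group actually ran: the function names (<code>normalize_id</code>, <code>classify_id</code>, <code>summarize_missing</code>) and the toy IDs are hypothetical, and it assumes the dataset IDs arrive without an underscore (for example <code>SO0001</code>) while the database stores them with one.

```python
import re
from collections import Counter

# Assumed ID shapes once the underscore is present (taken from the log above).
NORMAL_ID = re.compile(r"^SO_A?\d{4}$")   # SO_#### or SO_A####
F_SUFFIX_ID = re.compile(r"^SO_\d{4}F$")  # SO_####F


def normalize_id(raw_id):
    """Insert the underscore after 'SO' when the ID lacks one."""
    raw_id = raw_id.strip()
    if raw_id.startswith("SO") and not raw_id.startswith("SO_"):
        return "SO_" + raw_id[2:]
    return raw_id


def classify_id(gene_id):
    """Label an ID as 'normal', 'f_suffix', or 'other'."""
    if NORMAL_ID.match(gene_id):
        return "normal"
    if F_SUFFIX_ID.match(gene_id):
        return "f_suffix"
    return "other"


def summarize_missing(dataset_ids, database_ids):
    """Return the missing IDs, a count of them by form, and repeated IDs."""
    database = set(database_ids)
    normalized = [normalize_id(i) for i in dataset_ids]
    missing = sorted({i for i in normalized if i not in database})
    counts = Counter(classify_id(i) for i in missing)
    duplicates = sorted(i for i, n in Counter(normalized).items() if n > 1)
    return missing, counts, duplicates


if __name__ == "__main__":
    # Invented toy data, only to show the shape of the check.
    database_ids = ["SO_0001", "SO_0002", "SO_A0003"]
    dataset_ids = ["SO0001", "SOA0003", "SO0004", "SO0005F", "SO0005F"]
    missing, counts, duplicates = summarize_missing(dataset_ids, database_ids)
    print(missing)
    print(dict(counts))
    print(duplicates)
```

Run against the real data, the two counts would correspond to the 681 'normal' and 760 'F' IDs reported above, assuming the patterns match the actual ID forms.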

==Individual Assessment & Reflection==
=== Statement of Work ===

* Describe exactly what you did on the project.
We first downloaded the UniProt XML proteome set file (UniProt release 2015_10), the GO association file (GOA Proteome Sets 124), and the GO OBO-XML file (version 2015-11-01) on November 20, 2015. Next, we created a new database in PostgreSQL and ran the SQL code from the sql folder of the latest GenMAPP Builder build, which created 167 empty tables. With the foundation for our database in place, we configured GenMAPP Builder to connect to our PostgreSQL database and used it to import the UniProt XML file, GOA file, and GO OBO-XML file. We were then able to export a GenMAPP gene database, making sure that it also exported all molecular function, cellular component, and biological process gene ontology terms. The export took one hour and 18 minutes.

Inspecting and validating our gene database was a long but significant process. Although we had successfully exported the database, it would mean nothing unless we verified that the data within it was valid and accurate. The first check used the TallyEngine in GenMAPP Builder to record the number of records for UniProt and GO in the XML data and in the PostgreSQL database. The table (image X) we got from running TallyEngine confirmed that the XML and PostgreSQL Ordered Locus counts were both 4196. The next check used the XMLPipeDB Match function to validate the results from the TallyEngine table. Initially, the regex pattern we used only caught 4079 IDs because there were over 100 Ordered Locus names that contained the extra character ‘A’ in their ID. After we accounted for those IDs in the regex pattern, we got a count of 4207, which was 11 more than we were expecting. A quick look at the raw XML file told us that these 11 IDs were not picked up by the TallyEngine because they were missing gene tags; the XMLPipeDB Match function recognized the pattern in another section that TallyEngine did not check. Our next check used an SQL query to validate the PostgreSQL database results from the TallyEngine. Using a regex pattern similar to the one used for XMLPipeDB Match, we got a confirmed count of 4196 IDs. Finally, we made a visual inspection of the gene database itself using Microsoft Access. We checked the UniProt, RefSeq, and OrderedLocusNames tables to make sure all of the IDs were in the correct form and found no discrepancies.

To reach a logical conclusion about the 11 IDs that were missing a gene tag in the XML file, we searched for each ID on the UniProt website and found that they were part of the "STRING" protein-protein interaction database. This meant that we could safely ignore these IDs in our database.
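
The pattern-based counts described in the paragraphs above can be illustrated with a short Python sketch. It is not the XMLPipeDB Match command or the SQL query we ran (those are in the Gene Database Testing Report). The patterns are assumptions inferred from the ID forms mentioned in this journal, and the sample XML fragment is invented for illustration.

```python
import re

# Assumed patterns; the real ones are in the Gene Database Testing Report.
STRICT_PATTERN = re.compile(r"SO_\d{4}")     # misses IDs with the extra 'A'
RELAXED_PATTERN = re.compile(r"SO_A?\d{4}")  # also accepts SO_A####
# Only counts IDs that sit inside an ordered-locus gene name tag.
GENE_TAG_PATTERN = re.compile(
    r'<name type="ordered locus">(SO_A?\d{4})</name>'
)


def count_unique_ids(text, pattern):
    """Count the distinct IDs that the pattern finds in the text."""
    return len(set(pattern.findall(text)))


# Invented fragment: two tagged genes (one repeated) plus one ID that
# appears only as a cross-reference, with no gene tag.
SAMPLE = """
<gene><name type="ordered locus">SO_0001</name></gene>
<gene><name type="ordered locus">SO_A0002</name></gene>
<dbReference type="STRING" id="SO_0003"/>
<gene><name type="ordered locus">SO_0001</name></gene>
"""

if __name__ == "__main__":
    print(count_unique_ids(SAMPLE, STRICT_PATTERN))
    print(count_unique_ids(SAMPLE, RELAXED_PATTERN))
    print(count_unique_ids(SAMPLE, GENE_TAG_PATTERN))
```

In this toy fragment the relaxed pattern finds one more ID than the tag-scoped pattern, which mirrors how a whole-file match could report 4207 IDs while a check limited to gene tags reports 4196.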
* Provide references or links to artifacts of your work, such as:
** Wiki pages
***[[Heavy Metal HaterZ Deliverables]]
***[[Gene Database Testing Report - Heavy Metal HaterZ]]
** Other files or documents
*** All relevant files can be found on our deliverables page.
** Code or scripts
*** The match command and SQL query can be found in our Gene Database Testing Report.

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
** I liked the way our team worked together on this project, mostly because of the fact that we got along well. It was easy to ask for help, and we communicated effectively, which made the entire process much smoother. We had similar ways of working, and it was fun to go through this project with them.
* What worked and what didn't work?
** Overall, it was a solid project; we didn't run into too many issues, and when we did, they were easily overcome. I would say two of the biggest bumps were the "missing" 11 IDs that were found by XMLPipeDB Match and the "extra" ~1400 IDs that were present in the microarray data but not in UniProt. These issues were overcome by consulting our professors and doing some extra research.
* What would you do differently if you could do it all over again?
** I would probably want to work with my group in person more often than we did, since I found that we were much more effective and productive when all four of us were together working on a part of the project.
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*#* I would say the level of quality is above average, because we spent a good amount of time making sure that there were no errors in our project, and where there were, that we had an explanation for why they existed. Our group report reflects the time and effort we poured into our gene database project, and so I would say it is quality work.
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*#* From the very beginning, we made sure to have a place on our wiki where all of our files would be consolidated. This made working in separate areas a breeze, and after we had made some progress, we were able to put our final files into our deliverables page. The wiki template we made for our team is organized and easy to navigate, and we used it on each of our auxiliary pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?
*#* Yes. Completing the project objectives was not a particularly difficult task because our team worked well together. We all knew what roles we had, so there was no problem in getting the tasks finished.

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
*** I learned much more about the inner workings of database management and queries. I also learned quite a bit about the importance of microarray analyses and how they can be used to discover more about an organism like Shewanella oneidensis.
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
*** I learned that a team that gets along well will be much more effective, because it is easier to take and give from one another in the sense that we are comfortable discussing and helping in a work environment. I found that having each team member have a specific role makes splitting up the work much easier, since everyone knows his or her role and responsibility.
** With your hands (technical skills)?
*** Like I mentioned earlier, I learned about SQL and how powerful databases can be when you know what you are doing. I also learned about XML files and how data can be efficiently stored and accessed using that type of formatting.
* What lesson will you take away from this project that you will still use a year from now?
** I will be taking a databases class next year, so I will definitely be using the knowledge I gained in this class regarding database management for that purpose. I will also use my introductory knowledge of wiki editing and formatting to possibly contribute to wikis in the future.

{{Template:Journal Template}}

Heavy Metal HaterZ Deliverables

2015-12-15T06:37:55Z

Jkuroda: /* Group Files and Datasets */ ppt

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

===Gene Database===

* [[Media:So-Std_HMH_20151214.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** [[Media:ReadMe So-Std 20151214.pdf | ''Shewanella oneidensis'' ReadMe]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* [[Media:CompiledRawData ForGenMAPP 20151208 HMH.txt | Data file]] used for import into GenMAPP
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.gex | GenMAPP Expression Dataset file]]
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.EX.txt | Exceptions file]] of data imported into GenMAPP
* Raw MAPPFinder results files:
** [[Media:MAPPFINDER 20151212-Criterion0-GO.txt | criterion 0]]
** [[Media:MAPPFINDER 20151212-Criterion1-GO.txt | criterion 1]]
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.zip | ''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
**[[Media:MAPPFINDER 20151212-Criterion0-GO.xlsx | criterion 0]]
**[[Media:MAPPFINDER 20151212-Criterion1-GO 20151212 HMH.xlsx | criterion 1]]
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* [[Media:Biol Data Final Presentation.pdf | PowerPoint presentation]] given on Tuesday, December 15

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:Biol Data Final Presentation.pdf

2015-12-15T06:36:57Z

Jkuroda: HEAVY METAL HATERZ!

HEAVY METAL HATERZ!

Heavy Metal HaterZ

2015-12-15T03:49:06Z

Jkuroda: /* Week 15 Assignment */ josh update

{{Heavy Metal HaterZ}}
==Week 15 Assignment==
===Goals===
*Finish Deliverables
*Prepare for the presentation
===Status Report===
*Mary - Finished the second customization for the ''S. oneidensis'' profile. I need to push it to GitHub with support from Dr. Dionisio.
*Josh - Uploaded most of our deliverables and completed any of the remaining tasks, e.g. ReadMe to accompany the Gene Database

==Week 14 Assignment==
===Goals===
*Coder/QA
**Analyze the initial exports and make any necessary changes to the custom species profile to capture all of the IDs for your species
*GenMAPP Users
**Finish statistical analysis of compiled microarray data
**Prepare file for GenMAPP
===Status Report===
*Mary - I finished customizing GenMAPP Builder and uploaded it onto this wiki so that Josh could test whether it works, which it does for the most part. A few genes that are not in the MOD but are present elsewhere aren't picked up.
* Josh- Completed the export using the customized GenMAPP Builder from Mary. Checked the .gdb file and everything checked out. Used GenMAPP to see if the gene ID links worked and they did. Made more progress on the Gene Database Testing Report.
*Emily - I worked on manipulating the data to import it into GenMAPP and while Ron and I initially had some problems, I think we worked them out and are ready to continue this coming week.
*Ron - Worked on manipulation of the data along with Emily to prepare data for GenMAPP. Difficulties arose when it came to performing particular calculations and having data match with Emily's, but I think that they have been resolved with the help of Dr. Dahlquist; thus, we are closer to having a GenMAPP ready file.

====Reflections====
#What worked?
#What didn't work?
#What will I do next to fix what didn't work?

*Mary:
*#Dr. Dionisio's instructions were very clear, so I was able to customize GenMAPP Builder and it is working like it should.
*#There are still 11 "lost" genes that may need to be "found" somehow with my code, even though they are not located within the MOD.
*#First I need to determine if those genes are necessary for GenMAPP to catch, and if so, I will need to re-customize the code.

*Josh:
*# Our customized export seemed to work, since we got the 4196 count for Ordered Locus Names for which we were looking.
*#The 11 IDs we found that did not have gene tags in the XML file are a small issue for us. None of them exist in our model organism database, but 8 of them are present in our microarray data.
*#We are waiting on input from Professor Dahlquist regarding our next steps with these 11 IDs. Once we find out, we will act accordingly and possibly edit our code.

*Emily
*#It was very helpful to have Dr. Dahlquist's instructions for manipulating our data, so I was able to calculate the p values and adjust them using the two tests.
*#Ron and I had to do a lot of work late this week to make our data match, but we worked together well and were able to solve the problems we encountered.
*#I will redo the p values and the two adjustment calculations. Then I will get the data ready for GenMAPP.

*Ron
*#Dr. Dahlquist's instructions and feedback helped with ensuring the manipulation of the data was done accurately.
*#It was difficult to calculate the averages from the split data, since the equations wouldn't copy down the entire column due to blank cells within the data. Calculating the p values was also difficult, and the issues we encountered with the data and the equations created a lot of extra work.
*#Hopefully, after all the feedback and instructions Emily and I will be able to do all the necessary calculations and statistical analysis to have a file ready for GenMAPP by the end of this week.
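
The reflections above mention averaging columns that contain blank cells and adjusting p values "using the two tests". This page does not name those tests. The sketch below assumes, purely for illustration, that they are the Bonferroni and Benjamini-Hochberg corrections, which are commonly applied to microarray p values. The function names and sample numbers are hypothetical, and the calculations described above were done in a spreadsheet, not in Python.

```python
def mean_ignoring_blanks(values):
    """Average the numeric entries, skipping None or empty-string cells."""
    numbers = [float(v) for v in values if v not in (None, "")]
    if not numbers:
        return None  # nothing to average in this row
    return sum(numbers) / len(numbers)


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so that each adjusted value is the
    # minimum of p * n / rank over its own rank and every higher rank.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Invented values, only to show how the functions are called.
    print(mean_ignoring_blanks([1.0, "", 3.0, None]))
    p = [0.01, 0.04, 0.03, 0.20]
    print(bonferroni(p))
    print(benjamini_hochberg(p))
```

Skipping blank cells explicitly avoids the copy-down problem described above, because a row with missing replicates is averaged over the values that are present.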

==Week 12 Assignment==
===Goals===
*Coder/QA
**Prepare for journal club presentation
**Perform an initial Gene Database export and Gene Database Testing Report
*GenMAPP Users
**Compile the raw data in preparation for normalization and statistical analysis.
===Status Report===
*Emily: Uploaded and formatted all microarray files for the samples that were repleted with ferrous sulfate.
*Mary: Prepared for the genome paper journal club presentation. I also cloned the code from GitHub onto a computer in the lab, which included downloading Eclipse and Git for Windows on the lab computer.
*Josh: Prepared for genome paper presentation with Mary. Completed the initial import/export cycle and made significant progress on the Gene Database Testing Report.
*Ron: Similar to Emily, downloaded the microarray raw data files, followed the procedure given by Dr. Dahlquist for data processing (I worked with the files related to iron depletion with the iron chelator), and I uploaded the files to the wiki.

==Week 11 Assignment : Journal Club Presentation==
===Presentation Slides===
*These can also be accessed by going to our [[Heavy Metal HaterZ Files | Files]] page.

*[[File:Genome_Paper_Presentation_20151124_HMH.pptx]]
*[[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]

===Goals===
*Prepare for journal club presentations
*Begin initial tasks on your research project
**Coder/QA
***Set up coding/testing environment
***Determine the regular expression for the ordered locus ID for your species
***Identify the appropriate model organism database for your species.
***Perform an initial Gene Database export and Gene Database Testing Report
**GenMAPP Users
***Describe the experimental design of the microarray data, including treatments, number of replicates (biological and/or technical), dye swaps.
***Determine the sample and data relationships, i.e., which files in the data correspond to which samples in the experimental design.
***Compile the raw data in preparation for normalization and statistical analysis.
===Status Report===
*Emily: worked on journal club presentation and created flow chart diagrams for the experimental design
*Mary: Completed journal club presentation slides with Josh. I downloaded Eclipse on my personal laptop, so along with the use of the lab computers my coding/testing environment should be set up. Josh and I determined the regular expression of the ordered locus ID for our species. However, I was not able to perform an initial export yet.
*Ron: Completed journal club presentation slides with Emily and uploaded slides in HMH Files pages. [[Media:SoMicroarrayPaperPresentation 20151117 HMH.pptx | Link to Microarray Paper Presentation here.]] Looked over sample and data relationships file from ArrayExpress entry (E-GEOD-15334) and converted .txt file into .xlsx file. I have not been able to compile raw data with Emily, as we still need clarification on which files are to be used for statistical analysis.
*Josh: Completed the genome paper presentation with Mary and did more research on our organism. Haven't done an initial import/export cycle yet. Planning to complete that later this week.

==Week 10 Assignment : Annotated Bibliography==

===Our Genome Paper===

Heidelberg, J. F., Paulsen, I. T., Nelson, K. E., Gaidos, E. J., Nelson, W. C., Read, T. D., ... & Fraser, C. M. (2002). Genome sequence of the dissimilatory metal ion–reducing bacterium ''Shewanella oneidensis''. ''Nature Biotechnology, 20''(11), 1118-1123. doi:10.1038/nbt749
*The [http://www.ncbi.nlm.nih.gov/pubmed/?term=Genome+sequence+of+the+dissimilatory+metal+ion%E2%80%93reducing+bacterium+Shewanella+oneidensis abstract] from PubMed.
*The full text of the article in PubMedCentral : Not available.
*The [http://www.nature.com/nbt/journal/v20/n11/full/nbt749.html full text] of the article from the publisher web site. 
*The [http://www.nature.com/nbt/journal/v20/n11/pdf/nbt749.pdf full PDF version] of the article from the publisher web site.
*Who owns the rights to the article?
**The Nature Publishing Group, which is the publisher of this article, according to this [https://s100.copyright.com/AppDispatchServlet?publisherName=NPG&publication=Nature%20Biotechnology&title=Genome%20sequence%20of%20the%20dissimilatory%20metal%20ion-reducing%20bacterium%20Shewanella%20oneidensis&author=John%20F.%20Heidelberg,%20Ian%20T.%20Paulsen,%20Karen%20E.%20Nelson,%20Eric%20J.%20Gaidos,%20William%20C.%20Nelson%20et%20al.&contentID=10.1038/nbt749&publicationDate=10/07/2002&volumeNum=20&issueNum=11&numPages=6&pageNumbers=pp1118-1123 site].
*Do the authors own the rights under a Creative Commons license?
**Yes, according to this [http://oaspa.org/member/nature-publishing-group-palgrave-macmillan/ site].
*Is the article available “Open Access”?
**According to [http://oaspa.org/membership/members/ this site], the article is available "Open Access".
*What organization is the publisher of the article? What type of organization is it?
**According to the site above, this publisher is a "Professional OA Publisher (Large)".
*Is this article available in print or online only?
**Online only. It was published online in November, 2002.
*Has LMU paid a subscription or other fee for your access to this article?
**No.
*We performed a search in the ISI Web of Science/Knowledge database by typing in the title "Genome sequence of the dissimilatory metal ion–reducing bacterium Shewanella oneidensis" to the search bar.
**Three articles came up as results. The titles of the first two articles did not exactly match, and each had been cited fewer than 15 times. The third article was the one we were searching for.
*How many articles does this article cite?
**This article has 41 cited references within the Web of Science Core Collection, according to this [https://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=3&SID=3Evs6J6HvCojNOHG6K3&page=1&doc=3 site].
*How many articles cite this article?
**It has been cited 1079 times in all databases, and 426 within the Web of Science Core Collection, according to this [https://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=3&SID=3Evs6J6HvCojNOHG6K3&page=1&doc=3 site].
*Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
**Examples of titles that reference the genome paper:
***Environmental genome shotgun sequencing of the Sargasso Sea
***Deciphering the evolution and metabolism of an anammox bacterium from a community genome
***Genome of Geobacter sulfurreducens: Metal reduction in subsurface environments
***More can be found by clicking this [https://apps.webofknowledge.com/summary.do?product=WOS&parentProduct=UA&search_mode=CitingArticles&qid=8&SID=3Evs6J6HvCojNOHG6K3&page=1&action=sort&sortBy=LC.D;PY.D;AU.A.en;SO.A.en;VL.D;PG.A&showFirstPage=1 link].
**These papers include studies within the species, genome sequencing of other species, and work on the metabolic versatility of microorganisms and metal ion reduction in environments. This shows that a sequenced genome can aid experiments of many kinds.

===Our Microarray Paper===
*Dataset can be found at this [https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15334/?keywords=&organism=Shewanella+oneidensis&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array= link].


====E-GEOD-15334: Yang et al. (2009)====

This paper is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 09:41, 10 November 2015 (PST)''

Yang, Y., Harris, D. P., Luo, F., Xiong, W., Joachimiak, M., Wu, L., ... & Zhou, J. (2009). Snapshot of iron response in ''Shewanella oneidensis'' by gene network reconstruction. ''BMC Genomics, 10''(1), 131.
*The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed/?term=Yang%2C+Y.%2C+Harris%2C+D.+P.%2C+Luo%2C+F.%2C+Xiong%2C+W.%2C+Joachimiak%2C+M.%2C+Wu%2C+L.%2C+...+%26+Zhou%2C+J.+%282009%29.+Snapshot+of+iron+response+in+Shewanella+oneidensis+by+gene+network+reconstruction.+BMC+genomics%2C+10%281%29%2C+131. PubMed].
*The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2667191/ PubMedCentral]
*The link to the full text of the article (HTML format) from the publisher [http://www.biomedcentral.com/1471-2164/10/131 web site].
*The link to the full [http://www.biomedcentral.com/content/pdf/1471-2164-10-131.pdf PDF] version of the article from the publisher web site.
*Who owns the rights to the article?
**The article is Open Access and the authors own the rights under a Creative Commons license.
*What organization is the publisher of the article? What type of organization is it?
**''BMC Genomics'' is the journal; its publisher is BioMed Central, which is a commercial open access publisher rather than a scientific society.
*Is this article available in print or online only?
**It is online only.
*Has LMU paid a subscription or other fee for your access to this article?
**No.
*How many articles does this article cite?
**This paper cites 48 other articles.
*How many articles cite this article?
**3
***Roles of UndA and MtrC of ''Shewanella putrefaciens'' W3-18-1 in iron reduction
***Global transcriptional response of ''Caulobacter crescentus'' to iron availability
***Molecular ecological network analysis
*Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
**This article has mostly been used to look at the iron response of other strains or organisms. It may have been used for comparison's sake or to modify the original methodology to fit the new experiment.
*Link to microarray data
**Found it on [https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15334/ ArrayExpress]
**This contains the raw data that we will use for our research
*What experiment was performed? What was the "treatment" and what was the "control" in the experiment?
**Strains of ''Shewanella oneidensis'' were put under iron depletion and repletion conditions. The control would be a regular strain of the organism, while the treatments would be either increasing or decreasing the iron levels.
*Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each?
**Four biological replicates of each treatment condition were performed.

Jkuroda Week 15

2015-12-13T00:41:57Z

Jkuroda: /* Log */ 12/12/15

Heavy Metal HaterZ Deliverables

2015-12-13T00:38:41Z

Jkuroda: /* Group Files and Datasets */ adding files

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** ''Shewanella oneidensis'' ReadMe
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* [[Media:CompiledRawData ForGenMAPP 20151208 HMH.txt | Data file]] used for import into GenMAPP
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.gex | GenMAPP Expression Dataset file]]
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.EX.txt | Exceptions file]] of data imported into GenMAPP
* Raw MAPPFinder results files:
** [[Media:MAPPFINDER 20151212-Criterion0-GO.txt | criterion 0]]
** [[Media:MAPPFINDER 20151212-Criterion1-GO.txt | criterion 1]]
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.zip | ''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
**[[Media:MAPPFINDER 20151212-Criterion0-GO.xlsx | criterion 0]]
**[[Media:MAPPFINDER 20151212-Criterion1-GO 20151212 HMH.xlsx | criterion 1]]
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:MAPPFINDER 20151212-Criterion0-GO.xlsx

2015-12-13T00:36:26Z

Jkuroda:

File:CompiledRawData ForGenMAPP 20151210 HMH.zip

2015-12-13T00:35:26Z

Jkuroda: gmf

gmf

File:MAPPFINDER 20151212-Criterion1-GO.txt

2015-12-13T00:34:03Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

File:MAPPFINDER 20151212-Criterion0-GO.txt

2015-12-13T00:33:16Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

File:CompiledRawData ForGenMAPP 20151210 HMH.gex

2015-12-13T00:32:08Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

Heavy Metal HaterZ Deliverables

2015-12-13T00:30:56Z

Jkuroda: /* Group Files and Datasets */ GenMAPP Expression Dataset file

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** ''Shewanella oneidensis'' ReadMe
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* [[Media:CompiledRawData ForGenMAPP 20151208 HMH.txt | Data file]] used for import into GenMAPP
* [[Media:CompiledRawData ForGenMAPP 20151210 HMH.EX.txt | GenMAPP Expression Dataset file]]
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:CompiledRawData ForGenMAPP 20151210 HMH.EX.txt

2015-12-13T00:30:18Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

Heavy Metal HaterZ Deliverables

2015-12-13T00:29:33Z

Jkuroda: /* Group Files and Datasets */

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** ''Shewanella oneidensis'' ReadMe
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* [[Media:CompiledRawData ForGenMAPP 20151208 HMH.txt | Data file]] used for import into GenMAPP
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

Heavy Metal HaterZ Deliverables

2015-12-13T00:29:12Z

Jkuroda: /* Group Files and Datasets */ import file for GenMAPP

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** ''Shewanella oneidensis'' ReadMe
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* [[CompiledRawData ForGenMAPP 20151208 HMH.txt | Data file]] used for import into GenMAPP
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:CompiledRawData ForGenMAPP 20151208 HMH.txt

2015-12-13T00:28:36Z

Jkuroda: Jkuroda uploaded a new version of File:CompiledRawData ForGenMAPP 20151208 HMH.txt

Heavy Metal HaterZ Deliverables

2015-12-13T00:25:40Z

Jkuroda: /* Group Files and Datasets */ minor

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** Shewanella oneidensis ReadMe
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

Heavy Metal HaterZ Deliverables

2015-12-13T00:25:14Z

Jkuroda: /* Group Files and Datasets */ adding GDTR pdf

{{Template:Heavy Metal HaterZ}}

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* [[Media:Gene Database Testing Report - Heavy Metal HaterZ.pdf | Gene Database Testing Report]] for final submitted Gene Database
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:Gene Database Testing Report - Heavy Metal HaterZ.pdf

2015-12-13T00:23:10Z

Jkuroda: Heavy Metal HaterZ

Heavy Metal HaterZ

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-13T00:16:05Z

Jkuroda: /* Running MAPPFinder */ adding relevant files

{{Heavy Metal HaterZ}}
==Export Information==

Version of GenMAPP Builder: '''3 build 5'''

Computer on which export was run: '''HP Compaq 8300 Elite SFF FC'''

Postgres Database name: '''S. Oneidensis'''

UniProt XML filename: [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']]
* UniProt XML version: '''UniProt release 2015_10 - October 14, 2015'''
* [http://www.uniprot.org/uniprot/?query=taxonomy:211586 UniProt XML download link]
* Time taken to import: '''3.18 minutes'''
** Note: ''n/a''

GO OBO-XML filename: [[Media:Go daily-termdb.obo-xml.gz | '''go daily-termdb.obo-xml''']]
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
* [http://geneontology.org/page/download-ontology#Legacy_Downloads GO OBO-XML download link]
* Time taken to import: '''7.16 minutes'''
* Time taken to process: '''4.27 minutes'''
** Note: ''n/a''

GOA filename: [[Media:ShewanellaOneidensisGOA.zip | '''ShewanellaOneidensisGOA''']]
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''October 14, 2015 - GOA Proteome Sets 124'''
* [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/106.S_oneidensis.goa GOA download link]
* Time taken to import: '''0.05 minutes'''
** Note: ''n/a''

Name of .gdb file: [[Media:So-Std 20151119HMH.zip | '''So-Std 20151119HMH.gdb''']]
* Time taken to export: '''1 hour and 18 minutes'''
** Start time: '''3:48pm'''
** End time: '''5:06pm'''
** Note: ''n/a''

==TallyEngine==

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
** '''4196''' IDs
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

[[Image:TallyEngineSOneidensis.PNG | center | 540px]]

== Using XMLPipeDB match to Validate the XML Results from the TallyEngine==

[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]

Are your results the same as you got for the TallyEngine? Why or why not?
* Initially, we got '''4196''' IDs for both XML and Postgres DB from TallyEngine but got '''4079''' IDs by using XMLPipeDB match. This result was from using the following command:
 java -jar xmlpipedb-match-1.1.1.jar "SO_[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT

* After checking the .gdb file and looking through the Gene IDs, I found that some IDs were in the form "SO_A####" so I ran a new command accounting for this:
 java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* This gave a total number of '''4207''' IDs.
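
The effect of the optional <code>A</code> in the pattern can be sketched in Python. This is only an illustration: the sample text and the <code>count_unique</code> helper are hypothetical, and it assumes the count of interest is the number of unique matches rather than total matches.

```python
import re

# Hypothetical sample standing in for the text of the UniProt XML file.
sample = """
<name type="ordered locus">SO_0001</name>
<name type="ordered locus">SO_0002</name>
<name type="ordered locus">SO_A0003</name>
<name type="ordered locus">SO_0002</name>
"""


def count_unique(pattern, text):
    """Count the distinct substrings of text that match pattern."""
    return len(set(re.findall(pattern, text)))


# The narrow pattern cannot match SO_A0003, because "SO_" must be
# followed directly by four digits.
narrow = count_unique(r"SO_[0-9][0-9][0-9][0-9]", sample)

# "A?" makes the letter optional, so both ID forms are counted.
wide = count_unique(r"SO_A?[0-9][0-9][0-9][0-9]", sample)

print(narrow, wide)
```

With the sample above, the wider pattern picks up the one additional <code>SO_A####</code> ID that the narrow pattern skips.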

== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==

For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query that counts the values in this table that were marked as ''ordered locus'' in the XML file and that match the pattern ''SO_[0-9][0-9][0-9][0-9]'' would look like this:

 select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_[0-9][0-9][0-9][0-9]';
* However, once I found that some IDs were in the form "SO_A####" I tweaked the pattern to account for those IDs:
 select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';

In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.

Are your results the same as reported by the TallyEngine? Why or why not?
* Initially, we got a count of '''4068''' IDs using SQL, which differed from the '''4196''' IDs from TallyEngine.
* After tweaking the pattern to account for IDs with that extra ''A'', we got a grand total of '''4196''' IDs, which matches with what TallyEngine gave us.
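
The two counting queries can be tried out without a PostgreSQL server using the sketch below. It is an approximation under stated assumptions: an in-memory SQLite database stands in for PostgreSQL, a registered <code>REGEXP</code> function stands in for the <code>~</code> operator, and the rows are hypothetical sample data, not our real table contents.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite has no built-in regex operator. "value REGEXP pattern" calls a
# user function named regexp(pattern, value), so register one here.
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)

# Hypothetical miniature version of the genenametype table.
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "SO_0001"),
        ("ordered locus", "SO_0002"),
        ("ordered locus", "SO_A0003"),
        ("primary", "dnaA"),
    ],
)


def count_ids(pattern):
    """Count ordered locus names whose value matches the regex pattern."""
    query = (
        "SELECT count(*) FROM genenametype "
        "WHERE type = 'ordered locus' AND value REGEXP ?"
    )
    return conn.execute(query, (pattern,)).fetchone()[0]


narrow_count = count_ids("SO_[0-9][0-9][0-9][0-9]")
wide_count = count_ids("SO_A?[0-9][0-9][0-9][0-9]")
print(narrow_count, wide_count)
```

As with the real database, the narrow pattern undercounts by exactly the number of <code>SO_A####</code> rows.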

==OriginalRowCounts Comparison==

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download 2010 benchmark file]

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

[[Media:OriginalRowCounts.pdf | Original Row Counts Table]]

[[Media:OriginalRowCounts2010.pdf | 2010 Benchmark Original Row Counts Table]]

See Analysis section for more on the comparison and the discrepancy found because of this comparison.

Note: Using Microsoft Access, we found ''7664'' IDs, which was actually double the number of IDs present because of duplicated IDs that did not have an underscore.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** No. For the current version, a good number of gene ID systems in the database do not have a value for the date field. Some systems that lack a date include: GenBank, UniGene, WormBase, and EcoGene.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** For the UniProt table, the IDs start with either ''Q8'' or ''K4'' and have some string of four letters/numbers trailing that. For the RefSeq tables, the IDs have forms that start with either ''NP_'' or ''WP_'', with the ''WP_'' forms having 9 numbers afterwards and the ''NP_'' forms having 6 numbers afterwards. For the OrderedLocusNames table, the IDs either start with <code>SO_</code> or <code>SO_A</code>.

Note: ''n/a''

==Analysis==

Consolidating the counts of gene IDs from the various methods, I got:

* 4196 IDs from Tally Engine
* ''4207'' IDs from xmlpipedb-match
 java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* 4196 IDs from PostgreSQL
 select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';
* 4196 IDs from Microsoft Access (Noted that there were 4068 IDs in the form <code>SO_####</code> and 128 IDs in the form <code>SO_A####</code>)

Notice that there is a small but significant discrepancy: xmlpipedb-match reports eleven more IDs than the other methods. This is troubling because the other three methods all confirm a total count of 4196. So, I used Microsoft Excel to compare the list of gene IDs from the actual .gdb file with the list I got back from xmlpipedb-match. As you can see in [[Media:Comparing gdb and xmlpipedb-match.xlsx | this document]], there are 11 IDs in the xmlpipedb-match column that are not found in the gdb column. I confirmed the discrepancy with <code>MATCH</code> functions that show where an ID is missing from either list. Below are the two match functions I used in the document:
 =MATCH(A2, B$2:B$4208, 0)
 =MATCH(B2, A$2:A$4208, 0)
Below are the eleven IDs in question:
SO_3699 NO-KD
SO_1312 NO-KD
SO_4269 NO-KD if they are all part of "Protein-protein interaction databases", then you can safely leave them out
SO_2875 NO-MA
SO_4532 NO-MA
SO_4580 NO-MA
SO_2662 NO-MA
SO_4423 NO-MA
SO_3156 NO-MA
SO_2967 NO-MA
SO_2024 NO-MA
I looked these up on www.uniprot.org, and all of them were found only in "Protein-protein interaction databases".
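
The same two-way comparison that the <code>MATCH</code> functions perform in Excel can be sketched as set differences in Python. The ID lists here are short hypothetical stand-ins for the two spreadsheet columns, not our real data.

```python
# Hypothetical stand-ins for the two spreadsheet columns.
gdb_ids = ["SO_0001", "SO_0002", "SO_A0003"]
match_ids = ["SO_0001", "SO_0002", "SO_A0003", "SO_9999"]

# IDs reported by xmlpipedb-match that are absent from the .gdb file.
only_in_match = sorted(set(match_ids) - set(gdb_ids))

# IDs in the .gdb file that xmlpipedb-match did not report.
only_in_gdb = sorted(set(gdb_ids) - set(match_ids))

print(only_in_match)
print(only_in_gdb)
```

A non-empty first list together with an empty second list corresponds to the situation described above, where the extra IDs appear only in the xmlpipedb-match column.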

Look up the IDs at [http://www.uniprot.org UniProt web site] and then search for them on the UniProt record web page. If they are part of the "STRING" protein-protein interaction database, then you can safely leave them out. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:55, 3 December 2015 (PST)''

None of these IDs are in our MOD.
We searched for these IDs in the microarray statistical analysis sheet and did not find the following IDs:
SO_4269, SO_2875, SO_4580

As of 12/01/15, we are waiting on input from Professor Dahlquist to see if we will adjust our GenMAPP Builder to account for these 11 IDs.
* A manual inspection was done on the [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']] XML file and it looks like these 11 IDs are contained within entries that are missing a gene tag, which explains why the other methods only picked up 4196 IDs.

==.gdb Use in GenMAPP==

While the above sections perform quality assurance on the exported Gene Database via verifying ID counts, the "proof in the pudding" is to actually use the Gene Database in GenMAPP. You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP. In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.

For assistance with using the GenMAPP program, the GenMAPP Help is very extensive. To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.

Note: ''n/a''

===Putting a gene on the MAPP using the GeneFinder window===

* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window. Click on the Drafting Board to place the Gene on the MAPP. Now, right-click on the gene to access the GeneFinder window. Type or paste a gene ID into the Gene ID field. Select the appropriate Gene ID system from the drop-down menu and click the Search button. For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID. Once the ID has been found, click the OK button to return to the Drafting Board window.
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.

Note: I tried out the search for a gene ID and was able to bring up the Backpage for that ID. The cross-referenced IDs that were supposed to show up were indeed on the page.

===Creating an Expression Dataset in the Expression Dataset Manager===

* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML but were somehow not imported? Or were they not present in the UniProt XML?

Note: The Expression Dataset Manager reported that there were 1441 errors during the conversion. From looking over the error codes, I found that these genes were the ones we expected to ignore, like the IDs with an added 'F.'
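
Sorting the exception IDs by form can be sketched as follows. This is an illustration only: the <code>classify</code> helper and the sample IDs are hypothetical, and the sketch assumes the IDs have already been extracted from the exceptions file into a list.

```python
import re
from collections import Counter


def classify(gene_id):
    """Label an ID by its form: F-suffixed, normal, or anything else."""
    if re.fullmatch(r"SO_[0-9]{4}F", gene_id):
        return "F-suffixed"
    if re.fullmatch(r"SO_A?[0-9]{4}", gene_id):
        return "normal form"
    return "other"


# Hypothetical IDs standing in for the ID column of the exceptions file.
exception_ids = ["SO_0005F", "SO_0010", "SO_A0100", "SO_0005F", "SO0007"]

tally = Counter(classify(gene_id) for gene_id in exception_ids)
print(tally)
```

A tally like this makes it easy to check that every exception falls into a category we expected to ignore, and to spot stray forms such as an ID that is missing its underscore.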

===Coloring a MAPP with expression data===

Note: I was able to successfully color the MAPP by coloring the increased and decreased Log Fold Changes.

===Running MAPPFinder===

Note: After the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I browsed through the tree to see the results.
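
The highlighting rule described above (at least 3 genes measured and a p value below 0.05) can be written as a simple filter. The term records and field names below are hypothetical and are not the column names of the MAPPFinder results file.

```python
# Hypothetical GO term records; field names are illustrative only.
go_terms = [
    {"name": "term A", "genes_measured": 12, "p_value": 0.003},
    {"name": "term B", "genes_measured": 2, "p_value": 0.010},
    {"name": "term C", "genes_measured": 8, "p_value": 0.200},
    {"name": "term D", "genes_measured": 3, "p_value": 0.049},
]


def is_highlighted(term, min_genes=3, max_p=0.05):
    """Apply the rule: enough genes measured and a p value under the cutoff."""
    return term["genes_measured"] >= min_genes and term["p_value"] < max_p


highlighted = [term["name"] for term in go_terms if is_highlighted(term)]
print(highlighted)
```

Term B is dropped despite its low p value because too few genes were measured, and term C is dropped because its p value is above the cutoff.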

Documents produced from this run-through can be found here: [[Media:MAPPFinder Relevant Files.zip | Gene Database Testing docs]]

File:MAPPFinder Relevant Files.zip

2015-12-13T00:15:28Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

Heavy Metal HaterZ Files

2015-12-12T23:51:32Z

Jkuroda: /* MAPPFinder Documents */ added file

==All Files==
*All files will be listed here.
*Appropriate way to title files:
**FileName_YYYYMMDD_HMH

===Initial Flow Chart===
*Initial flow chart for experimental design - [[File:Experimental Design Flow Chart 20151115 HMH.pptx]]

===Journal Club Presentation Power Points===
*Microarray Paper Presentation - [[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]
*Genome Paper Presentation - [[File:GenomePPT_20151123_HMH.pdf]]

===Data Processing Notes from Dr. Dahlquist===
*Page 1 of Notes - [[Media:DrDDataProcessNotes1 20151119 HMH.JPG]]
*Page 2 of Notes - [[Media:DrDDataProcessNotes2 20151119 HMH.JPG]]

===GenMapp Builder===
*[[File:ShewanellaOneidensisGMBuilder_20151201_HMH.zip]]

===Statistical Analysis Excel Sheets===
*Prior to Splitting:
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData 20151206 HMH.xlsx]]
* After splitting, use this one:
*# [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx]]
*After splitting, averages were taken and t-tests were run on the data sets:
*#[[File:StatisticalAnalysis Shewanella RARL 20151207 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 ES HMH forsplitting.xlsx]]
*#* [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx]]
*#* I made corrections to Emily's file because there are 5408 genes, not 5408. I think that both files have the same results now and you can move on to the next step. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:22, 9 December 2015 (PST)''
*Ready for GenMAPP
**[[File:UpdatedCompiledRawData WithGenMAPP 20151210 ES HMH.xlsx]]
*.txt file for GenMAPP
**[[File:CompiledRawData ForGenMAPP 20151208 HMH.txt]]
* Ranked List from MAPPFinder: [[File:Ranked list from MAPPFinder.PNG]]

===Sanity Check Table===
*[[File:Sanity Check Chart 20151212 HMH.xlsx]]

===MAPPFinder Documents===
*gdb file - [[File:So-Std 20151201special.zip]]
*[[File:ColorSetforExpressionData F60C60 20151210 HMH.gex]]
*GenMAPP with all comparisons - [[File:FilesForComparison AllTrials WithGenMAPP 20151212 ES HMH.gex]]
*Filtered GO terms for increased - [[File:MAPPFINDER 20151212-Criterion0-GO 20151212 HMH.xlsx]]
*Filtered GO terms for decreased - [[File:MAPPFINDER 20151212-Criterion1-GO 20151212 HMH.xlsx]]
*Screenshots showing significant results - [[File:CompiledScreenShots 20151212 HMH.docx]]

[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:MAPPFINDER 20151212-Criterion1-GO 20151212 HMH.xlsx

2015-12-12T23:50:52Z

Jkuroda:

Heavy Metal HaterZ Files

2015-12-12T23:25:10Z

Jkuroda: /* MAPPFinder Documents */ added gdb

==All Files==
*All files will be listed here.
*Appropriate way to title files:
**FileName_YYYYMMDD_HMH

===Initial Flow Chart===
*Initial flow chart for experimental design - [[File:Experimental Design Flow Chart 20151115 HMH.pptx]]

===Journal Club Presentation Power Points===
*Microarray Paper Presentation - [[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]
*Genome Paper Presentation - [[File:GenomePPT_20151123_HMH.pdf]]

===Data Processing Notes from Dr. Dahlquist===
*Page 1 of Notes - [[Media:DrDDataProcessNotes1 20151119 HMH.JPG]]
*Page 2 of Notes - [[Media:DrDDataProcessNotes2 20151119 HMH.JPG]]

===GenMapp Builder===
*[[File:ShewanellaOneidensisGMBuilder_20151201_HMH.zip]]

===Statistical Analysis Excel Sheets===
*Prior to Splitting:
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData 20151206 HMH.xlsx]]
* After splitting, use this one:
*# [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx]]
*After splitting, averages were taken and t-tests were run on the data sets:
*#[[File:StatisticalAnalysis Shewanella RARL 20151207 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 ES HMH forsplitting.xlsx]]
*#* [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx]]
*#* I made corrections to Emily's file because there are 5408 genes, not 5408. I think that both files have the same results now and you can move on to the next step. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:22, 9 December 2015 (PST)''
*Ready for GenMAPP
**[[File:UpdatedCompiledRawData WithGenMAPP 20151210 ES HMH.xlsx]]
*.txt file for GenMAPP
**[[File:CompiledRawData ForGenMAPP 20151208 HMH.txt]]
* Ranked List from MAPPFinder: [[File:Ranked list from MAPPFinder.PNG]]

===Sanity Check Table===
*[[File:Sanity Check Chart 20151212 HMH.xlsx]]

===MAPPFinder Documents===
*gdb file - [[File:So-Std 20151201special.zip]]
*[[File:ColorSetforExpressionData F60C60 20151210 HMH.gex]]
*Filtered GO terms - [[File:MAPPFINDER 20151212-Criterion0-GO 20151212 HMH.xlsx]]
*Screenshots showing significant results - [[File:CompiledScreenShots 20151212 HMH.docx]]

[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

Heavy Metal HaterZ Files

2015-12-12T23:20:31Z

Jkuroda: /* MAPPFinder Stuff */ added file

==All Files==
*All files will be listed here.
*Appropriate way to title files:
**FileName_YYYYMMDD_HMH

===Initial Flow Chart===
*Initial flow chart for experimental design - [[File:Experimental Design Flow Chart 20151115 HMH.pptx]]

===Journal Club Presentation Power Points===
*Microarray Paper Presentation - [[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]
*Genome Paper Presentation - [[File:GenomePPT_20151123_HMH.pdf]]

===Data Processing Notes from Dr. Dahlquist===
*Page 1 of Notes - [[Media:DrDDataProcessNotes1 20151119 HMH.JPG]]
*Page 2 of Notes - [[Media:DrDDataProcessNotes2 20151119 HMH.JPG]]

===GenMapp Builder===
*[[File:ShewanellaOneidensisGMBuilder_20151201_HMH.zip]]

===Statistical Analysis Excel Sheets===
*Prior to Splitting:
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData 20151206 HMH.xlsx]]
* After splitting, use this one:
*# [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx]]
*After splitting, averages were taken and t-tests were run on the data sets:
*#[[File:StatisticalAnalysis Shewanella RARL 20151207 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 ES HMH forsplitting.xlsx]]
*#* [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx]]
*#* I made corrections to Emily's file because there are 5408 genes, not 5408. I think that both files have the same results now and you can move on to the next step. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:22, 9 December 2015 (PST)''
*Ready for GenMAPP
**[[File:UpdatedCompiledRawData WithGenMAPP 20151210 ES HMH.xlsx]]
*.txt file for GenMAPP
**[[File:CompiledRawData ForGenMAPP 20151208 HMH.txt]]
* Ranked List from MAPPFinder: [[File:Ranked list from MAPPFinder.PNG]]

===Sanity Check Table===
*[[File:Sanity Check Chart 20151212 HMH.xlsx]]

===MAPPFinder Documents===
*[[File:ColorSetforExpressionData F60C60 20151210 HMH.gex]]
*Filtered GO terms - [[File:MAPPFINDER 20151212-Criterion0-GO 20151212 HMH.xlsx]]
*Screenshots showing significant results - [[File:CompiledScreenShots 20151212 HMH.docx]]

[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:ColorSetforExpressionData F60C60 20151210 HMH.gex

2015-12-12T23:20:06Z

Jkuroda:

Heavy Metal HaterZ Files

2015-12-12T23:04:41Z

Jkuroda: added files

==All Files==
*All files will be listed here.
*Appropriate way to title files:
**FileName_YYYYMMDD_HMH

===Initial Flow Chart===
*Initial flow chart for experimental design - [[File:Experimental Design Flow Chart 20151115 HMH.pptx]]

===Journal Club Presentation Power Points===
*Microarray Paper Presentation - [[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]
*Genome Paper Presentation - [[File:GenomePPT_20151123_HMH.pdf]]

===Data Processing Notes from Dr. Dahlquist===
*Page 1 of Notes - [[Media:DrDDataProcessNotes1 20151119 HMH.JPG]]
*Page 2 of Notes - [[Media:DrDDataProcessNotes2 20151119 HMH.JPG]]

===GenMapp Builder===
*[[File:ShewanellaOneidensisGMBuilder_20151201_HMH.zip]]

===Statistical Analysis Excel Sheets===
*Prior to Splitting:
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData 20151206 HMH.xlsx]]
* After splitting, use this one:
*# [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx]]
*After splitting, averages were taken and t-tests were run on the data sets:
*#[[File:StatisticalAnalysis Shewanella RARL 20151207 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 ES HMH forsplitting.xlsx]]
*#* [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx]]
*#* I made corrections to Emily's file because there are 5408 genes, not 5408. I think that both files have the same results now and you can move on to the next step. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:22, 9 December 2015 (PST)''
*Ready for GenMAPP
**[[File:UpdatedCompiledRawData WithGenMAPP 20151210 ES HMH.xlsx]]
*.txt file for GenMAPP
**[[File:CompiledRawData ForGenMAPP 20151208 HMH.txt]]
* Ranked List from MAPPFinder: [[File:Ranked list from MAPPFinder.PNG]]

===Sanity Check Table===
*[[File:Sanity Check Chart 20151212 HMH.xlsx]]

===MAPPFinder Stuff===
*Filtered GO terms - [[File:MAPPFINDER 20151212-Criterion0-GO 20151212 HMH.xlsx]]
*Screenshots showing significant results - [[File:CompiledScreenShots 20151212 HMH.docx]]

[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:CompiledScreenShots 20151212 HMH.docx

2015-12-12T23:04:19Z

Jkuroda:

File:MAPPFINDER 20151212-Criterion0-GO 20151212 HMH.xlsx

2015-12-12T23:00:23Z

Jkuroda:

Heavy Metal HaterZ Files

2015-12-12T22:29:36Z

Jkuroda: ranked list upload image

==All Files==
*All files will be listed here.
*Appropriate way to title files:
**FileName_YYYYMMDD_HMH

===Initial Flow Chart===
*Initial flow chart for experimental design - [[File:Experimental Design Flow Chart 20151115 HMH.pptx]]

===Journal Club Presentation Power Points===
*Microarray Paper Presentation - [[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]
*Genome Paper Presentation - [[File:GenomePPT_20151123_HMH.pdf]]

===Data Processing Notes from Dr. Dahlquist===
*Page 1 of Notes - [[Media:DrDDataProcessNotes1 20151119 HMH.JPG]]
*Page 2 of Notes - [[Media:DrDDataProcessNotes2 20151119 HMH.JPG]]

===GenMapp Builder===
*[[File:ShewanellaOneidensisGMBuilder_20151201_HMH.zip]]

===Statistical Analysis Excel Sheets===
*Prior to Splitting:
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData 20151206 HMH.xlsx]]
* After splitting, use this one:
*# [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_HMH_forsplitting.xlsx]]
*After splitting, averages were taken and t-tests were run on the data sets:
*#[[File:StatisticalAnalysis Shewanella RARL 20151207 HMH.xlsx]]
*#[[File:UpdatedCompiledRawData Shewanella RARL 20151201 ES HMH forsplitting.xlsx]]
*#* [[Media:UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx | UpdatedCompiledRawData_Shewanella_RARL_20151201_ES_HMH_forsplitting_KD.xlsx]]
*#* I made corrections to Emily's file because there are 5408 genes, not 5408. I think that both files have the same results now and you can move on to the next step. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 15:22, 9 December 2015 (PST)''
*Ready for GenMAPP
**[[File:UpdatedCompiledRawData WithGenMAPP 20151210 ES HMH.xlsx]]
*.txt file for GenMAPP
**[[File:CompiledRawData ForGenMAPP 20151208 HMH.txt]]
* Ranked List from MAPPFinder: [[File:Ranked list from MAPPFinder.PNG]]

===Sanity Check Table===

[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

File:Ranked list from MAPPFinder.PNG

2015-12-12T22:28:32Z

Jkuroda: HEAVY METAL HATERZ

HEAVY METAL HATERZ

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-12T22:18:42Z

Jkuroda: /* Creating an Expression Dataset in the Expression Dataset Manager */ expression dataset

Jkuroda Week 15

2015-12-12T22:13:53Z

Jkuroda: 12/10/2015

Jkuroda Week 15

2015-12-08T23:59:46Z

Jkuroda: /* Log */ log from class 12/08/15

Jkuroda Week 15

2015-12-08T22:40:43Z

Jkuroda: log

==Log==
==Individual Assessment & Reflection==
=== Statement of Work ===

* Describe exactly what you did on the project.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Template:Journal Template}}

Template:Heavy Metal HaterZ

2015-12-03T23:46:24Z

Jkuroda: /* Shewanella oneidensis */ GDTR

[[Image:HeavyMetal.jpg | 200px | right]]
==''Shewanella oneidensis''==
'''[[Gene Database Testing Report - Heavy Metal HaterZ]]'''

==Group Members==
*Coder:[[User:Malverso | Mary Alverson]]
*GenMAPP User & Project Manager:[[User:Rlegaspi | Ron Legaspi]]
*Quality Assurance:[[User:Jkuroda | Josh Kuroda]]
*GenMAPP User:[[User:Emilysimso | Emily Simso]]
==Important Links==
====[[Heavy Metal HaterZ Files | Our Files]]====
====[[Heavy Metal HaterZ Deliverables | Our Deliverables]]====
{{Gene_Database_Project_Links}}
{{HMH Individual Journal Entries}}

[[Category:Group Projects]]
[[Category:Heavy Metal HaterZ]]

Heavy Metal HaterZ

2015-12-03T23:39:34Z

Jkuroda: /* Week 14 Assignment */ josh2

{{Heavy Metal HaterZ}}
==Week 14 Assignment==
===Goals===
*Coder/QA
**Analyze the initial exports and make any necessary changes to the custom species profile to capture all of the IDs for your species
*GenMAPP Users
**Finish statistical analysis of compiled microarray data
**Prepare file for GenMAPP
===Status Report===
*Mary-
* Josh- Completed the export using the customized GenMAPP Builder from Mary. Checked the .gdb file and everything checked out. Used GenMAPP to see if the gene ID links worked and they did. Made more progress on the Gene Database Testing Report.

====Reflections====
#What worked?
#What didn't work?
#What will I do next to fix what didn't work?

==== Josh's Reflection ====

# What worked?
#* Our customized export seemed to work, since we got the 4196 count for Ordered Locus Names for which we were looking.
# What didn't work?
#* The 11 IDs we found that did not have gene tags in the XML file are a small issue for us. None of them exist in our model organism database, but 8 of them are present in our microarray data.
# What will I do next to fix what didn't work?
#* We are waiting on input from Professor Dahlquist regarding our next steps with these 11 IDs. Once we find out, we will act accordingly and possibly edit our code.

==Week 12 Assignment==
===Goals===
*Coder/QA
**Prepare for journal club presentation
**Perform an initial Gene Database export and Gene Database Testing Report
*GenMAPP Users
**Compile the raw data in preparation for normalization and statistical analysis.
===Status Report===
*Emily: Uploaded and formatted all microarray files for the samples that were repleted with ferrous sulfate.
*Mary: Prepared for genome paper journal club presentation. I also pushed the code from GitHub onto a computer in the lab, which included downloading Eclipse and Git for Windows on the lab computer.
*Josh: Prepared for genome paper presentation with Mary. Completed the initial import/export cycle and made significant progress on the Gene Database Testing Report.
*Ron: Similar to Emily, downloaded the microarray raw data files, followed the procedure given by Dr. Dahlquist for data processing (I worked with the files related to iron depletion with the iron chelator), and I uploaded the files to the wiki.

==Week 11 Assignment : Journal Club Presentation==
===Presentation Slides===
*These can also be accessed by going to our [[Heavy Metal HaterZ Files | Files]] page.

*[[File:Genome_Paper_Presentation_20151124_HMH.pptx]]
*[[File:SoMicroarrayPaperPresentation 20151117 HMH.pptx]]

===Goals===
*Prepare for journal club presentations
*Begin initial tasks on your research project
**Coder/QA
***Set up coding/testing environment
***Determine the regular expression for the ordered locus ID for your species
***Identify the appropriate model organism database for your species.
***Perform an initial Gene Database export and Gene Database Testing Report
**GenMAPP Users
***Describe the experimental design of the microarray data, including treatments, number of replicates (biological and/or technical), dye swaps.
***Determine the sample and data relationships, i.e., which files in the data correspond to which samples in the experimental design.
***Compile the raw data in preparation for normalization and statistical analysis.
===Status Report===
*Emily: worked on journal club presentation and created flow chart diagrams for the experimental design
*Mary: Completed journal club presentation slides with Josh. I downloaded Eclipse on my personal laptop, so along with the use of the lab computers my coding/testing environment should be set up. I determined with Josh the regular expression of the ordered locus ID for our species. However, I have not yet been able to perform an initial export.
*Ron: Completed journal club presentation slides with Emily and uploaded slides in HMH Files pages. [[Media:SoMicroarrayPaperPresentation 20151117 HMH.pptx | Link to Microarray Paper Presentation here.]] Looked over sample and data relationships file from ArrayExpress entry (E-GEOD-15334) and converted .txt file into .xlsx file. I have not been able to compile raw data with Emily, as we still need clarification on which files are to be used for statistical analysis.
*Josh: Completed the genome paper presentation with Mary and did more research on our organism. Haven't done an initial import/export cycle yet. Planning to complete that later this week.

==Week 10 Assignment : Annotated Bibliography==

===Our Genome Paper===

Heidelberg, J. F., Paulsen, I. T., Nelson, K. E., Gaidos, E. J., Nelson, W. C., Read, T. D., ... & Fraser, C. M. (2002). Genome sequence of the dissimilatory metal ion–reducing bacterium Shewanella oneidensis. ''Nature biotechnology, 20''(11), 1118-1123. doi:10.1038/nbt749
*The [http://www.ncbi.nlm.nih.gov/pubmed/?term=Genome+sequence+of+the+dissimilatory+metal+ion%E2%80%93reducing+bacterium+Shewanella+oneidensis abstract] from PubMed.
*The full text of the article in PubMedCentral : Not available.
*The [http://www.nature.com/nbt/journal/v20/n11/full/nbt749.html full text] of the article from the publisher web site. 
*The [http://www.nature.com/nbt/journal/v20/n11/pdf/nbt749.pdf full PDF version] of the article from the publisher web site.
*Who owns the rights to the article?
**The Nature Publishing Group, which is the publisher of this article, according to this [https://s100.copyright.com/AppDispatchServlet?publisherName=NPG&publication=Nature%20Biotechnology&title=Genome%20sequence%20of%20the%20dissimilatory%20metal%20ion-reducing%20bacterium%20Shewanella%20oneidensis&author=John%20F.%20Heidelberg,%20Ian%20T.%20Paulsen,%20Karen%20E.%20Nelson,%20Eric%20J.%20Gaidos,%20William%20C.%20Nelson%20et%20al.&contentID=10.1038/nbt749&publicationDate=10/07/2002&volumeNum=20&issueNum=11&numPages=6&pageNumbers=pp1118-1123 site].
*Do the authors own the rights under a Creative Commons license?
**Yes, according to this [http://oaspa.org/member/nature-publishing-group-palgrave-macmillan/ site].
*Is the article available “Open Access”?
**According to [http://oaspa.org/membership/members/ this site], the article is available "Open Access".
*What organization is the publisher of the article? What type of organization is it?
**According to the site above, this publisher is a "Professional OA Publisher (Large)".
*Is this article available in print or online only?
**Online only. It was published online in November, 2002.
*Has LMU paid a subscription or other fee for your access to this article?
**No.
*We performed a search in the ISI Web of Science/Knowledge database by typing in the title "Genome sequence of the dissimilatory metal ion–reducing bacterium Shewanella oneidensis" to the search bar.
**Three articles came up as results. The titles of the first two did not exactly match, and each had been cited fewer than 15 times. The third article was the one we were searching for.
*How many articles does this article cite?
**This article has 41 cited references within the Web of Science Core Collection, according to this [https://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=3&SID=3Evs6J6HvCojNOHG6K3&page=1&doc=3 site].
*How many articles cite this article?
**It has been cited 1079 times in all databases, and 426 within the Web of Science Core Collection, according to this [https://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=3&SID=3Evs6J6HvCojNOHG6K3&page=1&doc=3 site].
*Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
**Examples of titles that reference the genome paper:
***Environmental genome shotgun sequencing of the Sargasso Sea
***Deciphering the evolution and metabolism of an anammox bacterium from a community genome
***Genome of Geobacter sulfurreducens: Metal reduction in subsurface environments
***More can be found by clicking this [https://apps.webofknowledge.com/summary.do?product=WOS&parentProduct=UA&search_mode=CitingArticles&qid=8&SID=3Evs6J6HvCojNOHG6K3&page=1&action=sort&sortBy=LC.D;PY.D;AU.A.en;SO.A.en;VL.D;PG.A&showFirstPage=1 link].
**These papers include further studies within the species, genome sequencing of other species, and work on the metabolic versatility of microorganisms and metal ion reduction in the environment. This shows that a sequenced genome can aid experiments of many kinds.

===Our Microarray Paper===
*Dataset can be found at this [https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15334/?keywords=&organism=Shewanella+oneidensis&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array= link].


====E-GEOD-15334: Yang et al. (2009)====

This paper is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 09:41, 10 November 2015 (PST)''

Yang, Y., Harris, D. P., Luo, F., Xiong, W., Joachimiak, M., Wu, L., ... & Zhou, J. (2009). Snapshot of iron response in Shewanella oneidensis by gene network reconstruction. ''BMC genomics, 10''(1), 131.
*The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed/?term=Yang%2C+Y.%2C+Harris%2C+D.+P.%2C+Luo%2C+F.%2C+Xiong%2C+W.%2C+Joachimiak%2C+M.%2C+Wu%2C+L.%2C+...+%26+Zhou%2C+J.+%282009%29.+Snapshot+of+iron+response+in+Shewanella+oneidensis+by+gene+network+reconstruction.+BMC+genomics%2C+10%281%29%2C+131. PubMed].
*The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2667191/ PubMedCentral]
*The link to the full text of the article (HTML format) from the publisher [http://www.biomedcentral.com/1471-2164/10/131 web site].
*The link to the full [http://www.biomedcentral.com/content/pdf/1471-2164-10-131.pdf PDF] version of the article from the publisher web site.
*Who owns the rights to the article?
**The article is Open Access and the authors own the rights under a Creative Commons license.
*What organization is the publisher of the article? What type of organization is it?
**BioMed Central is the publisher of the journal ''BMC Genomics''. It is a for-profit open access publisher rather than a scientific society.
*Is this article available in print or online only?
**It is online only
*Has LMU paid a subscription or other fee for your access to this article?
**No
*How many articles does this article cite?
**This paper cites 48 other articles.
*How many articles cite this article?
**3
***Roles of UndA and MtrC of ''Shewanella putrefaciens'' W3-18-1 in iron reduction
***Global transcriptional response of ''Caulobacter crescentus'' to iron availability
***Molecular ecological network analysis
*Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced?
**This article has mostly been used to look at the iron response of other strains or organisms. It may have been used for comparison's sake or to modify the original methodology to fit the new experiment.
*Link to microarray data
**Found it on [https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15334/ ArrayExpress]
**This contains the raw data that we will use for our research
*What experiment was performed? What was the "treatment" and what was the "control" in the experiment?
**Strains of ''Shewanella oneidensis'' were put under iron depletion and repletion conditions. The control would be a regular strain of the organism, while the treatments would be either increasing or decreasing the iron levels.
*Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each?
**4 biological replicates of each treatment condition were performed

Heavy Metal HaterZ

2015-12-03T23:33:55Z

Jkuroda: /* Week 14 Assignment */ josh

{{Heavy Metal HaterZ}}
==Week 14 Assignment==
===Goals===
*Coder/QA
**Analyze the initial exports and make any necessary changes to the custom species profile to capture all of the IDs for your species
*GenMAPP Users
**Finish statistical analysis of compiled microarray data
**Prepare file for GenMAPP
===Status Report===
*Mary-

====Reflections====
#What worked?
#What didn't work?
#What will I do next to fix what didn't work?

* Josh

==== Reflection ====

# What worked?
# What didn't work?
# What will I do next to fix what didn't work?


Heavy Metal HaterZ Deliverables

2015-12-03T23:28:55Z

Jkuroda: group files & datasets

== Group Files and Datasets ==

* [[Media:So-Std 20151201special.zip | GenMAPP Gene Database for assigned species]]
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

[[Gene Database Project Deliverables]]
[[Category:Heavy Metal HaterZ]]
[[Category:Group Projects]]

Jkuroda Week 15

2015-12-03T23:24:20Z

Jkuroda: fixing

==Individual Assessment & Reflection==
=== Statement of Work ===

* Describe exactly what you did on the project.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Template:Journal Template}}

Jkuroda Week 15

2015-12-03T23:24:02Z

Jkuroda: minor

Individual Assessment & Reflection
=== Statement of Work ===

* Describe exactly what you did on the project.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Template:Journal Template}}

Jkuroda Week 15

2015-12-03T23:23:09Z

Jkuroda: template

=== Statement of Work ===

* Describe exactly what you did on the project.
* Provide references or links to artifacts of your work, such as:
** Wiki pages
** Other files or documents
** Code or scripts

=== Assessment of Project ===

* Give an objective assessment of the success of your project workflow and teamwork.
* What worked and what didn't work?
* What would you do differently if you could do it all over again?
* Evaluate the Gene Database Project and Group Report in the following areas:
*# Content: What is the quality of the work?
*# Organization: Comment on the organization of the project and of your group's wiki pages.
*# Completeness: Did your team achieve all of the project objectives? Why or why not?

=== Reflection on the Process ===

* What did you learn?
** With your head (biological or computer science principles)
** With your heart (personal qualities and teamwork qualities that make things work or not work)?
** With your hands (technical skills)?
* What lesson will you take away from this project that you will still use a year from now?

{{Template:Journal Template}}

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-03T23:08:25Z

Jkuroda: /* Analysis */ adding info about 11 IDs

{{Heavy Metal HaterZ}}
==Export Information==

Version of GenMAPP Builder: '''3 build 5'''

Computer on which export was run: '''HP Compaq 8300 Elite SFF FC'''

Postgres Database name: '''S. Oneidensis'''

UniProt XML filename: [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']]
* UniProt XML version: '''UniProt release 2015_10 - October 14, 2015'''
* [http://www.uniprot.org/uniprot/?query=taxonomy:211586 UniProt XML download link]
* Time taken to import: '''3.18 minutes'''
** Note: ''n/a''

GO OBO-XML filename: [[Media:Go daily-termdb.obo-xml.gz | '''go daily-termdb.obo-xml''']]
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
* [http://geneontology.org/page/download-ontology#Legacy_Downloads GO OBO-XML download link]
* Time taken to import: '''7.16 minutes'''
* Time taken to process: '''4.27 minutes'''
** Note: ''n/a''

GOA filename: [[Media:ShewanellaOneidensisGOA.zip | '''ShewanellaOneidensisGOA''']]
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''October 14, 2015 - GOA Proteome Sets 124'''
* [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/106.S_oneidensis.goa GOA download link]
* Time taken to import: '''0.05 minutes'''
** Note: ''n/a''

Name of .gdb file: [[Media:So-Std 20151119HMH.zip | '''So-Std 20151119HMH.gdb''']]
* Time taken to export: '''1 hour and 18 minutes'''
** Start time: '''3:48pm'''
** End time: '''5:06pm'''
** Note: ''n/a''

==TallyEngine==

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
** '''4196''' IDs
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

[[Image:TallyEngineSOneidensis.PNG | center | 540px]]

== Using XMLPipeDB match to Validate the XML Results from the TallyEngine==

[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]

Are your results the same as you got for the TallyEngine? Why or why not?
* Initially, we got '''4196''' IDs for both XML and Postgres DB from TallyEngine but got '''4079''' IDs by using XMLPipeDB match. This result was from using the following command:
<code>java -jar xmlpipedb-match-1.1.1.jar "SO_[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT</code>

* After checking the .gdb file and looking through the Gene IDs, I found that some IDs were in the form "SO_A####" so I ran a new command accounting for this:
<code>java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT</code>
* This gave a total number of '''4207''' IDs.
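
The effect of the optional <code>A</code> in the pattern can be illustrated outside of XMLPipeDB match. The following is a minimal Python sketch, not the xmlpipedb-match implementation: the sample text and the helper name <code>count_unique_matches</code> are hypothetical, and it assumes that the quantity of interest is the number of distinct matching strings.

```python
import re


def count_unique_matches(pattern, text):
    """Return the number of distinct substrings of text that match pattern."""
    return len(set(re.findall(pattern, text)))


# Hypothetical stand-in for a fragment of the UniProt XML file.
sample = """
<name type="ordered locus">SO_0001</name>
<name type="ordered locus">SO_0002</name>
<name type="ordered locus">SO_A0003</name>
<name type="ordered locus">SO_0002</name>
"""

# The narrow pattern cannot match SO_A0003, because "A" is not a digit.
narrow = count_unique_matches(r"SO_[0-9]{4}", sample)
# Making the "A" optional picks up both ID forms.
broad = count_unique_matches(r"SO_A?[0-9]{4}", sample)
print(narrow, broad)
```

In this toy input the narrow pattern misses the <code>SO_A####</code> ID entirely, which is the same kind of undercount seen with the first command above.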

== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==

For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query that counts the values in this table that were marked as ''ordered locus'' in the XML file and match the pattern ''SO_[0-9][0-9][0-9][0-9]'' would look like this:

<code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_[0-9][0-9][0-9][0-9]';</code>
* However, once I found that some IDs were in the form "SO_A####" I tweaked the pattern to account for those IDs:
<code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';</code>

In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.

Are your results the same as reported by the TallyEngine? Why or why not?
* Initially, we got a count of '''4068''' IDs using SQL, which differed from the '''4196''' IDs from TallyEngine.
* After tweaking the pattern to account for IDs with that extra ''A'', we got a grand total of '''4196''' IDs, which matches with what TallyEngine gave us.
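
The logic of the two queries can be tried on a toy table. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database from Python rather than our PostgreSQL database, it emulates the PostgreSQL <code>~</code> operator with a user-defined <code>REGEXP</code> function, and the rows and the helper name <code>count_ordered_locus</code> are hypothetical.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Called by SQLite for 'value REGEXP pattern'; unanchored match, like ~ in PostgreSQL."""
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE genenametype (type TEXT, value TEXT)")
# Hypothetical rows; the real table is filled by the UniProt XML import.
conn.executemany(
    "INSERT INTO genenametype VALUES (?, ?)",
    [
        ("ordered locus", "SO_0001"),
        ("ordered locus", "SO_0002"),
        ("ordered locus", "SO_A0003"),
        ("primary", "dnaA"),
    ],
)


def count_ordered_locus(pattern):
    """Count ordered locus names whose value matches the given pattern."""
    row = conn.execute(
        "SELECT count(*) FROM genenametype "
        "WHERE type = 'ordered locus' AND value REGEXP ?",
        (pattern,),
    ).fetchone()
    return row[0]


narrow_count = count_ordered_locus("SO_[0-9][0-9][0-9][0-9]")
broad_count = count_ordered_locus("SO_A?[0-9][0-9][0-9][0-9]")
print(narrow_count, broad_count)
```

As with the real database, the count only agrees with the number of ordered locus rows once the pattern allows for the extra ''A''.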

==OriginalRowCounts Comparison==

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download 2010 benchmark file]

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

[[Media:OriginalRowCounts.pdf | Original Row Counts Table]]

[[Media:OriginalRowCounts2010.pdf | 2010 Benchmark Original Row Counts Table]]

See Analysis section for more on the comparison and the discrepancy found because of this comparison.

Note: Using Microsoft Access, we found ''7664'' IDs, which was actually double the number of IDs present because of duplicated IDs that did not have an underscore.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** No. For the current version, a good number of gene ID systems in the database do not have a value for the date field. Some systems that lack a date include: GenBank, UniGene, WormBase, and EcoGene.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** For the UniProt table, the IDs start with either ''Q8'' or ''K4'' followed by a string of four letters or numbers. For the RefSeq table, the IDs start with either ''NP_'' or ''WP_''; the ''WP_'' IDs have 9 digits afterwards and the ''NP_'' IDs have 6 digits afterwards. For the OrderedLocusNames table, the IDs either start with <code>SO_</code> or <code>SO_A</code>.

Note: ''n/a''

==Analysis==

Consolidating the counts of gene IDs from the various methods, I got:

* 4196 IDs from Tally Engine
* ''4207'' IDs from xmlpipedb-match
java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* 4196 IDs from PostgreSQL
select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';
* 4196 IDs from Microsoft Access (Noted that there were 4068 IDs in the form <code>SO_####</code> and 128 IDs in the form <code>SO_A####</code>)

Notice that there is a small but significant discrepancy: xmlpipedb-match reports eleven more IDs than the other methods. This is troubling because the other three methods all confirm a total count of 4196. So, I used Microsoft Excel to compare the list of gene IDs from the actual .gdb file with the list I got back from xmlpipedb-match. As you can see in [[Media:Comparing gdb and xmlpipedb-match.xlsx | this document]], there are 11 IDs in the xmlpipedb-match column that are not found in the gdb column. I confirmed this discrepancy with two <code>match</code> functions that show where an ID is missing from either list. Below are the two match functions I used in the document:
=MATCH(A2, B$2:B$4208, 0)
=MATCH(B2, A$2:A$4208, 0)
Below are the eleven IDs in question:
SO_3699
SO_1312
SO_4269
SO_2875
SO_4532
SO_4580
SO_2662
SO_4423
SO_3156
SO_2967
SO_2024
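
The two-way comparison done with the <code>match</code> functions can also be sketched in Python. This is a minimal sketch and not what was actually run: the function name <code>missing_from</code> is hypothetical, and the lists are shortened samples (only SO_3699 and SO_1312 come from the list above).

```python
def missing_from(source, reference):
    """Return the IDs in source that have no match in reference, in order."""
    reference_set = set(reference)
    return [gene_id for gene_id in source if gene_id not in reference_set]

# Shortened, hypothetical stand-ins for the two spreadsheet columns.
gdb_ids = ["SO_0001", "SO_0002", "SO_0003"]
match_ids = ["SO_0001", "SO_0002", "SO_0003", "SO_3699", "SO_1312"]

# Same idea as =MATCH(B2, A$2:A$4208, 0): which match IDs are absent from the gdb?
only_in_match = missing_from(match_ids, gdb_ids)
# Same idea as =MATCH(A2, B$2:B$4208, 0): which gdb IDs are absent from match?
only_in_gdb = missing_from(gdb_ids, match_ids)

print(only_in_match, only_in_gdb)
```

An empty second list means every ID in the .gdb file also appears in the xmlpipedb-match output, so the extra IDs are all on one side.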

None of these IDs are in our MOD.
We searched for these IDs in the microarray statistical analysis sheet and did not find the following IDs:
SO_4269, SO_2875, SO_4580

As of 12/01/15, we are waiting on input from Professor Dahlquist to see if we will adjust our GenMAPP Builder to account for these 11 IDs.
* A manual inspection was done on the [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']] XML file and it looks like these 11 IDs are contained within entries that are missing a gene tag, which explains why the other methods only picked up 4196 IDs.
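
The reasoning above can be sketched in Python: a pattern match over the raw text finds an ID wherever it appears, while a count based on ''gene/name'' tags only sees IDs inside a gene tag. The XML below is a hypothetical, heavily simplified stand-in for the real file; in particular, the <code>dbReference</code> element and its <code>HypotheticalDB</code> type are assumptions made for illustration, since the report does not say where in the entry these IDs appear.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample; the second entry has no gene tag.
SAMPLE_XML = """
<uniprot>
  <entry>
    <gene><name type="ordered locus">SO_0001</name></gene>
  </entry>
  <entry>
    <dbReference type="HypotheticalDB" id="SO_3699"/>
  </entry>
</uniprot>
"""

pattern = re.compile(r"SO_A?[0-9][0-9][0-9][0-9]")

# Text-level matching: every unique ID found anywhere in the file.
text_matches = set(pattern.findall(SAMPLE_XML))

# Tag-level matching: only ordered locus names inside gene/name tags.
root = ET.fromstring(SAMPLE_XML)
tag_matches = {
    name.text
    for name in root.iter("name")
    if name.get("type") == "ordered locus" and pattern.search(name.text or "")
}

extra = sorted(text_matches - tag_matches)
print(extra)
```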

==.gdb Use in GenMAPP==

While the above sections perform quality assurance on the exported Gene Database via verifying ID counts, the "proof in the pudding" is to actually use the Gene Database in GenMAPP. You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP. In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.

For assistance with using the GenMAPP program, the GenMAPP Help is very extensive. To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.

Note: ''n/a''

===Putting a gene on the MAPP using the GeneFinder window===

* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window. Click on the Drafting Board to place the Gene on the MAPP. Now, right-click on the gene to access the GeneFinder window. Type or paste a gene ID into the Gene ID field. Select the appropriate Gene ID system from the drop-down menu and click the Search button. For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID. Once the ID has been found, click the OK button to return to the Drafting Board window.
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.

Note: I tried out the search for a gene ID and was able to bring up the Backpage for that ID. The cross-referenced IDs that were supposed to show up were indeed on the page.

===Creating an Expression Dataset in the Expression Dataset Manager===

* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? Or were they not present in the UniProt XML?

Note: The Expression Dataset Manager reported that there were 121 errors during the conversion. From looking over the error codes, I found that all errors were of the form: <code>Gene not found in OrderedLocusNames or any related system.</code> It seems that these IDs were in fact not present in the UniProt XML.

===Coloring a MAPP with expression data===

Note: I was able to successfully color the MAPP by coloring the increased and decreased Log Fold Changes.

===Running MAPPFinder===

Note: After the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I browsed through the tree to see the results.
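
The highlighting rule described in the note (at least 3 genes measured and a p value below 0.05) can be written as a simple filter. This is a sketch of the rule as stated, not MAPPFinder's actual code; the term names and numbers are hypothetical, as is the function name <code>is_highlighted</code>.

```python
# Hypothetical GO term results: (term, genes measured, p value).
results = [
    ("term A", 12, 0.003),
    ("term B", 2, 0.010),   # too few genes measured
    ("term C", 25, 0.200),  # p value too high
    ("term D", 3, 0.049),
]

def is_highlighted(measured, p_value, min_measured=3, alpha=0.05):
    """Apply the rule from the note: enough genes measured and p below alpha."""
    return measured >= min_measured and p_value < alpha

highlighted = [term for term, measured, p in results if is_highlighted(measured, p)]
print(highlighted)
```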

Documents produced from this run-through can be found here: [[Media:Week 9 genMAPP and MAPPFinder.zip | week 9 docs]]

Jkuroda Week 14

2015-12-02T01:43:15Z

Jkuroda: more log

==Log==
* In class on 12/01/15, I went through the rest of the Testing Report and did the SQL Queries section and manually checked the gene IDs in the .gdb file.
* After looking at the IDs, I found that some of them were in the form "SO_A####".
* Upon discovering this, I went back and tweaked the pattern for the XMLPipeDB and SQL commands. I got a count of ''4207'' IDs for XMLPipeDB and ''4196'' for PostgreSQL.
* Did a visual inspection of individual tables in the gdb file.
* Conducted an analysis of the results collected and found that the xmlpipedb-match had 11 extra IDs because the UNIPROT XML file contained 11 entries with no gene tag.
* Ran an export using the new genMAPP Builder given to me by Mary.
** Started: 4:12pm
** Finished: 5:26pm
** Total time taken: 1 hour and 14 minutes
* Uploaded the resulting .gdb file: [[Media:So-Std 20151201special.zip | new .gdb file]]

{{Template:Journal Template}}

File:So-Std 20151201special.zip

2015-12-02T01:42:21Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-02T01:35:00Z

Jkuroda: /* Analysis */ analysis for SO

{{Heavy Metal HaterZ}}
==Export Information==

Version of GenMAPP Builder: '''3 build 5'''

Computer on which export was run: '''HP Compaq 8300 Elite SFF FC'''

Postgres Database name: '''S. Oneidensis'''

UniProt XML filename: [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']]
* UniProt XML version: '''UniProt release 2015_10 - October 14, 2015'''
* [http://www.uniprot.org/uniprot/?query=taxonomy:211586 UniProt XML download link]
* Time taken to import: '''3.18 minutes'''
** Note: ''n/a''

GO OBO-XML filename: [[Media:Go daily-termdb.obo-xml.gz | '''go daily-termdb.obo-xml''']]
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
* [http://geneontology.org/page/download-ontology#Legacy_Downloads GO OBO-XML download link]
* Time taken to import: '''7.16 minutes'''
* Time taken to process: '''4.27 minutes'''
** Note: ''n/a''

GOA filename: [[Media:ShewanellaOneidensisGOA.zip | '''ShewanellaOneidensisGOA''']]
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''October 14, 2015 - GOA Proteome Sets 124'''
* [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/106.S_oneidensis.goa GOA download link]
* Time taken to import: '''0.05 minutes'''
** Note: ''n/a''

Name of .gdb file: [[Media:So-Std 20151119HMH.zip | '''So-Std 20151119HMH.gdb''']]
* Time taken to export: '''1 hour and 18 minutes'''
** Start time: '''3:48pm'''
** End time: '''5:06pm'''
** Note: ''n/a''

==TallyEngine==

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
** '''4196''' IDs
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

[[Image:TallyEngineSOneidensis.PNG | center | 540px]]

== Using XMLPipeDB match to Validate the XML Results from the TallyEngine==

[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]

Are your results the same as you got for the TallyEngine? Why or why not?
* Initially, we got '''4196''' IDs for both XML and Postgres DB from TallyEngine but got '''4079''' IDs by using XMLPipeDB match. This result was from using the following command:
java -jar xmlpipedb-match-1.1.1.jar "SO_[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT

* After checking the .gdb file and looking through the Gene IDs, I found that some IDs were in the form "SO_A####" so I ran a new command accounting for this:
java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* This gave a total number of '''4207''' IDs.

== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==

For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query that counts the values in this table that were marked as ''ordered locus'' in the XML file and match the pattern ''SO_[0-9][0-9][0-9][0-9]'' would look like this:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_[0-9][0-9][0-9][0-9]';
* However, once I found that some IDs were in the form "SO_A####" I tweaked the pattern to account for those IDs:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';

In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.

Are your results the same as reported by the TallyEngine? Why or why not?
* Initially, we got a count of '''4068''' IDs using SQL, which differed from the '''4196''' IDs from TallyEngine.
* After tweaking the pattern to account for IDs with that extra ''A'', we got a grand total of '''4196''' IDs, which matches with what TallyEngine gave us.

==OriginalRowCounts Comparison==

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download 2010 benchmark file]

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

[[Media:OriginalRowCounts.pdf | Original Row Counts Table]]

[[Media:OriginalRowCounts2010.pdf | 2010 Benchmark Original Row Counts Table]]

See Analysis section for more on the comparison and the discrepancy found because of this comparison.

Note: Using Microsoft Access, we found ''7664'' IDs, which was actually double the number of IDs present because of duplicated IDs that did not have an underscore.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** No. For the current version, a good number of gene ID systems in the database do not have a value for the date field. Some systems that lack a date include: GenBank, UniGene, WormBase, and EcoGene.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** For the UniProt table, the IDs start with either ''Q8'' or ''K4'' followed by a string of four letters or numbers. For the RefSeq table, the IDs start with either ''NP_'' or ''WP_''; the ''WP_'' IDs have 9 digits afterwards and the ''NP_'' IDs have 6 digits afterwards. For the OrderedLocusNames table, the IDs either start with <code>SO_</code> or <code>SO_A</code>.

Note: ''n/a''

==Analysis==

Consolidating the counts of gene IDs from the various methods, I got:

* 4196 IDs from Tally Engine
* ''4207'' IDs from xmlpipedb-match
java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* 4196 IDs from PostgreSQL
select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';
* 4196 IDs from Microsoft Access (Noted that there were 4068 IDs in the form <code>SO_####</code> and 128 IDs in the form <code>SO_A####</code>)

Notice that there is a small but significant discrepancy: xmlpipedb-match reports eleven more IDs than the other methods. This is troubling because the other three methods all confirm a total count of 4196. So, I used Microsoft Excel to compare the list of gene IDs from the actual .gdb file with the list I got back from xmlpipedb-match. As you can see in [[Media:Comparing gdb and xmlpipedb-match.xlsx | this document]], there are 11 IDs in the xmlpipedb-match column that are not found in the gdb column. I confirmed this discrepancy with two <code>match</code> functions that show where an ID is missing from either list. Below are the two match functions I used in the document:
=MATCH(A2, B$2:B$4208, 0)
=MATCH(B2, A$2:A$4208, 0)
Below are the eleven IDs in question:
SO_3699
SO_1312
SO_4269
SO_2875
SO_4532
SO_4580
SO_2662
SO_4423
SO_3156
SO_2967
SO_2024

As of 12/01/15, we are waiting on input from Professor Dahlquist to see if we will adjust our GenMAPP Builder to account for these 11 IDs.
* A manual inspection was done on the [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']] XML file and it looks like these 11 IDs are contained within entries that are missing a gene tag, which explains why the other methods only picked up 4196 IDs.

==.gdb Use in GenMAPP==

While the above sections perform quality assurance on the exported Gene Database via verifying ID counts, the "proof in the pudding" is to actually use the Gene Database in GenMAPP. You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP. In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.

For assistance with using the GenMAPP program, the GenMAPP Help is very extensive. To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.

Note: ''n/a''

===Putting a gene on the MAPP using the GeneFinder window===

* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window. Click on the Drafting Board to place the Gene on the MAPP. Now, right-click on the gene to access the GeneFinder window. Type or paste a gene ID into the Gene ID field. Select the appropriate Gene ID system from the drop-down menu and click the Search button. For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID. Once the ID has been found, click the OK button to return to the Drafting Board window.
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.

Note: I tried out the search for a gene ID and was able to bring up the Backpage for that ID. The cross-referenced IDs that were supposed to show up were indeed on the page.

===Creating an Expression Dataset in the Expression Dataset Manager===

* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? Or were they not present in the UniProt XML?

Note: The Expression Dataset Manager reported that there were 121 errors during the conversion. From looking over the error codes, I found that all errors were of the form: <code>Gene not found in OrderedLocusNames or any related system.</code> It seems that these IDs were in fact not present in the UniProt XML.

===Coloring a MAPP with expression data===

Note: I was able to successfully color the MAPP by coloring the increased and decreased Log Fold Changes.

===Running MAPPFinder===

Note: After the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I browsed through the tree to see the results.

Documents produced from this run-through can be found here: [[Media:Week 9 genMAPP and MAPPFinder.zip | week 9 docs]]

File:Comparing gdb and xmlpipedb-match.xlsx

2015-12-02T01:34:43Z

Jkuroda: heavy metal haterZ

heavy metal haterZ

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-01T23:17:03Z

Jkuroda: /* Visual Inspection */ visual

{{Heavy Metal HaterZ}}
==Export Information==

Version of GenMAPP Builder: '''3 build 5'''

Computer on which export was run: '''HP Compaq 8300 Elite SFF FC'''

Postgres Database name: '''S. Oneidensis'''

UniProt XML filename: [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']]
* UniProt XML version: '''UniProt release 2015_10 - October 14, 2015'''
* [http://www.uniprot.org/uniprot/?query=taxonomy:211586 UniProt XML download link]
* Time taken to import: '''3.18 minutes'''
** Note: ''n/a''

GO OBO-XML filename: [[Media:Go daily-termdb.obo-xml.gz | '''go daily-termdb.obo-xml''']]
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
* [http://geneontology.org/page/download-ontology#Legacy_Downloads GO OBO-XML download link]
* Time taken to import: '''7.16 minutes'''
* Time taken to process: '''4.27 minutes'''
** Note: ''n/a''

GOA filename: [[Media:ShewanellaOneidensisGOA.zip | '''ShewanellaOneidensisGOA''']]
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''October 14, 2015 - GOA Proteome Sets 124'''
* [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/106.S_oneidensis.goa GOA download link]
* Time taken to import: '''0.05 minutes'''
** Note: ''n/a''

Name of .gdb file: [[Media:So-Std 20151119HMH.zip | '''So-Std 20151119HMH.gdb''']]
* Time taken to export: '''1 hour and 18 minutes'''
** Start time: '''3:48pm'''
** End time: '''5:06pm'''
** Note: ''n/a''

==TallyEngine==

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
** '''4196''' IDs
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

[[Image:TallyEngineSOneidensis.PNG | center | 540px]]

== Using XMLPipeDB match to Validate the XML Results from the TallyEngine==

[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]

Are your results the same as you got for the TallyEngine? Why or why not?
* Initially, we got '''4196''' IDs for both XML and Postgres DB from TallyEngine but got '''4079''' IDs by using XMLPipeDB match. This result was from using the following command:
java -jar xmlpipedb-match-1.1.1.jar "SO_[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT

* After checking the .gdb file and looking through the Gene IDs, I found that some IDs were in the form "SO_A####" so I ran a new command accounting for this:
java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT
* This gave a total number of '''4207''' IDs.

== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==

For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query that counts the values in this table that were marked as ''ordered locus'' in the XML file and match the pattern ''SO_[0-9][0-9][0-9][0-9]'' would look like this:

select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_[0-9][0-9][0-9][0-9]';
* However, once I found that some IDs were in the form "SO_A####" I tweaked the pattern to account for those IDs:
select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';

In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.

Are your results the same as reported by the TallyEngine? Why or why not?
* Initially, we got a count of '''4068''' IDs using SQL, which differed from the '''4196''' IDs from TallyEngine.
* After tweaking the pattern to account for IDs with that extra ''A'', we got a grand total of '''4196''' IDs, which matches with what TallyEngine gave us.

==OriginalRowCounts Comparison==

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download 2010 benchmark file]

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

[[Media:OriginalRowCounts.pdf | Original Row Counts Table]]

[[Media:OriginalRowCounts2010.pdf | 2010 Benchmark Original Row Counts Table]]

See Analysis section for more on the comparison and the discrepancy found because of this comparison.

Note: Using Microsoft Access, we found ''7664'' IDs, which was actually double the number of IDs present because of duplicated IDs that did not have an underscore.

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** No. For the current version, a good number of gene ID systems in the database do not have a value for the date field. Some systems that lack a date include: GenBank, UniGene, WormBase, and EcoGene.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** For the UniProt table, the IDs start with either ''Q8'' or ''K4'' followed by a string of four letters or numbers. For the RefSeq table, the IDs start with either ''NP_'' or ''WP_''; the ''WP_'' IDs have 9 digits afterwards and the ''NP_'' IDs have 6 digits afterwards. For the OrderedLocusNames table, the IDs either start with <code>SO_</code> or <code>SO_A</code>.

Note: ''n/a''

==Analysis==

Consolidating the counts of gene IDs from the various methods, I got:

* 3831 IDs from Tally Engine
* 3831 IDs from xmlpipedb-match
java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml
* 3831 IDs from PostgreSQL
select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';
* ''3832'' IDs from Microsoft Access (counting the IDs with an underscore)

Notice that there is a small but significant discrepancy: Microsoft Access reports one more ID than the other methods. This is troubling because the other three methods all confirm a total count of 3831. So, I used Microsoft Excel to compare the list of gene IDs from the actual .gdb file with the list I got back from PostgreSQL. As you can see in [[Media:Comparing gdb and postgresql.xlsx | this document]], line 64 shows that the PostgreSQL list has an ID entry of "VC_1738/VC_1739." I confirmed this with <code>match</code> functions that show where an ID is missing from either list. This entry accounts for the missing ID, because two IDs were apparently joined with a slash and so counted as one. Below are the two match functions I used in the document:
=MATCH(A2, B$2:B$7665, 0)
=MATCH(B2, A$2:A$7665, 0)
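
The effect of the slash-joined entry on the count can be sketched in Python. This is only an illustration under assumptions: the function name <code>split_joined_ids</code> is hypothetical, and apart from "VC_1738/VC_1739" the sample values are made up.

```python
def split_joined_ids(values, separator="/"):
    """Expand entries such as 'VC_1738/VC_1739' into their individual IDs."""
    expanded = []
    for value in values:
        expanded.extend(part.strip() for part in value.split(separator))
    return expanded

# Hypothetical excerpt of the PostgreSQL list around the joined entry.
postgres_values = ["VC_1737", "VC_1738/VC_1739", "VC_1740"]

expanded = split_joined_ids(postgres_values)
print(len(postgres_values), len(expanded))
```

Counting rows treats the joined entry as one ID, while counting the individual IDs gives one more, which is the same size as the gap between 3831 and 3832.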

==.gdb Use in GenMAPP==

While the above sections perform quality assurance on the exported Gene Database via verifying ID counts, the "proof in the pudding" is to actually use the Gene Database in GenMAPP. You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP. In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.

For assistance with using the GenMAPP program, the GenMAPP Help is very extensive. To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.

Note: ''n/a''

===Putting a gene on the MAPP using the GeneFinder window===

* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window. Click on the Drafting Board to place the Gene on the MAPP. Now, right-click on the gene to access the GeneFinder window. Type or paste a gene ID into the Gene ID field. Select the appropriate Gene ID system from the drop-down menu and click the Search button. For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID. Once the ID has been found, click the OK button to return to the Drafting Board window.
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.

Note: I tried out the search for a gene ID and was able to bring up the Backpage for that ID. The cross-referenced IDs that were supposed to show up were indeed on the page.

===Creating an Expression Dataset in the Expression Dataset Manager===

* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? Or were they not present in the UniProt XML?

Note: The Expression Dataset Manager reported that there were 121 errors during the conversion. From looking over the error codes, I found that all errors were of the form: <code>Gene not found in OrderedLocusNames or any related system.</code> It seems that these IDs were in fact not present in the UniProt XML.

===Coloring a MAPP with expression data===

Note: I was able to successfully color the MAPP by coloring the increased and decreased Log Fold Changes.

===Running MAPPFinder===

Note: After the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that have at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I browsed through the tree to see the results.

Documents produced from this run-through can be found here: [[Media:Week 9 genMAPP and MAPPFinder.zip | week 9 docs]]

Jkuroda Week 14

2015-12-01T23:11:10Z

Jkuroda: logging

Gene Database Testing Report - Heavy Metal HaterZ

2015-12-01T23:07:44Z

Jkuroda: /* Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine */ tweaking

{{Heavy Metal HaterZ}}
==Export Information==

Version of GenMAPP Builder: '''3 build 5'''

Computer on which export was run: '''HP Compaq 8300 Elite SFF FC'''

Postgres Database name: '''S. Oneidensis'''

UniProt XML filename: [[Media:SOneidensisUNIPROT.gz | '''SOneidensisUNIPROT''']]
* UniProt XML version: '''UniProt release 2015_10 - October 14, 2015'''
* [http://www.uniprot.org/uniprot/?query=taxonomy:211586 UniProt XML download link]
* Time taken to import: '''3.18 minutes'''
** Note: ''n/a''

GO OBO-XML filename: [[Media:Go daily-termdb.obo-xml.gz | '''go daily-termdb.obo-xml''']]
* GO OBO-XML version (The version information can be found in the file properties after the file downloaded from the [http://beta.geneontology.org/page/download-ontology GO Download page] has been unzipped):
* [http://geneontology.org/page/download-ontology#Legacy_Downloads GO OBO-XML download link]
* Time taken to import: '''7.16 minutes'''
* Time taken to process: '''4.27 minutes'''
** Note: ''n/a''

GOA filename: [[Media:ShewanellaOneidensisGOA.zip | '''ShewanellaOneidensisGOA''']]
* GOA version (News on [http://www.ebi.ac.uk/GOA/ this page] records past releases; current information can be found in the Last modified field on the [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ FTP site]): '''October 14, 2015 - GOA Proteome Sets 124'''
* [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/106.S_oneidensis.goa GOA download link]
* Time taken to import: '''0.05 minutes'''
** Note: ''n/a''

Name of .gdb file: [[Media:So-Std 20151119HMH.zip | '''So-Std 20151119HMH.gdb''']]
* Time taken to export: '''1 hour and 18 minutes'''
** Start time: '''3:48pm'''
** End time: '''5:06pm'''
** Note: ''n/a''

==TallyEngine==

* Run the TallyEngine in GenMAPP Builder and record the number of records for UniProt and GO in the XML data and in the Postgres databases.
** Choose the menu item Tallies > Run XML and Database Tallies for UniProt and GO...
** Take a screenshot of the results. Upload the image to the wiki and display it on this page.
** '''4196''' IDs
** For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

[[Image:TallyEngineSOneidensis.PNG | center | 540px]]

== Using XMLPipeDB match to Validate the XML Results from the TallyEngine==

[[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | Follow the instructions found on this page to run XMLPipeDB match.]]

Are your results the same as you got for the TallyEngine? Why or why not?
* Initially, we got '''4196''' IDs for both XML and Postgres DB from TallyEngine but got '''4079''' IDs by using XMLPipeDB match. This result was from using the following command:
<code>java -jar xmlpipedb-match-1.1.1.jar "SO_[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT</code>

* After checking the .gdb file and looking through the Gene IDs, I found that some IDs were in the form "SO_A####" so I ran a new command accounting for this:
<code>java -jar xmlpipedb-match-1.1.1.jar "SO_A?[0-9][0-9][0-9][0-9]" < SOneidensisUNIPROT</code>
* This gave a total number of '''4207''' IDs.
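The effect of widening the pattern can be reproduced on a small scale. The following is a minimal Python sketch, assuming the match tool counts distinct matches anywhere in the file; the sample text and the function name <code>count_unique_matches</code> are hypothetical stand-ins, not the real UniProt XML or the tool's own code.

```python
import re

def count_unique_matches(text, pattern):
    """Return the number of distinct substrings of text that match pattern."""
    return len(set(re.findall(pattern, text)))

# Hypothetical excerpt standing in for the UniProt XML; not real data.
sample = """
<name type="ordered locus">SO_0001</name>
<name type="ordered locus">SO_A0012</name>
<name type="ORF">SO_0001</name>
"""

# The first pattern cannot match IDs that carry an 'A' after the underscore.
narrow = count_unique_matches(sample, r"SO_[0-9][0-9][0-9][0-9]")
# Making the 'A' optional picks up both ID forms.
wide = count_unique_matches(sample, r"SO_A?[0-9][0-9][0-9][0-9]")
print(narrow, wide)
```

Because the pattern is applied to the whole text rather than to one kind of tag, an ID is counted wherever it appears, which may be one reason a count obtained this way can differ from a tag-specific tally.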

== Using SQL Queries to Validate the PostgreSQL Database Results from the TallyEngine==

For more information, [[How_Do_I_Count_Thee%3F_Let_Me_Count_The_Ways | see this page.]]

You can also look for counts at the SQL level, using some variation of a ''select count(*)'' query. This requires some knowledge of which table received what data. Here’s an initial tip: the ''gene/name'' tags in the XML file land in the ''genenametype'' table. A query that counts the values in this table that were marked as ''ordered locus'' in the XML file and match the pattern ''SO_[0-9][0-9][0-9][0-9]'' would look like this:

<code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_[0-9][0-9][0-9][0-9]';</code>
* However, once I found that some IDs were in the form "SO_A####", I tweaked the pattern to account for those IDs:
<code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'SO_A?[0-9][0-9][0-9][0-9]';</code>

In ''pgAdmin III'', you can issue these queries by clicking on the pencil/SQL icon in the toolbar, typing the query into the ''SQL Editor'' tab, then clicking on the green triangular ''Play'' button to run.

Are your results the same as reported by the TallyEngine? Why or why not?
* Initially, we got a count of '''4068''' IDs using SQL, which differed from the '''4196''' IDs from TallyEngine.
* After tweaking the pattern to account for IDs with that extra ''A'', we got a total of '''4196''' IDs, which matches what TallyEngine gave us.
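To illustrate why the first query undercounts, here is a rough Python sketch that mimics the two conditions of the query (the ''type'' filter and the unanchored regular expression match performed by the PostgreSQL <code>~</code> operator). The rows and the function name <code>count_ordered_locus</code> are hypothetical; they only stand in for the ''genenametype'' table.

```python
import re

# Hypothetical rows standing in for the genenametype table: (type, value).
rows = [
    ("ordered locus", "SO_0001"),
    ("ordered locus", "SO_A0012"),
    ("ORF", "SO_0001"),
    ("primary", "dnaA"),
]

def count_ordered_locus(rows, pattern):
    """Count rows typed 'ordered locus' whose value contains a match for pattern.

    re.search is unanchored, like the PostgreSQL ~ operator.
    """
    regex = re.compile(pattern)
    return sum(
        1 for kind, value in rows
        if kind == "ordered locus" and regex.search(value)
    )

before = count_ordered_locus(rows, r"SO_[0-9][0-9][0-9][0-9]")
after = count_ordered_locus(rows, r"SO_A?[0-9][0-9][0-9][0-9]")
print(before, after, after - before)
```

In the real database the same widening raised the count from 4068 to 4196, a difference of 128, which is the same difference seen between the two XMLPipeDB match runs (4079 and 4207).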

==OriginalRowCounts Comparison==

Within the .gdb file, look at the OriginalRowCounts table to see if the database has the expected tables with the expected number of records. Compare the tables and records with a benchmark .gdb file.

Benchmark .gdb file: [http://sourceforge.net/projects/xmlpipedb/files/V.%20cholerae%20Gene%20Database/V.%20cholerae%2020101022/Vc-Std_External_20101022.zip/download 2010 benchmark file]

Copy the OriginalRowCounts table from the benchmark and new gdb and paste them here:

[[Media:OriginalRowCounts.pdf | Original Row Counts Table]]

[[Media:OriginalRowCounts2010.pdf | 2010 Benchmark Original Row Counts Table]]

See Analysis section for more on the comparison and the discrepancy found because of this comparison.

Note: Using Microsoft Access, we found ''7664'' IDs, which was actually double the number of IDs present because of duplicated IDs that did not have an underscore.
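The doubling can be checked by separating the IDs that contain an underscore from those that do not. This is a small Python sketch with a hypothetical ID list and a hypothetical helper name <code>split_by_underscore</code>; it assumes each underscore-free ID is simply a copy of an underscored one with the first underscore removed.

```python
def split_by_underscore(ids):
    """Split IDs into those containing an underscore and those without one."""
    with_underscore = [i for i in ids if "_" in i]
    without_underscore = [i for i in ids if "_" not in i]
    return with_underscore, without_underscore

# Hypothetical IDs; each one appears with and without its underscore.
ids = ["VC_0001", "VC0001", "VC_A0002", "VCA0002"]

with_underscore, without_underscore = split_by_underscore(ids)

# True when every underscore-free ID is a copy of an underscored ID.
paired = {i.replace("_", "", 1) for i in with_underscore} == set(without_underscore)
print(len(ids), len(with_underscore), paired)
```

Applied to the count above, half of 7664 is 3832, which is the number of IDs with an underscore reported in the Analysis section.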

==Visual Inspection==

Perform visual inspection of individual tables to see if there are any problems.

* Look at the Systems table. Is there a date in the Date field for all gene ID systems present in the database?
** No. For the current version, a good number of gene ID systems in the database do not have a value for the date field. Some systems that lack a date include: GenBank, UniGene, WormBase, and EcoGene.
** In the 2010 version, there are also missing dates. The systems listed above do not have a date in this version either.
* Open the UniProt, RefSeq, and OrderedLocusNames tables. Scroll down through the table. Do all of the IDs look like they take the correct form for that type of ID?
** For the UniProt tables, both versions seem to have ID forms where most start with Q9. For the RefSeq tables, the 2010 version has ID forms that start with NP, but my version has forms that start with either NP or WP. For the OrderedLocusNames tables, both versions seem to have the same ID form, which starts with either <code>VC</code> or <code>VC_A</code>. Notably, both versions have duplicates because of the removal of the underscore from the gene ID.

Note: ''n/a''

==Analysis==

Consolidating the counts of gene IDs from the various methods, I got:

* 3831 IDs from TallyEngine
* 3831 IDs from xmlpipedb-match
<code>java -jar xmlpipedb-match-1.1.1.jar "VC_A?[0-9][0-9][0-9][0-9]" < uniprot-organism%3A243277.xml</code>
* 3831 IDs from PostgreSQL
<code>select count(*) from genenametype where type = 'ordered locus' and value ~ 'VC_A?[0-9][0-9][0-9][0-9]';</code>
* ''3832'' IDs from Microsoft Access (counting the IDs with an underscore)

Note the small but important discrepancy: Microsoft Access reports one more ID than the other methods. This is troubling because the other three methods all agree on a total of 3831. To track it down, I used Microsoft Excel to compare the list of gene IDs from the .gdb file against the list returned by PostgreSQL. As [[Media:Comparing gdb and postgresql.xlsx | this document]] shows, line 64 of the PostgreSQL list contains the entry "VC_1738/VC_1739." I confirmed this with Excel <code>match</code> functions, which show where an ID is missing from either list. This entry accounts for the missing ID, because two IDs were apparently joined with a slash into a single entry. Below are the two match functions I used in the document:
<code>=MATCH(A2, B$2:B$7665, 0)</code>

<code>=MATCH(B2, A$2:A$7665, 0)</code>
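The two-way lookup done with the match functions can also be sketched in Python. An exact-match <code>MATCH(..., 0)</code> returns #N/A when a value has no counterpart in the other column; the hypothetical helper <code>unmatched</code> below collects those values instead. The short lists are made up for illustration, and only the slash-joined entry mirrors what the spreadsheet showed.

```python
def unmatched(source, target):
    """Return the items of source that have no exact match in target, in order."""
    lookup = set(target)
    return [item for item in source if item not in lookup]

# Hypothetical short lists; only the slash-joined entry mirrors the report.
gdb_ids = ["VC_1737", "VC_1738", "VC_1739", "VC_1740"]
postgres_ids = ["VC_1737", "VC_1738/VC_1739", "VC_1740"]

only_in_gdb = unmatched(gdb_ids, postgres_ids)        # like MATCH(A2, B..., 0)
only_in_postgres = unmatched(postgres_ids, gdb_ids)   # like MATCH(B2, A..., 0)

print(only_in_gdb)
print(only_in_postgres)
print(len(gdb_ids) - len(postgres_ids))
```

In this sketch two separate IDs on one side line up with a single joined entry on the other, so the lists differ in length by one, which is the same shape as the 3832 versus 3831 discrepancy.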

==.gdb Use in GenMAPP==

While the above sections perform quality assurance on the exported Gene Database by verifying ID counts, the "proof of the pudding" is to actually use the Gene Database in GenMAPP. You can follow the instructions in [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols Part 2 of the ''Vibrio cholerae'' Microarray Data Analysis] to verify that the Gene Database works in GenMAPP. In this case, the emphasis is not on the findings of the data analysis itself, but on whether the Gene Database functions appropriately in GenMAPP.

For assistance with using the GenMAPP program, the GenMAPP Help is very extensive. To access it within GenMAPP, go to the menu item Help > GenMAPP Help and either browse or search for your topic of interest.

Note: ''n/a''

===Putting a gene on the MAPP using the GeneFinder window===

* In the main GenMAPP Drafting Board window, left-click on the icon for "Gene" in the upper left corner of the window. Click on the Drafting Board to place the Gene on the MAPP. Now, right-click on the gene to access the GeneFinder window. Type or paste a gene ID into the Gene ID field. Select the appropriate Gene ID system from the drop-down menu and click the Search button. For example, for ''Vibrio cholerae'', you could search for the ID "VC0028", which is an OrderedLocusNames ID. Once the ID has been found, click the OK button to return to the Drafting Board window.
** For the Final Project, you will need to try a sample ID from each of the gene ID systems, not just OrderedLocusNames.
* Open the Backpage by left-clicking on the gene box on the Drafting Board to see if all of the cross-referenced IDs that are supposed to be there are there.

Note: I tried out the search for a gene ID and was able to bring up the Backpage for that ID. The cross-referenced IDs that were supposed to show up were indeed on the page.

===Creating an Expression Dataset in the Expression Dataset Manager===

* How many of the IDs were imported out of the total IDs in the microarray dataset? How many exceptions were there? Look in the EX.txt file and look at the error codes for the records that were not imported into the Expression Dataset. Do these represent IDs that were present in the UniProt XML, but were somehow not imported? Or were they not present in the UniProt XML?

Note: The Expression Dataset Manager reported that there were 121 errors during the conversion. From looking over the error codes, I found that all errors were of the form: <code>Gene not found in OrderedLocusNames or any related system.</code> It seems that these IDs were in fact not present in the UniProt XML.

===Coloring a MAPP with expression data===

Note: I successfully colored the MAPP according to the increased and decreased log fold changes.

===Running MAPPFinder===

Note: After the results had been calculated, a Gene Ontology browser opened showing my results. All of the Gene Ontology terms that had at least 3 genes measured and a p value of less than 0.05 were highlighted yellow. A term with a p value less than 0.05 is considered a "significant" result. I browsed through the tree to see the results.
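The highlighting rule described in the note can be written as a simple filter. This is only a sketch of the two criteria as stated here (at least 3 genes measured and a p value below 0.05), not MAPPFinder's own code; the <code>GoTerm</code> record, the term names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GoTerm:
    """Hypothetical record for one Gene Ontology term in a results list."""
    name: str
    genes_measured: int
    p_value: float

def is_highlighted(term, min_genes=3, alpha=0.05):
    """Apply both criteria: enough genes measured and a p value below alpha."""
    return term.genes_measured >= min_genes and term.p_value < alpha

# Made-up terms chosen to exercise each side of the two thresholds.
terms = [
    GoTerm("term A", 5, 0.01),    # passes both criteria
    GoTerm("term B", 2, 0.01),    # too few genes measured
    GoTerm("term C", 10, 0.20),   # p value too high
    GoTerm("term D", 3, 0.049),   # passes, sits right at the gene threshold
]

highlighted = [t.name for t in terms if is_highlighted(t)]
print(highlighted)
```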

Documents produced from this run-through can be found here: [[Media:Week 9 genMAPP and MAPPFinder.zip | week 9 docs]]