Jkuroda Week 15
Latest revision as of 21:50, 18 December 2015
Log
- We took some incomplete statistical analysis data from the GenMAPP users and used it to create a new expression dataset, which generated an exception file that revealed some issues with our database.
- The first time we ran it, there were exceptions for every single gene, because we did not compensate for the underscore in the ID. After inserting the underscore after the 'SO', we were able to find the actual errors.
- First of all, there are 5408 genes listed in their data, compared to the 4196 genes we have in our database.
- There are 760 gene IDs that are in the form SO_####F, which are genes that don't exist in our database.
- There are 681 gene IDs that are in a 'normal' form (either SO_#### or SO_A####) but do not exist in our database.
- For some of the gene IDs that have 'F's, there are multiple genes with the same ID.
- We attempted a batch search on UniProt for all 1441 missing IDs and got zero results for them in that database. Furthermore, we did a spot check by searching for roughly every 100th ID in UniProtKB and found that none of the IDs we searched for exist.
- We also searched for the 'F' IDs in our MOD, and none of them exist there either.
- After this analysis, we have come to the conclusion that these 1441 IDs can be safely ignored, since they do not exist in UniProt. We will simply need to modify our code to account for the absence of an underscore, much like we did with Vibrio cholerae.
- In class on 12/10/15, we worked on figuring out the corrections for our GenMAPP code and made a new dataset using the finished data from the GenMAPP users.
- Then we ran MAPPFinder and ran into an issue because we were using the wrong column.
- Our group met up on 12/12/15 and continued working.
- We were able to run MAPPFinder successfully and generated all of the necessary files for our deliverables.
- As of now, we are waiting to find out how to get a sample MAPP file of a relevant biological pathway.
- We are going to individually work on our presentation slides and go over the entire thing on Monday night.
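The ID comparison described in the log can be sketched in Python. This is only a minimal illustration, not the code we actually ran: the function names `normalize_id` and `classify_missing` are hypothetical, the sample IDs are made up, and the patterns assume that IDs look like SO_####, SO_A####, or SO_####F as described above.

```python
import re

# Assumed ID shapes, taken from the log: SO_#### or SO_A#### ("normal"),
# and the same with a trailing F.
NORMAL_FORM = re.compile(r"SO_A?\d{4}")
F_FORM = re.compile(r"SO_A?\d{4}F")


def normalize_id(raw_id):
    """Insert the underscore after 'SO' when it is missing (SO1234 -> SO_1234)."""
    if raw_id.startswith("SO") and not raw_id.startswith("SO_"):
        return "SO_" + raw_id[2:]
    return raw_id


def classify_missing(dataset_ids, database_ids):
    """Split dataset IDs that are absent from the database into three lists:
    IDs ending in F, IDs in the normal form, and anything else.
    Duplicate IDs in the dataset are kept, so repeats stay visible."""
    known = set(database_ids)
    missing_f, missing_normal, other = [], [], []
    for raw in dataset_ids:
        gene_id = normalize_id(raw)
        if gene_id in known:
            continue
        if F_FORM.fullmatch(gene_id):
            missing_f.append(gene_id)
        elif NORMAL_FORM.fullmatch(gene_id):
            missing_normal.append(gene_id)
        else:
            other.append(gene_id)
    return missing_f, missing_normal, other


# Made-up sample data, for illustration only.
dataset = ["SO0001", "SO0002", "SO0003F", "SOA0004", "SO0003F"]
database = ["SO_0001", "SO_A0004"]
missing_f, missing_normal, other = classify_missing(dataset, database)
print(len(missing_f), len(missing_normal), len(other))
```

Keeping the F-form and normal-form lists separate mirrors the two counts in the log (760 and 681), which together give the 1441 missing IDs.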
Individual Assessment & Reflection
Statement of Work
- Describe exactly what you did on the project.
We first downloaded the UniProt XML proteome set file (UniProt release 2015_10), the GO association file (GOA Proteome Sets 124), and the GO OBO-XML file (version 2015-11-01) on November 20, 2015. Next, we created a new database in PostgreSQL by executing the SQL code from the sql folder of the latest GenMAPP Builder build, which created 167 empty tables. With that foundation in place, we configured GenMAPP Builder to connect to our PostgreSQL database and used it to import the UniProt XML file, GOA file, and GO OBO-XML file. We were then able to export a GenMAPP gene database, making sure that the export included all molecular function, cellular component, and biological process gene ontology terms. The export took one hour and 18 minutes.
Inspecting and validating our gene database was a long but important process. Although we had successfully exported the database, the export would mean nothing unless we verified that the data in it was valid and accurate. The first check used the TallyEngine in GenMAPP Builder to count the UniProt and GO records in the XML data and in the PostgreSQL database. The table (image X) produced by TallyEngine confirmed that the XML and PostgreSQL Ordered Locus counts were both 4196. The next check used the XMLPipeDB Match function to validate the results from the TallyEngine table. Initially, the regex pattern we used caught only 4079 IDs, because over 100 Ordered Locus names contained the extra character 'A' in their ID. After we accounted for those IDs in the regex pattern, we got a count of 4207, which was 11 more than we were expecting. A quick look at the raw XML file told us that TallyEngine had not picked up these 11 IDs because they were missing gene tags; XMLPipeDB Match recognized the pattern in another section of the file that TallyEngine did not check. Our next check used an SQL query to validate the PostgreSQL database results from the TallyEngine. Using a regex pattern similar to the one used for XMLPipeDB Match, we got a confirmed count of 4196 IDs. Finally, we visually inspected the gene database itself using Microsoft Access. We checked the UniProt, RefSeq, and OrderedLocusNames tables to make sure all of the IDs were in the correct form and found no discrepancies.
To resolve the 11 IDs that were missing a gene tag in the XML file, we searched for each ID on the UniProt website and found that they were part of the "STRING" protein-protein interaction database. This meant that we could safely ignore these IDs in our database.
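The effect of the regex change can be sketched in Python. This is a rough illustration under stated assumptions, not the actual XMLPipeDB Match command or SQL query (those are in our Gene Database Testing Report): the three patterns, the helper `unique_ids`, and the XML-like sample text are all made up for the example and do not reproduce the real UniProt schema.

```python
import re

# Assumed patterns: the narrow one misses IDs with the extra 'A',
# the broad one accepts an optional 'A' after the underscore.
NARROW = re.compile(r"\bSO_\d{4}\b")
BROAD = re.compile(r"\bSO_A?\d{4}\b")
# Hypothetical pattern that only accepts IDs inside a gene name tag.
IN_GENE_TAG = re.compile(r'<name type="ordered locus">(SO_A?\d{4})</name>')


def unique_ids(text, pattern):
    """Return the set of distinct IDs that the pattern finds in the text."""
    return set(pattern.findall(text))


# Made-up, simplified XML-like text: two IDs inside gene tags and one ID
# that appears only in a cross-reference to another database.
sample = (
    '<gene><name type="ordered locus">SO_0001</name></gene>\n'
    '<gene><name type="ordered locus">SO_A0002</name></gene>\n'
    '<dbReference type="STRING" id="SO_0003"/>\n'
)

narrow = unique_ids(sample, NARROW)
broad = unique_ids(sample, BROAD)
tagged = unique_ids(sample, IN_GENE_TAG)

print(len(narrow))             # IDs found before accounting for 'A'
print(len(broad))              # IDs found after accounting for 'A'
print(sorted(broad - tagged))  # IDs matched only outside gene tags
```

The last set difference corresponds to the kind of discrepancy we saw between the Match count and the TallyEngine count: a pattern applied to the whole file also matches IDs that appear only in cross-references.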
- Provide references or links to artifacts of your work, such as:
- Wiki pages
- Heavy Metal HaterZ Deliverables
- Gene Database Testing Report - Heavy Metal HaterZ
- Other files or documents
- All relevant files can be found on our deliverables page.
- Code or scripts
- The match command and SQL query can be found in our Gene Database Testing Report.
Assessment of Project
- Give an objective assessment of the success of your project workflow and teamwork.
- I liked the way our team worked together on this project, mostly because we got along well. It was easy to ask for help, and we communicated effectively, which made the entire process much smoother. We had similar ways of working, and it was fun to go through this project with them.
- What worked and what didn't work?
- Overall, it was a solid project; we didn't run into too many issues, and when we did, they were easily overcome. I would say two of the biggest bumps were the "missing" 11 IDs that were found by XMLPipeDB Match and the "extra" ~1400 IDs that were present in the microarray data but not in UniProt. These issues were overcome by consulting our professors and doing some extra research.
- What would you do differently if you could do it all over again?
- I would probably want to work with my group in person more often than we did, since I found that we were much more effective and productive when all four of us were together working on a part of the project.
- Evaluate the Gene Database Project and Group Report in the following areas:
- Content: What is the quality of the work?
- I would say the level of quality is above average, because we spent a good amount of time making sure that there were no errors in our project and that, if any remained, we could explain why they existed. Our group report reflects the time and effort we poured into our gene database project, and so I would say it is quality work.
- Organization: Comment on the organization of the project and of your group's wiki pages.
- From the very beginning, we made sure to have a place on our wiki where all of our files would be consolidated. This made working in separate areas a breeze, and after we had made some progress, we were able to put our final files into our deliverables page. The wiki template we made for our team is organized and easy to navigate, and we used it on each of our auxiliary pages.
- Completeness: Did your team achieve all of the project objectives? Why or why not?
- Yes. Completing the project objectives was not a particularly difficult task because our team worked well together. We all knew what roles we had, so there was no problem in getting the tasks finished.
Reflection on the Process
- What did you learn?
- With your head (biological or computer science principles)
- I learned much more about the inner workings of database management and queries. I also learned quite a bit about the importance of microarray analyses and how they can be used to discover more about an organism like Shewanella oneidensis.
- With your heart (personal qualities and teamwork qualities that make things work or not work)?
- I learned that a team that gets along well will be much more effective, because give and take comes more easily when everyone is comfortable discussing problems and helping one another in a work environment. I also found that giving each team member a specific role makes splitting up the work much easier, since everyone knows his or her role and responsibility.
- With your hands (technical skills)?
- As I mentioned earlier, I learned about SQL and how powerful databases can be when you know what you are doing. I also learned about XML files and how data can be efficiently stored and accessed using that type of formatting.
- What lesson will you take away from this project that you will still use a year from now?
- I will be taking a databases class next year, so I will definitely be using the knowledge I gained in this class regarding database management for that purpose. I will also use my introductory knowledge of wiki editing and formatting to possibly contribute to wikis in the future.
Individual Journal Entries
- Week 2
- Week 3
- Week 4
- Week 5
- Week 6
- Week 7
- Week 8
- Week 9
- Week 10
- Week 11
- Week 12
- Week 13
- Week 14
- Week 15