Difference between revisions of "Week 5"
Kdahlquist (Talk | contribs) (note on grading for assignment) |
Kdahlquist (Talk | contribs) (→Reflect: pasted in notes from BioQuest 2011 workshop KD attended) |
<!--
Databases and Data Formats
Understand how to query relational databases, and be familiar with data types and formats for the discipline.

• In Phase 2 of the course, students learn how to query the PostgreSQL relational database from the command line, practicing with the Netflix movie database (fields, records, keys, select queries)
• In Phase 3 of the course, students use XMLPipeDB to load data into a PostgreSQL database and then export it to an MS Access formatted database. They use queries of the PostgreSQL database for quality assurance.
• Data types: UniProt XML, GOA (tab-delimited text), GO XML, microarray data (numerical, tab-delimited text)
• Formats: .txt, .xml, .zip, .doc, .pdf, .jpg, .gdb, .gex, .mapp, .xls, .exe, .jar (maybe more)

Discovery and Acquisition of Data
Locate and utilize disciplinary data repositories, and identify appropriate data sources
• Phase 2: do journal club and database exploration exercise based on a biological database from the Nucleic Acids Research annual database issue from the previous January.
• Phase 3: microarray databases (ArrayExpress, GEO, Stanford Microarray Database), Integr8, UniProt, GO

Data Management and Organization
Understand the lifecycle of data, and use data management plans to track subsets of processed data.
• Phase 3, need to track versioning of original data sources for database (UniProt XML, GO, GOA), need to track versioning of processed microarray data. Groups are asked to come up with a file naming convention and create a page on their group wiki for the files to be stored and linked to.

Data Conversion and Interoperability
Migrate data from one format to another, and understand the benefits of standard data formats.
• This is what XMLPipeDB is for!

Quality Assurance
Use metadata and screening procedures to recognize artifacts, incompletion, or corruption of data sets.
• Phase 3: One team member is designated the quality assurance officer and is responsible for showing that all of the data from the input files was found in the PostgreSQL intermediate database and then in the GenMAPP Gene Database. S/he uses XMLPipeDB match on the command line for the XML data, queries of PostgreSQL database, and visual inspection of MS Access-formatted .gdb. The QA person also has to compare the data to an outside resource (not UniProt, usually the model organism database) and determine what was in this other resource that was not in UniProt and vice versa. They can also compare to what was on the microarray.

Metadata
Interpret metadata from external sources, and annotate data so it can be used by external users.
• Phase 2: when students do the NAR exercise, they might be evaluating some of this, but could be more explicit.
• We don’t do annotation, but we talk about it; there is probably some way we could fit it in.

Data Curation and Re-use
Recognize the role of curation throughout the data lifecycle in its value in effective reuse of data.
• Phase 2: when students do the NAR exercise, they evaluate this for the database they report on
• Phase 3: inevitably, we encounter problems when trying to use the microarray data, we could foreground this there

Cultures of Practice
Know the practices, values, and norms of discipline as they relate to managing, sharing, and curating data.
• Phase 2: introduced when introduced to the NAR exercise
• Could be made more explicit than just telling them about it.

Data Preservation
Understand the technology, resource, and organizational components of preserving data.
• Introduced a little with the NAR exercise, could be made more explicit

Data Analysis
Understand the basic analysis tools of their discipline including workflow management tools.
• It would be great to add some workflow management tools, right now it is in just a simple diagram for the flow of the overall project (see below). It is badly needed for the microarray analysis portion.

Data Visualization
Use visualization tools of discipline, and understand the advantages of the different types of visualization.
• GenMAPP and MAPPFinder are used to visualize the microarray data. Could possibly show some other methods with the microarray data, but things are really jam-packed already.

Ethics, including citation of data
Understand intellectual property, privacy, and the ethos of the discipline around sharing and citing data.
• Phase 2/3: They are introduced to open source licensing and we ask them to evaluate this in the NAR exercise. We could really use a case study here like the one I developed for the Hwang stem cell fraud case.
-->
Revision as of 17:19, 17 September 2013
Under Construction
The content in this page has not been finalized and is still subject to change. Use the current information at your own risk.
The individual journal entry (UniProt exercise) is due on Friday, September 27, at midnight PDT (Thursday night/Friday morning).
The shared journal entry, database wiki page, and PowerPoint slides for your presentation are due on Tuesday, October 1, at midnight PDT (Monday night/Tuesday morning).
A note on the grading for this assignment:
- The individual and shared journal entries are worth a total of 10 points. Students will be graded on an individual basis for this portion of the assignment.
- The database wiki page and presentation are worth a total of 20 points; each member of the group will receive the same grade for this portion of the assignment.
Individual Journal Assignment
- Store this journal entry as "username Week 5" (i.e., this is the text to place between the square brackets when you link to this page).
- Link from your user page to this Assignment page.
- Link to your journal entry from your user page.
- Link back from your journal entry to your user page.
- Don't forget to add the "Journal Entry" category to the end of your wiki page.
- Note: you can easily fulfill all of these links by adding them to your template and then using your template on your journal entry.
UniProt Exercise
For this exercise, you will read and follow the links in Chapter 4: Using Protein and Specialized Sequence Databases of the book Bioinformatics for Dummies.
- Since the publication of this book in 2003, the SWISS-PROT database has become the UniProt Knowledgebase. The underlying data are the same, but the scope and user interface for the database have been updated. Thus, some of the exact instructions of the chapter have to be changed to reflect the change to UniProt. These changes are noted below by page number.
- Page 123:
- The URL for the SWISS-PROT/UniProt server is http://www.expasy.org/sprot/.
- The Quick Search field is now found at the upper right of the page.
- Choose "UniProtKB" from the drop-down menu (it is the default), and click the "GO" button.
- Alternatively, you can go directly to http://www.uniprot.org, where the search field is at the top middle of the page.
- The information described in subsequent pages can all be found, but will be in a different order on the page. There is a set of navigation links near the top of the page to help you jump to each section.
- General information about the entry (bottom of page 123):
- This information is found under the header "Entry information" and is near the bottom of the web page, instead of the top.
- Name and origin of the protein (page 124) is near the top of the page.
- The References (page 126) are near the middle of the page.
- The Comments section (page 126) is now known as "General annotation (comments)".
- The Cross-References (page 128) are even more extensive and are organized by sub-categories of databases.
- In particular, click on a sample cross-reference link for each of the following databases, and for each, state what type of information is found there:
- EMBL
- InterPro
- PDB
- Pfam
- RefSeq
- GeneID
- The Keywords (page 130) are now found listed under "Ontologies".
- The Features (page 131) are now listed as "Sequence annotation (Features)".
- In the section "Finding Out More about Your Protein" (pages 135-139), some of the databases are now defunct, which highlights how biological databases are a moving target (this book was first published in 2003).
- A new feature of the UniProt interface is that you can view the data in several different formats. Click on the buttons on the top-right of the page to view the data as:
- TXT: flat file text data, the original format of the SWISS-PROT data (even before it was put in a relational database)
- XML: text data structured with tags (like the XML you practiced with in last week's assignment)
- RDF/XML: a semantic web format
- GFF: a specialized format for genomic information
- FASTA: a basic text format for sequence information
- Write a one-paragraph summary of what you have learned about the human EGFR protein from this exercise.
- Reflect and answer the following questions:
- What was the purpose of this exercise?
- What did I learn from this exercise?
- What did I not understand (yet) about this exercise?
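To make the FASTA format from the list of download formats above more concrete, here is a minimal Python sketch of how a FASTA file is structured and read: a header line starting with ">" followed by one or more lines of sequence. The record below is made up for illustration (the accession X00000, the entry name TOY_EXAMPLE, and the sequence are hypothetical and are not real UniProt data), and the function name parse_fasta is our own choice, not part of any UniProt tool.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped in the file; collect them for joining.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record for illustration only; this is NOT real UniProt data.
TOY_FASTA = """\
>sp|X00000|TOY_EXAMPLE Hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
"""

records = list(parse_fasta(TOY_FASTA))
for header, sequence in records:
    print(header)
    print(len(sequence), "residues:", sequence)
```

To try this on real data, you could save the FASTA view of the EGFR entry to a file and pass the file's contents to parse_fasta instead of TOY_FASTA.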
Additional UniProt Resources
- UniProt NAR Database Issue 2010 article
- YouTube video tutorial about UniProt (8:17 minutes)
- EBI guide to interpreting a UniProt record
- UniProt demo from UniProt itself
NAR Exercise and Presentation
Each year, the journal Nucleic Acids Research (NAR) devotes the first issue in January to biological databases. The goal of this assignment is to dive into the deep end of the pool and experience the breadth and depth of biological databases available on the Web:
- Read (if you haven't already done so): Introduction to NAR Database Issue
- Choose your database:
- Nucleic Acids Research Database Issue Table of Contents 2013
- Nucleic Acids Research Database Issue Database List
- Note: make sure that the database you choose has a corresponding paper in the 2013 issue.
For this exercise, you will work with an assigned buddy. Choose a database from this issue and answer the following questions about that database. Each pair should choose a different database to profile. So, to claim your first choice, go to the Class Journal Week 5 page and stake your claim to a database. When you are choosing your database, look at the other students' entries to make sure you are not doing the same one. The buddy assignments are:
- To be determined
Database Wiki Page
For your assignment, create a new wiki page to profile your database. There will be one page per group; both partners will contribute to the same page.
- Link to your database page from the Class Journal Week 5 page. These pages will be a resource for the class as we move forward with this unit of the course.
- Link to your database page from your user page.
- Link from your database page to the Class Journal Week 5 page.
- Link from your database page to your user pages.
Read the article about the database from the Nucleic Acids Research journal and then go online to the database itself. When you answer the questions below, provide a hyperlink to the page that you got the information from.
- What database did you access? (link to the home page of the database)
- What is the purpose of the database?
- What biological information does it contain?
- What species are covered in the database?
- What biological questions can it be used to answer?
- What type (or types) of database is it (sequence, structure, model organism, or specialty [what?]; primary or “meta”; curated electronically, manually [in-house], manually [community])?
- What individual or organization maintains the database?
- What are their funding sources?
- Is there a license agreement or any restrictions on access to the database?
- How often is the database updated? When was the last update?
- Are there links to other databases?
- Can the information be downloaded?
- In what file formats?
- Evaluate the “user-friendliness” of the database.
- Is the Web site well-organized?
- Does it have a help section or tutorial?
- Run a sample query. Do the results make sense?
Some Definitions
- Electronic curation occurs when someone writes a program to add information to a database record from another database.
- Manual curation occurs when a human reviews the information being added to a record to validate it as true.
- In-house is when the human works for the database organization.
- Community is when the database allows members of the scientific community who do not work for the database organization to add information to the record.
PowerPoint Presentation
Each group will prepare and give a 10-15 minute PowerPoint presentation based on their chosen database.
- Four groups will present on Tuesday, 10/1, and three groups will present on Thursday, 10/3. The order of presentations will be determined in class on Thursday, 9/26.
- Please follow the Presentation Guidelines for how to format your slides.
- You will need to prepare ~10-15 slides (assume 1 slide per minute of presentation).
- You need to present the information you gathered about your database that you listed in your wiki above, but organized as a presentation.
- You may give a live demo of the database if you wish, but practice carefully so that you can do the presentation in 15 minutes.
- Alternatively, you may choose to show screenshots instead of the live demo.
- Your PowerPoint slides must be uploaded to the wiki page you created for your database, by midnight Monday/Tuesday, even if your group is scheduled to present on Thursday.
- You can update your slides before your presentation, but we will be grading the ones you upload by the deadline.
- Your presentation (both the slides and the oral presentation) will be evaluated by the instructors using the guidelines shown here.
- Your presentation will also be evaluated by your fellow classmates (anonymously) who will answer the following questions:
- What is the speaker's take-home message (one short sentence)?
- What are the best points about the presentation's content, organization, clarity of visuals, and presentation style? Please give at least 2 specific examples.
- What points need improvement? How would you improve them? Please give at least 2 specific examples.
- Store your journal entry in the shared Class Journal Week 5 page. If this page does not exist yet, go ahead and create it (congratulations on getting in first :) )
- Link to your journal entry from your user page.
- Link back from the journal entry to your user page.
- NOTE: you can easily fulfill the links part of these instructions by adding them to your template and using the template on your user page.
- Sign your portion of the journal with the standard wiki signature shortcut (~~~~).
- Add the "Journal Entry" and "Shared" categories to the end of the wiki page (if someone has not already done so).
Reflect
After completing both exercises, answer the following questions on the shared Class Journal Week 5 page:
- What was the most beneficial aspect of working with a buddy on this assignment (other than what you answered last week)?
- What was the most challenging aspect of working with a buddy on this assignment (other than what you answered last week)?
- What was most interesting to you in this week's exercise (SWISS-PROT/UniProt or NAR)? Why?
- What was least interesting? Why?