Difference between revisions of "Week 6"

From LMU BioDB 2013
Jump to: navigation, search
(Initial transcription of this page.)
 
(Buddy System: Fix some wiki markup.)
Line 10: Line 10:
  
 
This week’s pairings are:
 
This week’s pairings are:
* 'To be determined'
+
* ''To be determined''
  
 
=== Movies and Ratings ===
 
=== Movies and Ratings ===

Revision as of 06:26, 2 July 2013

This journal entry is due on Friday, October 4, at midnight PDT. (Thursday night/Friday morning)

Contents

Individual Journal Assignment

  • Store this journal entry as "username Week 6" (i.e., this is the text to place between the square brackets when you link to this page).
  • Link from your user page to this Assignment page.
  • Link to your journal entry from your user page.
  • Link back from your journal entry to your user page.
  • Don't forget to add the "Journal Entry" category to the end of your wiki page.
    • Note: you can easily fulfill all of these links by adding them to your template and then using your template on your journal entry.

Buddy System

For this exercise, you have been assigned to a “journal buddy;” you should work on the exercises together.

  • Plan at least one “live” work session, where you and your journal buddy can interactively talk, think, and work through the exercises.
  • While you may work with your buddy to figure out solutions to the exercises, you should still answer the reflection questions and write up your journal pages individually. In particular, any explanations, issues, or “think-aloud” comments should be in your own words.

This week’s pairings are:

  • To be determined

Movies and Ratings

Connect to the kerfuffle database within the Keck lab infrastructure and answer the following questions. Provide both the answer and the SQL query used to determine that answer:

  1. How many movies were released before 1915?
  2. How many movies were released in the 1940s (years 1940 to 1949)? (don’t answer this question by listing the movies then counting the results!)
  3. What movies from the 1980s had 4-digit numbers in their titles?
  4. How many of each rating did the movie Citizen Kane get?
  5. How many of each rating did the movie Plan 9 from Outer Space get?
  6. How many movies has member number 42 rated as 1, 2, 3, 4, and 5, respectively?
  7. Which movies, in alphabetical order, did member number 42 rate as a “5”?
  8. Which movies with “Matter” in the title got a “1” rating, and how many “1”s did they each get?
    Be prepared for lengthy query times for these last two questions — there’s a lot of information to sort through!
  9. Which movie released in 1940 got the most “5” ratings, and how many did it get?
  10. Which movie in the entire database has gotten the most “1” ratings? “5” ratings?

Back to sed and Biology

Now, let’s apply all of this to biological data. Integr8 is a query interface for the UniProt biological database. Available from this database is a download for GO (Gene Ontology) annotations, or GOA for short, whose format is documented here.

The GOA file for V. cholerae has been downloaded to ~xmlpipedb/data for you (its exact filename is 46.V_cholerae_ATCC_39315.goa). Using the documentation in http://www.geneontology.org/GO.format.annotation.shtml, do the following:

  1. Specify the create table command that defines a table equivalent for GOA data, and create such a table in Kerfuffle. Call the table yourKeckUsername_goa. Some notes:
    • Define all fields as varchar
    • Include all fields, including those described as “optional”
    • Make all field/column names completely lowercase
    • When a field name has a space or a colon, substitute them with an underscore (_) (e.g., GO ID becomes go_id; DB:Reference becomes db_reference)
    • Treat the multivalued fields (i.e., the ones separated by | within the text) as a single value
    • Don’t worry about defining a primary key
  2. What series of sed commands will transform the 46.V_cholerae_ATCC_39315.goa file into a sequence of insert commands that successfully load the contents of that file into the goa table in your private database? Some notes:
    • Use \t in your sed commands to represent the tab characters that separate the fields of the GOA file.
    • When a field is described as “optional,” that simply means there may be no value for it. However, there will still be tab characters before and after it.
    • Since a single GOA records has 15 fields, there will be a lot of typing involved. However, this is solely due to the number of fields; conceptually, this exercise is the same as creating a table with just two fields.

An “Xploratory” Question

Finally, think about and answer this question (no actual commands involved, for now): Importing the .goa file (and some others) is fairly natural, since it is already tabular by nature. Think about and describe an approach that you might take if asked to import an XML file, which, we have seen, is structured more like an outline, into a relational database. Write your answer in terms of how the relational tables would be defined, what fields would comprise primary and foreign keys, and what might need to be done to the information in the XML file.

Shared Journal Assignment

  • Store your journal entry in the shared Class Journal Week 6 page. If this page does not exist yet, go ahead and create it (congratulations on getting in first :) )
  • Link to your journal entry from your user page.
  • Link back from the journal entry to your user page.
    • NOTE: you can easily fulfill the links part of these instructions by adding them to your template and using the template on your user page.
  • Sign your portion of the journal with the standard wiki signature shortcut (~~~~).
  • Add the "Journal Entry" and "Shared" categories to the end of the wiki page (if someone has not already done so).

Reflect

After completing the both exercises, answer the following questions on the shared Class Journal Week 6 page:

  1. Has your view of digital data changed since the beginning of the semester? Why or why not?
  2. How confident do you feel about exploring and manipulating data, in various forms, after the past few weeks’ assignments?
  3. Consider movie query #7: if you looked at all of the movies that a particular member has rated from 1 to 5, do you think you can get a general idea of that member’s likes and dislikes? Do you see any issues with having this kind of information available to companies?
  4. Do you think it would have been easier to answer the movie questions using the plain text files in the ~xmlpipedb/data folder coupled with grep and wc? Why or why not?
Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox