Difference between revisions of "Week 6"

From LMU BioDB 2013
Jump to: navigation, search
(Movies and Ratings: Add reference to insertion instructions.)
(Movies from Text File to Tables: Adjust questions to use just the movie title table.)
Line 16: Line 16:
 
=== Movies from Text File to Tables ===
 
=== Movies from Text File to Tables ===
  
Follow the instructions in the [[PostgreSQL Tutorial]] to load up your own copy of the sample movie database.  To get "warmed up," complete the tutorial so that you get some query practice before working on the questions below.
+
Follow the instructions in the [[PostgreSQL Tutorial]] to load up your own copy of the sample movie table.  To get "warmed up," complete the tutorial so that you get some query practice before working on the questions below.
  
 
Provide '''both''' the answer and the SQL query used to determine that answer:
 
Provide '''both''' the answer and the SQL query used to determine that answer:
Line 22: Line 22:
 
# How many movies were released in the 1940s (years 1940 to 1949)?  (''don’t'' answer this question by listing the movies then counting the results!)
 
# How many movies were released in the 1940s (years 1940 to 1949)?  (''don’t'' answer this question by listing the movies then counting the results!)
 
# What movies from the 1980s had 4-digit numbers in their titles?
 
# What movies from the 1980s had 4-digit numbers in their titles?
 +
# Form two of your own queries involving movie titles and years, then
 +
#* State your query in plain English (like the queries above).
 +
#* Provide the SQL query "translation" of your plain English query.
 +
#* Copy/paste the answer to your query, according to the sample movie table.
 
<!--
 
<!--
 
# How many of each rating did the movie ''Citizen Kane'' get?
 
# How many of each rating did the movie ''Citizen Kane'' get?

Revision as of 23:39, 24 September 2013

Under Construction

The content in this page has not been finalized and is still subject to change. Use the current information at your own risk.

This journal entry is due on Friday, October 4, at midnight PDT. (Thursday night/Friday morning)

Individual Journal Assignment

  • Store this journal entry as "username Week 6" (i.e., this is the text to place between the square brackets when you link to this page).
  • Link from your user page to this Assignment page.
  • Link to your journal entry from your user page.
  • Link back from your journal entry to your user page.
  • Don't forget to add the "Journal Entry" category to the end of your wiki page.
    • Note: you can easily fulfill all of these links by adding them to your template and then using your template on your journal entry.

Buddy System

For this exercise, you have been assigned to a “journal buddy;” you should work on the exercises together.

  • Plan at least one “live” work session, where you and your journal buddy can interactively talk, think, and work through the exercises.
  • While you may work with your buddy to figure out solutions to the exercises, you should still answer the reflection questions and write up your journal pages individually. In particular, any explanations, issues, or “think-aloud” comments should be in your own words.

This week’s pairings are:

  • To be determined

Movies from Text File to Tables

Follow the instructions in the PostgreSQL Tutorial to load up your own copy of the sample movie table. To get "warmed up," complete the tutorial so that you get some query practice before working on the questions below.

Provide both the answer and the SQL query used to determine that answer:

  1. How many movies were released before 1915?
  2. How many movies were released in the 1940s (years 1940 to 1949)? (don’t answer this question by listing the movies then counting the results!)
  3. What movies from the 1980s had 4-digit numbers in their titles?
  4. Form two of your own queries involving movie titles and years, then
    • State your query in plain English (like the queries above).
    • Provide the SQL query "translation" of your plain English query.
    • Copy/paste the answer to your query, according to the sample movie table.

Back to sed and Biology

Now, let’s apply all of this to biological data. Integr8 is a query interface for the UniProt biological database. Available from this database is a download for GO (Gene Ontology) annotations, or GOA for short, whose format is documented here.

The GOA file for V. cholerae has been downloaded to ~xmlpipedb/data for you (its exact filename is 46.V_cholerae_ATCC_39315.goa). Using the documentation in http://www.geneontology.org/GO.format.annotation.shtml, do the following:

  1. Specify the create table command that defines a table equivalent for GOA data, and create such a table in Kerfuffle. Call the table yourKeckUsername_goa. Some notes:
    • Define all fields as varchar
    • Include all fields, including those described as “optional”
    • Make all field/column names completely lowercase
    • When a field name has a space or a colon, substitute them with an underscore (_) (e.g., GO ID becomes go_id; DB:Reference becomes db_reference)
    • Treat the multivalued fields (i.e., the ones separated by | within the text) as a single value
    • Don’t worry about defining a primary key
  2. What series of sed commands will transform the 46.V_cholerae_ATCC_39315.goa file into a sequence of insert commands that successfully load the contents of that file into the goa table in your private database? Some notes:
    • Use \t in your sed commands to represent the tab characters that separate the fields of the GOA file.
    • When a field is described as “optional,” that simply means there may be no value for it. However, there will still be tab characters before and after it.
    • Since a single GOA records has 15 fields, there will be a lot of typing involved. However, this is solely due to the number of fields; conceptually, this exercise is the same as creating a table with just two fields.

An “Xploratory” Question

Finally, think about and answer this question (no actual commands involved, for now): Importing the .goa file (and some others) is fairly natural, since it is already tabular by nature. Think about and describe an approach that you might take if asked to import an XML file, which, we have seen, is structured more like an outline, into a relational database. Write your answer in terms of how the relational tables would be defined, what fields would comprise primary and foreign keys, and what might need to be done to the information in the XML file.

Shared Journal Assignment

  • Store your journal entry in the shared Class Journal Week 6 page. If this page does not exist yet, go ahead and create it (congratulations on getting in first :) )
  • Link to your journal entry from your user page.
  • Link back from the journal entry to your user page.
    • NOTE: you can easily fulfill the links part of these instructions by adding them to your template and using the template on your user page.
  • Sign your portion of the journal with the standard wiki signature shortcut (~~~~).
  • Add the "Journal Entry" and "Shared" categories to the end of the wiki page (if someone has not already done so).

Reflect

After completing the both exercises, answer the following questions on the shared Class Journal Week 6 page:

  1. Has your view of digital data changed since the beginning of the semester? Why or why not?
  2. How confident do you feel about exploring and manipulating data, in various forms, after the past few weeks’ assignments?
  3. Consider movie query #7: if you looked at all of the movies that a particular member has rated from 1 to 5, do you think you can get a general idea of that member’s likes and dislikes? Do you see any issues with having this kind of information available to companies?
  4. Do you think it would have been easier to answer the movie questions using the plain text files in the ~xmlpipedb/data folder coupled with grep and wc? Why or why not?
Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox