Rlegaspi Week 6
Contents
The FDA Drug Database
The United States Food and Drug Administration (FDA) provides, as a matter of public record, the full data set for its approved drugs at this website: http://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm
The site includes both a link to the downloadable files (compressed in .zip format) and the schema (“entity relationship diagram”) for those files.
Using these files, what you have learned about sed
and SQL thus far, and additional information found in this wiki and on the aforementioned FDA website, do the following:
- Download and uncompress the files.
- Define appropriate tables for the Application and Product entities.
- Process the data files for these entities then load them into those tables.
- Answer the questions below.
Downloading and Unzipping FDA Drug Information File
- Like any other assignment using Putty, I logged through the Putty application on one of the SEA 120 computers; therefore, I did not need to use the command
ssh rlegaspi@my.cs.lmu.edu<\/code> and type in my password.
- I bypassed the website and downloaded the file directly to my.cs.lmu.edu using the following command provided on the Week 6 Assignment Page:
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
- I unzipped the file, using this command (also provided on the Week 6 Assignment Page):
unzip UCM054599.zip
- Upon successful unzipping, you should see the following files:
$ ls AppDoc.txt application.txt DocType_lookup.txt Product.txt ReviewClass_Lookup.txt AppDocType_Lookup.txt ChemTypeLookup.txt Product_tecode.txt RegActionDate.txt UCM054599.zip
- As stated in the tasks above, you will only need to work with the application.txt and Product.txt files.
Part of the assignment, of course, is to determine how to rework these files so that their data can be loaded into PostgreSQL via the pgAdmin III desktop application on the Seaver 120 workstations. Just as with the movie_titles.txt file, you do this by “redirecting” the processed data to your ~/public_html folder:
cat application.txt | your command sequence here > ~/public_html/application.sql.txt cat Product.txt | your command sequence here > ~/public_html/Product.sql.txt
You can then download these files to the Seaver 120 desktop environment by visiting these sites with a web browser:
- http://my.cs.lmu.edu/~username/application.sql.txt
- http://my.cs.lmu.edu/~username/Product.sql.txt
Again, note that these latter steps exactly mirror the tutorial for the movie_titles.txt file, so please refer to that wiki page for additional details.
Supplementary Information
Just as with Week 4, real-world data have their share of variations and exceptions. For these files:
- To visualize the raw data more easily, you can load the files into Microsoft Excel. But do not use Excel to process them into SQL statements.
- A <code>bit column can either be a
boolean
orint
data type in PostgreSQL—examine the data to see what values are in there in order to make the right choice. - If you see “nulls” in the website schema, that simply means that the value can be empty.
- For our purposes, you don’t need to define a primary key for the Product table. It does, however, have a foreign key.
- The first lines of all of these files contain the column names. You don’t need them because the SQL
insert
statement, as shown in class and in the PostgreSQL Tutorial, specifies these already. - Instead of commas as in the movie_titles.txt file, the columns in these files are separated by tabs. You can indicate a tab in
sed
using the characters\t
. - Due to the way these files are formatted, the “end of the line” should be designated using the pattern
\r$
and not the dollar sign by itself. - PostgreSQL can handle numbers and booleans with or without single quotes, so both
5
and'5'
are valid. In addition, boolean (true/false) values are case-insensitive, so'True'
and'False'
will be acceptable to the system.
Questions to Answer
- Provide the DDL (
create table
) statements that you used for your application and product tables. - Provide the
sed
command sequences that you used to convert the raw text files into sequences of SQLinsert
statements. - Using the command line, how can you determine the number of records in each file? Provide the command.
- Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL
select
statement. - In your database, are these numbers the same or different? Explain why you think so.
- For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
- What are the names of the drug products that are administered in the form
INJECTABLE;INTRAVENOUS, SUBCUTANEOUS
? - What are the names of the drug products whose active ingredient (activeingred) is
ATROPINE
? - In what forms and dosages can the drug product named
BENADRYL
be administered? - Which drug products have a name ending in
ESTROL
? - Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
- Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
- What are the names of the drug products that are sponsored (sponsor applicant column) by
MERCK
? - Which sponsor applicant companies have the text
LABS
in their names and have products whose active ingredients (activeingred) include bothASPIRIN
andCAFFEINE
?
- Store your journal entry in the shared Class Journal Week 6 page. If this page does not exist yet, go ahead and create it (congratulations on getting in first :) )
- Link to your journal entry from your user page.
- Link back from the journal entry to your user page.
- NOTE: you can easily fulfill the links part of these instructions by adding them to your template and using the template on your user page.
- Sign your portion of the journal with the standard wiki signature shortcut (
~~~~
). - Add the "Journal Entry" and "Shared" categories to the end of the wiki page (if someone has not already done so).
Read
- This week we return to “What is Code?”, now looking at these sections within Section 5, “The Time You Attended the E-mail Address Validation Meeting:”
- 5.1 What is the Relationship Between Code and Data?
- 5.2 Where Does Data Live?
- 5.3 The Language of White Collars
(feel free to read the other parts of Section 5; we focus on these three because these relate most directly to this assignment, with the third looking to the future...of this semester)
Reflect
- Based on what you have seen of the FDA database, do you feel that you have a better understanding of how the data mentioned in section 5.1—Spotify music, Fitbit exercise tracking, Twitter tweets, IRS tax returns, etc.—might look when stored on a computer?
- Section 5.2 half-jokingly says that, by building a bookstore, you actually built the death of bookstores. You just built a drug database—have you actually built the death of pharmacies? What do you think of this analogy?
- Section 5.3 says that the Java language can “talk to a database.” But you just “talked” to a database in this assignment. Why do you think you would need a programming language to do the talking?