Bklein7 Week 6
From LMU BioDB 2015
Revision as of 06:55, 12 October 2015 by Bklein7 (Talk | contribs) (Added information for creating the sed command for application.txt)
Contents
The FDA Drug Database
Acquiring Data From the Drugs@FDA ZIP File
Downloading and Viewing
- To download the zip file directly from the FDA website, I typed the following into the command line:
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
- Next, I unziped the FDA file
unzip UCM054599.zip
- The resulting output included the application.txt and Product.txt files. To view the format of these files, I used more commands. The output I received appeared to have listed columns which were not clearly defined. Therefore, I visualized the data using Notepad. This made the columns easier to read.
- In Notepad, I got a better idea of how I needed to edit the txt files to import them into a table in pgAdminIII.
Editing the txt files with sed
- application.txt
- To begin, I wanted to determine if there were any special characters in the text file such as semicolons or apostrophes that needed to be worked around. I did two searches in notepad for these characters and found that this was not the case.
- Next, I began the pipe with a cat command and then wanted to clean up the file to make it easier to read.
- Data values were separated by tab delimiters on the FDA website. Therefore, I began by researching how to manipulate tabs on the command line. To figure out how to replace the tabs with commas, I did some google searching. I found a way to tackle this problem on electrictoolbox. Using this piece of information, I wrote a sed command to replace tabs with commas. This cleaned up excess spaces from the file, although not entirely.
- In reviewing the output, I found there to be multiple instances of sequences of 2 or 4 spaces. Therefore, I wrote a sed -r command to delete these sequences.
- Finally, I noticed that some values for the SponsorApplicant column included unnecessary spaces after the listed name. Therefore, I wrote a sed command to delete these extra spaces. The output given by this version of the command sequence only had spaces in the application sponsor names as was desired.
- Now that excess spaces were deleted, I needed to clean up the syntax so that the 9 data values would be read correctly by SQL.
- The most glaring issue with the cleanup up output was the preponderance of missing data values. Doing a google search, I found here that missing values are labelled as "NULL" in the command line. Thus, I had to find a way to replace all missing values with NULL.
- In every case except for the 9th data value at the end of the line, a null value was indicated by 2 commas in a row. Therefore, I wrote a sed command to insert the word NULL in between instances of 2 commas. Because occasionally there would be two missing values in a row, this command did not entirely work. Although I image there are more compact ways to address this issue, I simply repeated the command twice to clean up any skipped values.
- To handle the null values at the end of a line, I wrote a sed command to add NULL at the end of each line that ended with a comma.
- The most glaring issue with the cleanup up output was the preponderance of missing data values. Doing a google search, I found here that missing values are labelled as "NULL" in the command line. Thus, I had to find a way to replace all missing values with NULL.
cat application.txt | sed "s/,/~/g" | sed "s/ /,/g" | sed -r "s/( ){2}//g" | sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/ ,/,/g" | sed "s/,/,'/1;s/,/','/2;s/,/',/3;s/False/'False'/g;s/,/,'/5; s/,/',/6;s/,/,'/7;s/,/',/8" | sed "s/,\r$/,NULL/g;s/V\r$/'V'/g" | sed "s/'NULL'/NULL/g" | sed "s/~/,/g" | sed "s/^/insert into application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, Orphan_Code) values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/application2.sql.txtcat application.txt
Questions to Answer
- Provide the DDL (
create table
) statements that you used for your application and product tables. - Provide the
sed
command sequences that you used to convert the raw text files into sequences of SQLinsert
statements. - Using the command line, how can you determine the number of records in each file? Provide the command.
- Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL
select
statement. - In your database, are these numbers the same or different? Explain why you think so.
- For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
- What are the names of the drug products that are administered in the form
INJECTABLE;INTRAVENOUS, SUBCUTANEOUS
? - What are the names of the drug products whose active ingredient (activeingred) is
ATROPINE
? - In what forms and dosages can the drug product named
BENADRYL
be administered? - Which drug products have a name ending in
ESTROL
? - Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
- Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
- What are the names of the drug products that are sponsored (sponsor applicant column) by
MERCK
? - Which sponsor applicant companies have the text
LABS
in their names and have products whose active ingredients (activeingred) include bothASPIRIN
andCAFFEINE
?
Links
- User Page: Brandon Klein
- Team Page: The Class Whoopers
Assignments Pages
- Week 1 Assignment
- Week 2 Assignment
- Week 3 Assignment
- Week 4 Assignment
- Week 5 Assignment
- Week 6 Assignment
- Week 7 Assignment
- Week 8 Assignment
- Week 9 Assignment
- Week 10 Assignment
- Week 11 Assignment
- Week 12 Assignment
- No Week 13 Assignment
- Week 14 Assignment
- Week 15 Assignment
Individual Journal Entries
- Week 1 Individual Journal
- Week 2 Individual Journal
- Week 3 Individual Journal
- Week 4 Individual Journal
- Week 5 Individual Journal
- Week 6 Individual Journal
- Week 7 Individual Journal
- Week 8 Individual Journal
- Week 9 Individual Journal
- Week 10 Individual Journal
- Week 11 Individual Journal
- Week 12 Individual Journal
- No Week 13 Journal
- Week 14 Individual Journal
- Week 15 Individual Journal
- Week 1 Class Journal
- Week 2 Class Journal
- Week 3 Class Journal
- Week 4 Class Journal
- Week 5 Class Journal
- Week 6 Class Journal
- Week 7 Class Journal
- Week 8 Class Journal
- Week 9 Class Journal
- Week 10 Team Journal
- Week 11 Team Journal
- Week 12 Team Journal
- No Week 13 Journal
- Week 14 Team Journal
- Week 15 Team Journal