Lenaolufson Week 6
From LMU BioDB 2015
Contents
The FDA Drug Database
Direct Download/Unzipping Commands
- First, I logged on with my username and password and then entered the class folder and created a directory for this assignment, and then entered the directory:
ssh eolufson@my.cs.lmu cdbiodb mdir week6 cd week6
- Next, I downloaded the file from the website using:
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
- Then I unzipped the file (using my class resources):
unzip UCM054599.zip
- I then moved the product.txt and the application.txt files into my public folder, but first created a new folder for them that would help me stay organized:
cd ~/public_html mkdir biodb mv application.txt ~/public_html/biodb mv product.txt ~/public_html/biodb
- Next I was able to determine the variable for the tables by looking at the top columns:
- application.txt contained 9 variables: ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, and Orphan_Code; and the data types are as follows (respectively): int primary key, varchar, varchar, boolean, boolean, varchar, int, varchar, and varchar.
- product.txt contained 9 variables: ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred; and the data types are as follows (respectively): int, int, varchar, varchar, varchar, varchar, int, varchar, and varchar.
Creating the Database
- Using pgAdmin as well as the SQL editor, I was able to create the tables needed for the data:
create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char) create table product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
- Now that my tables were set up, I needed to add in the appropriate data and I used sed command to do this, but I first realized that the beginning of the application.txt is not needed so I removed the first line. So, my command was this:
cat application.txt | sed "1D"
- After being stuck in a state of confusion as to what my next move was because the screen I was viewing had a lot of what appeared to be unnecessary spaces in between randomly. After talking it over with a classmate, I was able to see that some of these spaces were used with the data. So, I took out the spaces that were next to another space as those were not needed using:
cat application.txt | sed "1D" | sed "s/ / /g"
- I then needed to insert a null value into the tabs that were next to each other by:
cat application.txt | sed "1D" | sed "s/ / /g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g"
- Upon doing this, I soon realized that some of my rows lacked the 9th column they needed. I corrected this by doing a similar command, but this time using the carriage return. From asking a classmate as well as the discussion in class, I learned that it is a character that is basically invisible at the end of every line and it is recorded whenever the enter key is hit. The command was:
cat application.txt | sed "1D" | sed "s/ / /g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g"
- My next step was to replace the tabs with commas and insert apostrophes for the varchars using the long command line of:
cat application.txt | sed "1D" | sed "s/ //g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/g" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/\r/);\r/g" > ~/public_html/application.sql.txt
- I was able to then use the information I had learned from creating the application.txt to create the product.txt command line. It became apparent that the first line of the product.txt was useless as well, just like the application.txt, so I removed it. The final command I came up with for the product.txt is:
cat Product.txt | sed "1D" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\t/\tnull\t/g" | sed "s/\t\r/\tnull\r/g" | sed "s/'//g" | sed "s/\t/,/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\t/',/1" | sed "s/\t/,'/1"| sed "s/\t/',/1" | sed "s/\t/,'/1" | sed "s/\t/','/1" | sed "s/\r/'\r/g" | sed "s/'null'/null/g" | sed "s/^/insert into product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values(/g" | sed "s/\r/);\r/g" > ~/public_html/product.sql.txt
Questions to Answer
- Provide the DDL (create table) statements that you used for your application and product tables.
application.txt:
create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelFlag boolean, CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential varchar, OrphanCode varchar);
product.txt:
create table product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
- Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
product.txt:
cat Product.txt | sed "s/'/\"/g" | sed "s/\t/','/g"| sed "s/\r$/');/g" | sed "s//null/g" | sed "s/'//1" | sed "1d" | sed "s/^/insert into product (ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode, ReferenceDrug,drugname,activeingred) values (/g" > ~/public_html/Product.sql.txt
application.txt:
cat application.txt | sed "s/\t/','/g" | sed "s/ *'/'/g" | sed "s/\r$/');/g" | sed "s//null/g" | sed "s/'//1" | sed "1d" | sed "s/^/insert into application (ApplNo,ApplType,SponsorApplicant,MostRecentLabelFlag,CurrentPatentFlag,ActionType, ChemicalType,Ther_Potential,OrphanCode) values (/g" > ~/public_html/application.sql.txt
- Using the command line, how can you determine the number of records in each file? Provide the command.
The command wc
tells us the number of lines, the word count, and byte count for every file. For this assignment, one line is used for the column headers so the number of records is one less than the number of lines listed. Thus, application.txt has 19,746 records and product.txt has 32,770 records. Commands are:
cat application.txt | wc cat product.txt | wc