Nanguiano Week 6
From LMU BioDB 2015
Contents
The FDA Drug Database
Direct Download/Unzipping Commands
Preparation
- First, to begin the assignment, I ssh'd into my server.
ssh nanguiano@lion.lmu.edu
- Next, I entered my folder for this class and created a directory for this week's assignment, then entered the directory.
cd biodb mkdir week6 cd week6
- In the directory, I downloaded the file from the FDA.
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
- After the download completed, I unzipped the file.
unzip UCM054599.zip
- Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.
cd ~/public_html mkdir biodb
- I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.
mv movie.sql.txt biodb cd ~/biodb/week6
- I moved the files into the prepared folder in the public_html directory.
mv application.txt ~/public_html/biodb mv Product.txt ~/public_html/biodb
Preparing the Database
- Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
- Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).
- Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).
- I began by creating the database that I would be using for this week's assignment. I opened pgAdmin on my own computer, and created a new database, called "FDADrugDB". I opened the SQL editor and typed in the following commands to create my two tables:
create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char); create table product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);
First attempt at formatting application.txt
- After creating the tables, I needed to change the structure of the txt files so that I could insert them into the tables. I knew that I needed to add the insert command at the start of each line. I began with the application.txt file. I first ran the command to see whether or not I had the correct command to add the insert statements at the beginning of each line.
cat application.txt | sed "s/^/insert into application(ApplNo,ApplType, SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType, ChemicalType,Ther_Potential,Orphan_Code) values(/g"
- The command seemed to be correct, so next I needed to remove the first line, which contained the column headers.
cat application.txt | sed "1D" | sed "s/^/insert into application(ApplNo, ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag, ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
- With the first line removed and the insert added, I turned my attention towards formatting the varchar's to be surrounded with single quotes, and replacing the tabs between each piece of data with commas. First, to confirm that the tabs were being replaced properly, I created a random sed command to replace sed with the word "BIODB".
cat application.txt | sed "1D" | sed "s/\t/BIODB/g"
- After confirming this command worked, I decided to replace the word "BIODB" with a comma and confirm that the results were the same.
cat application.txt | sed "1D" | sed "s/\t/,/g"
- Once I saw that this worked, I knew I needed to surround the strings (varchars) and chars with commas. The strings/chars were the second, third, sixth, eigth, and ninth columns. The easiest way to do this in my mind was to insert the commas one at a time, putting quotes as needed around the necessary columns. To add the quotes around the second column character, I tested the following command.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/'&/" | sed "s/\t/,/"
- After, I condensed the sed command to
sed "s/\t/,'/;s/\t/'&/;s/\t/,/"
. After testing that, I realized that the last two commands could be merged into one sed command:sed "s/\t/','/"
. Adding this in, the current joined command wassed "s/\t/,'/;s/\t/','/"
. Next, i moved on to surrounding the next word with quotes. The next two columns didn't need to be surrounded with quotes, but the sixth one did. I tested the following command to see if it would work.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/" | sed "s/\t/',/"| sed "s/\t/,/" | sed "s/\t/,'/"
- It worked exactly as expected. After condensing the sed command again to the command
sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/"
. I used the same process to complete the parsing of the text.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/"
- Concactenating the sed commands together, I got the following command.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/; s/\t/',/;s/\t/,'/;s/\t/'/"
- Next, I noticed that there were some blank columns. This would result in an error if I attempted to insert it into the table. In order to prevent an error, I'd need to insert null in the positions with nothing to insert. In order to do this, I initially tried the following.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/; s/\t/',/;s/\t/,'/;s/\t/'/" | sed "s/,,/,null,/g"
- This seemed to work, so I concactenating that sed command to the end of the exiting one.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/; s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g"
- With that done, it was time to put everything together. First, I would add the first part of the command for the insert command. Then, I'd close it with
);
. I had already created the opening previously, and I could simply add the closing to the end of the last quote. I tested the full command to see if it worked.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/; s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g" | sed "s/^/insert into application( ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag, ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
- The test succeeded, so I combined the sed commands into one large sed command, then piped the result into a new file.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/; s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g;s/^/insert into application(ApplNo, ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag, ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt
- Before parsing the Product.txt file, I decided to test the sql file I'd just created. Upon initially looking at it, I noticed two issues. The first was that I'd forgotten about the Orphan_Code column, and so several lines had a random V at the end. Additionally, each line without the V needed a null in the last column. So I went back to my concatenated sed command, separated it out, and began the second try to fix these issues.
Second attempt at formatting application.txt
- To separate out the sed command, I used the following command:
sed "s/;/\" \| sed \"/g"
- After fixing the error that was caused by the semicolon between
s/\t/,'/;s/\t/');/;s/,,/,null,/g
, I got my separated sed command of:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/');/" | sed "s/,,/,null,/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag, CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt
- I removed the piping into the file and the insertion to create just the command that formatted the original lines.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g"
- With my original command back, I began thinking on how to remove the whitespace and account for that last column. I knew that a single whitespace character should be ok, but more than one should be deleted and replaced. I first began with a command to remove all whitespace in order to ensure that it worked.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" |sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ //g"
- Seeing that that worked, I knew now that I needed to search for more two or more spaces. Trying
sed "s/ *//g"
resulted in all whitespace being removed, but adding another space in between worked perfectly. The final command that removed the excess whitespace was as follows:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ *//g"
- With the whitespace issue fixed, I now needed to work on fixing the final column error. I knew that I needed to search for the presence of a V at the end of the line. If a V existed, I would need to put quotes around the V. If no V was there, I would need to add null. I began by testing just if I could find the V's at the end of the lines and surround them with quotes. I tested using the following command:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ *//g" | sed "s/V\r$/,'V'/g"
- Feeling relatively confidant with what I had up to this point, I concatenated the sed commands prior to the new one I added to shorten it a little bit.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g" | sed "s/V\r$/,'V'/g"
- The command seemed to work perfectly. Next, I needed to find a way to add null to the ends of the lines. After some thinking, I deduced that just adding the null to every line, then removing it from lines with the 'V' would be the easiest option. I accomplished this with the following command:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g" | sed "s/V\r$/,'V'/g" | sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g"
- Lastly, I wanted to replace the empty characters ('') with null as well. The command was as follows:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g" | sed "s/V\r$/,'V'/g" | sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g" | sed "s/,'',/,null,/g"
- After concatenating all the sed commands yet again, I re-added the commands for formatting the lines as sql commands. While attempting this, I noticed something unusual. The addition of
,null
to the ends of the lines was not applying to the lines with 'V'. For one reason or another, I could no longer do any formatting to these lines using\r$
. This, however, was beneficial, as it allowed me to simply add the SQL ending to the V directly, and eliminated the need for removing the null from the end of the V. After making the changes, this was the resulting command:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g" | sed "s/V\r$/,'V');/g; s/\r$/,null);/g;s/,'',/,null,/g" | sed "s/^/insert into application(ApplNo, ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag, ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
- After condensing all the sed commands down, I piped the result into the application.sql.txt file.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g;s/V\r$/,'V');/g; s/\r$/,null);/g;s/,'',/,null,/g;s/^/insert into application(ApplNo,ApplType, SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType, ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt
- Again, I attempted to input this file into the database. I obtained an error, stating that a value was too long for type character. As a first attempt to solve the problem, I converted all of the chars in the table to varchars.
alter table application alter ApplType type varchar; alter table application alter Ther_Potential type varchar; alter table application alter Orphan_Code type varchar;
- After this, I attempted the query again. This time, it completed successfully!
Formatting Product.txt
- I began to edit the products.txt file by first removing the first line that contained only the column headers. A quick search of the file in excel revealed that there were several lines that contained single quotes. Those single quotes would need to be escaped using two single quotes. To test, I performed both of those commands at the same time.
cat Product.txt | sed "1D" | sed "s/'/''/g"
- After confirming that worked, I began separating the columns with commas, and wrapping the varchars with quotes. The third, fourth, sixth, eigth, and ninth columns needed to be wrapped in quotes. The procedure for doing so was the same as before: I stepped through each tab, adding quotes around the ones that needed quotes. The command in full was as follows:
cat Product.txt | sed "1D" | sed "s/'/''/g"| sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\r$/'/g"
- I condensed this sed command down to the following:
cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/; s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/'/g"
- After this, I replaced the empty chars ('') with null. Then, I added the closing bracket and semicolon to the end of each string.
cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/; s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g" | sed "s/,'',/,null,/g"
- All that was left was to add the beginning portion of each query.
cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/; s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g" | sed "s/,'',/,null,/g" | sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode, ReferenceDrug,drugname,activeingred) values(/g"
- After condensing the sed command, I piped the value into the Product.sql.txt file.
cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/; s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g;s/,'',/,null,/g; s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode, ReferenceDrug,drugname,activeingred) values(/g" > Product.sql.txt
- I accessed the file from my site, then attempted to run the query. The query completed successfully!
Questions to Answer
- Provide the DDL (
create table
) statements that you used for your application and product tables.- The rational for this was provided above, under the "Preparing the database" section.
create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);
create table product (ApplNo int references application, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar); alter table application alter ApplType type varchar; alter table application alter Ther_Potential type varchar; alter table application alter Orphan_Code type varchar;
- Provide the
sed
command sequences that you used to convert the raw text files into sequences of SQLinsert
statements.- The rationales for these are listed above.
- application.txt:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/; s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/ *//g" | sed "s/V\r$/,'V');/g; s/\r$/,null);/g;s/,'',/,null,/g" | sed "s/^/insert into application(ApplNo, ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag, ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
- Product.txt
cat Product.txt | sed "1D" | sed "s/'//g;s/\t/,/;s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g; s/,,/,null,/g;s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug, drugname,activeingred) values(/g" > Product.sql.txt
- Using the command line, how can you determine the number of records in each file? Provide the command.
- using the command
wc
can show you the number of lines in each file. Using wc, the first number is the number of lines. Because each record is on it's own line, that is the number we are looking for. Knowing that there is one line that contains the column headers, it can be assumed that the number of records is one less than the number of lines listed. Therefore application.txt has 19,746 records, and Product.txt has 32,770 records.
- using the command
wc application.txt && wc Product.txt 19747 147336 1615694 application.txt 32771 369379 2856017 Product.txt
- Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL
select
statement.- Using SQL, I can use the
select count(*)
command to find the number of records in the table.
- Using SQL, I can use the
select count(*) from product; -- returns 32,770 select count(*) from application; -- returns 19,746
- In your database, are these numbers the same or different? Explain why you think so.
- The initial numbers are different, because
wc
still finds the header text of the documents. So the SQL shows one less thanwc
does.
- The initial numbers are different, because
For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
- What are the names of the drug products that are administered in the form
INJECTABLE;INTRAVENOUS, SUBCUTANEOUS
?
select drugname from product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';
- What are the names of the drug products whose active ingredient (activeingred) is
ATROPINE
?
select distinct drugname from product where activeingred = 'ATROPINE';
- In what forms and dosages can the drug product named
BENADRYL
be administered?
select form, dosage from product where drugname = 'BENADRYL';
- Which drug products have a name ending in
ESTROL
?
select distinct drugname from product where drugname like '%ESTROL';
- Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
select distinct ther_potential, count(ther_potential) from application group by ther_potential;
- Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
select distinct chemicaltype, count(chemicaltype) from application group by chemicaltype;
- What are the names of the drug products that are sponsored (sponsor applicant column) by
MERCK
?
select distinct drugname from product where applno in ( select applno from application where sponsorapplicant = 'MERCK' );
(click to enlarge)
- Which sponsor applicant companies have the text
LABS
in their names and have products whose active ingredients (activeingred) include bothASPIRIN
andCAFFEINE
?
select distinct sponsorapplicant from application where sponsorapplicant like '%LABS%' and applno in ( select applno from product where activeingred like '%ASPIRIN%' and activeingred like '%CAFFEINE%' );
Links
Nicole Anguiano
BIOL 367, Fall 2015
Assignment Links
- Week 1 Assignment
- Week 2 Assignment
- Week 3 Assignment
- Week 4 Assignment
- Week 5 Assignment
- Week 6 Assignment
- Week 7 Assignment
- Week 8 Assignment
- Week 9 Assignment
- Week 10 Assignment
- Week 11 Assignment
- Week 12 Assignment
- Week 14 Assignment
- Week 15 Assignment
Individual Journals
- Individual Journal Week 2
- Individual Journal Week 3
- Individual Journal Week 4
- Individual Journal Week 5
- Individual Journal Week 6
- Individual Journal Week 7
- Individual Journal Week 8
- Individual Journal Week 9
- Individual Journal Week 10
- Individual Journal Week 11
- Individual Assessment
- Deliverables