Nanguiano Week 6

The FDA Drug Database

Direct Download/Unzipping Commands

Preparation

First, to begin the assignment, I ssh'd into my server.

ssh nanguiano@lion.lmu.edu

Next, I entered my folder for this class and created a directory for this week's assignment, then entered the directory.

cd biodb
mkdir week6
cd week6

In the directory, I downloaded the file from the FDA.

curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip

After the download completed, I unzipped the file.

unzip UCM054599.zip

Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.

cd ~/public_html
mkdir biodb

I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.

mv movie.sql.txt biodb
cd ~/biodb/week6

I moved the files into the prepared folder in the public_html directory.

mv application.txt ~/public_html/biodb
mv Product.txt ~/public_html/biodb

Preparing the Database

Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
- Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).
- Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).
I began by creating the database that I would be using for this week's assignment. I opened pgAdmin on my own computer, and created a new database, called "FDADrugDB". I opened the SQL editor and typed in the following commands to create my two tables:

create table application (ApplNo int primary key, ApplType char, SponsorApplicant
varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, 
ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);

create table product (ApplNo int references application, ProductNo int, Form
varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug
int, drugname varchar, activeingred varchar);

First attempt at formatting application.txt

After creating the tables, I needed to change the structure of the txt files so that I could insert them into the tables. I knew that I needed to add the insert command at the start of each line. I began with the application.txt file. I first ran the command to see whether or not I had the correct command to add the insert statements at the beginning of each line.

cat application.txt | sed "s/^/insert into application(ApplNo,ApplType,
SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,
ChemicalType,Ther_Potential,Orphan_Code) values(/g"

The command seemed to be correct, so next I needed to remove the first line, which contained the column headers.

cat application.txt | sed "1D" | sed "s/^/insert into application(ApplNo,
ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

With the first line removed and the insert added, I turned my attention towards formatting the varchar's to be surrounded with single quotes, and replacing the tabs between each piece of data with commas. First, to confirm that the tabs were being replaced properly, I created a random sed command to replace sed with the word "BIODB".

cat application.txt | sed "1D" | sed "s/\t/BIODB/g"

After confirming this command worked, I decided to replace the word "BIODB" with a comma and confirm that the results were the same.

cat application.txt | sed "1D" | sed "s/\t/,/g"

Once I saw that this worked, I knew I needed to surround the strings (varchars) and chars with commas. The strings/chars were the second, third, sixth, eigth, and ninth columns. The easiest way to do this in my mind was to insert the commas one at a time, putting quotes as needed around the necessary columns. To add the quotes around the second column character, I tested the following command.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/'&/" | sed "s/\t/,/"

After, I condensed the sed command to sed "s/\t/,'/;s/\t/'&/;s/\t/,/". After testing that, I realized that the last two commands could be merged into one sed command: sed "s/\t/','/". Adding this in, the current joined command was sed "s/\t/,'/;s/\t/','/". Next, i moved on to surrounding the next word with quotes. The next two columns didn't need to be surrounded with quotes, but the sixth one did. I tested the following command to see if it would work.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/" | sed "s/\t/',/"| 
sed "s/\t/,/" | sed "s/\t/,'/"

It worked exactly as expected. After condensing the sed command again to the command sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/". I used the same process to complete the parsing of the text.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/"

Concactenating the sed commands together, I got the following command.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;
 s/\t/',/;s/\t/,'/;s/\t/'/"

Next, I noticed that there were some blank columns. This would result in an error if I attempted to insert it into the table. In order to prevent an error, I'd need to insert null in the positions with nothing to insert. In order to do this, I initially tried the following.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;
s/\t/',/;s/\t/,'/;s/\t/'/" | sed "s/,,/,null,/g"

This seemed to work, so I concactenating that sed command to the end of the exiting one.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;
s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g"

With that done, it was time to put everything together. First, I would add the first part of the command for the insert command. Then, I'd close it with );. I had already created the opening previously, and I could simply add the closing to the end of the last quote. I tested the full command to see if it worked.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;
s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g" | sed "s/^/insert into application(
ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

The test succeeded, so I combined the sed commands into one large sed command, then piped the result into a new file.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;
s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g;s/^/insert into application(ApplNo,
ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt

Before parsing the Product.txt file, I decided to test the sql file I'd just created. Upon initially looking at it, I noticed two issues. The first was that I'd forgotten about the Orphan_Code column, and so several lines had a random V at the end. Additionally, each line without the V needed a null in the last column. So I went back to my concatenated sed command, separated it out, and began the second try to fix these issues.

Second attempt at formatting application.txt

To separate out the sed command, I used the following command:

sed "s/;/\" \| sed \"/g"

After fixing the error that was caused by the semicolon between s/\t/,'/;s/\t/');/;s/,,/,null,/g, I got my separated sed command of:

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | 
sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/');/" | sed "s/,,/,null,/g" | sed "s/^/insert 
into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" 
> application.sql.txt

I removed the piping into the file and the insertion to create just the command that formatted the original lines.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | 
sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g"

With my original command back, I began thinking on how to remove the whitespace and account for that last column. I knew that a single whitespace character should be ok, but more than one should be deleted and replaced. I first began with a command to remove all whitespace in order to ensure that it worked.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | 
sed "s/\t/',/" | sed "s/\t/,/" |sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ //g"

Seeing that that worked, I knew now that I needed to search for more two or more spaces. Trying sed "s/ *//g" resulted in all whitespace being removed, but adding another space in between worked perfectly. The final command that removed the excess whitespace was as follows:

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | 
sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g"

With the whitespace issue fixed, I now needed to work on fixing the final column error. I knew that I needed to search for the presence of a V at the end of the line. If a V existed, I would need to put quotes around the V. If no V was there, I would need to add null. I began by testing just if I could find the V's at the end of the lines and surround them with quotes. I tested using the following command:

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | 
sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g" | 
sed "s/V\r$/,'V'/g"

Feeling relatively confidant with what I had up to this point, I concatenated the sed commands prior to the new one I added to shorten it a little bit.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g"

The command seemed to work perfectly. Next, I needed to find a way to add null to the ends of the lines. After some thinking, I deduced that just adding the null to every line, then removing it from lines with the 'V' would be the easiest option. I accomplished this with the following command:

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g" | 
sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g"

Lastly, I wanted to replace the empty characters ('') with null as well. The command was as follows:

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g" | 
sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g" | sed "s/,'',/,null,/g"

After concatenating all the sed commands yet again, I re-added the commands for formatting the lines as sql commands. While attempting this, I noticed something unusual. The addition of ,null to the ends of the lines was not applying to the lines with 'V'. For one reason or another, I could no longer do any formatting to these lines using \r$. This, however, was beneficial, as it allowed me to simply add the SQL ending to the V directly, and eliminated the need for removing the null from the end of the V. After making the changes, this was the resulting command:

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V');/g;
s/\r$/,null);/g;s/,'',/,null,/g"  | sed "s/^/insert into application(ApplNo,
ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

After condensing all the sed commands down, I piped the result into the application.sql.txt file.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g;s/V\r$/,'V');/g;
s/\r$/,null);/g;s/,'',/,null,/g;s/^/insert into application(ApplNo,ApplType,
SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,
ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt

Again, I attempted to input this file into the database. I obtained an error, stating that a value was too long for type character. As a first attempt to solve the problem, I converted all of the chars in the table to varchars.

alter table application alter ApplType type varchar;
alter table application alter Ther_Potential type varchar;
alter table application alter Orphan_Code type varchar;

After this, I attempted the query again. This time, it completed successfully!

Formatting Product.txt

I began to edit the products.txt file by first removing the first line that contained only the column headers. A quick search of the file in excel revealed that there were several lines that contained single quotes. Those single quotes would need to be escaped using two single quotes. To test, I performed both of those commands at the same time.

cat Product.txt | sed "1D" | sed "s/'/''/g"

After confirming that worked, I began separating the columns with commas, and wrapping the varchars with quotes. The third, fourth, sixth, eigth, and ninth columns needed to be wrapped in quotes. The procedure for doing so was the same as before: I stepped through each tab, adding quotes around the ones that needed quotes. The command in full was as follows:

cat Product.txt | sed "1D" | sed "s/'/''/g"| sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/',/" | 
sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\r$/'/g"

I condensed this sed command down to the following:

cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/;
s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/'/g"

After this, I replaced the empty chars ('') with null. Then, I added the closing bracket and semicolon to the end of each string.

cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/;
s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g" | 
sed "s/,'',/,null,/g"

All that was left was to add the beginning portion of each query.

cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/;
s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g" | sed "s/,'',/,null,/g" | 
sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,
ReferenceDrug,drugname,activeingred) values(/g"

After condensing the sed command, I piped the value into the Product.sql.txt file.

cat Product.txt | sed "1D" | sed "s/'/''/g;s/\t/,/;s/\t/,'/;s/\t/','/;
s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g;s/,'',/,null,/g;
s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,
ReferenceDrug,drugname,activeingred) values(/g" > Product.sql.txt

I accessed the file from my site, then attempted to run the query. The query completed successfully!

Questions to Answer

Provide the DDL (create table) statements that you used for your application and product tables.
- The rational for this was provided above, under the "Preparing the database" section.

create table application (ApplNo int primary key, ApplType char, SponsorApplicant
varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, 
ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);

create table product (ApplNo int references application, ProductNo int, Form
varchar, Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug
int, drugname varchar, activeingred varchar);

alter table application alter ApplType type varchar;
alter table application alter Ther_Potential type varchar;
alter table application alter Orphan_Code type varchar;

Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
- The rationales for these are listed above.
- application.txt:

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;
s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V');/g;
s/\r$/,null);/g;s/,'',/,null,/g"  | sed "s/^/insert into application(ApplNo,
ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

- Product.txt

cat Product.txt | sed "1D" | sed "s/'//g;s/\t/,/;s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/','/;s/\r$/');/g;
s/,,/,null,/g;s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,
drugname,activeingred) values(/g" > Product.sql.txt

Using the command line, how can you determine the number of records in each file? Provide the command.
- using the command wc can show you the number of lines in each file. Using wc, the first number is the number of lines. Because each record is on it's own line, that is the number we are looking for. Knowing that there is one line that contains the column headers, it can be assumed that the number of records is one less than the number of lines listed. Therefore application.txt has 19,746 records, and Product.txt has 32,770 records.

wc application.txt && wc Product.txt
19747  147336 1615694 application.txt
32771  369379 2856017 Product.txt

Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
- Using SQL, I can use the select count(*) command to find the number of records in the table.

select count(*) from product; -- returns 32,770
select count(*) from application; -- returns 19,746

In your database, are these numbers the same or different? Explain why you think so.
- The initial numbers are different, because wc still finds the header text of the documents. So the SQL shows one less than wc does.

For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.

What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?

select drugname from product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';

What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?

select distinct drugname from product where activeingred = 'ATROPINE';

In what forms and dosages can the drug product named BENADRYL be administered?

select form, dosage from product where drugname = 'BENADRYL';

Which drug products have a name ending in ESTROL?

select distinct drugname from product where drugname like '%ESTROL';

Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)

select distinct ther_potential, count(ther_potential) from application group by ther_potential;

Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)

select distinct chemicaltype, count(chemicaltype) from application group by chemicaltype;

What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?

select distinct drugname from product where applno in (
  select applno from application where sponsorapplicant = 'MERCK'
);

(click to enlarge)

Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

select distinct sponsorapplicant from application where sponsorapplicant like '%LABS%' and applno in (
  select applno from product where activeingred like '%ASPIRIN%' and activeingred like '%CAFFEINE%'
);

Links

Nicole Anguiano
BIOL 367, Fall 2015

Assignment Links

Individual Journals

Shared Journals

Nanguiano Week 6

Contents

The FDA Drug Database

Direct Download/Unzipping Commands

Preparation

Preparing the Database

First attempt at formatting application.txt

Second attempt at formatting application.txt

Formatting Product.txt

Questions to Answer

Links

Assignment Links

Individual Journals

Shared Journals

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools