Difference between revisions of "Nanguiano Week 6"

Revision as of 02:14, 14 October 2015

The FDA Drug Database

Direct Download/Unzipping Commands

Preparation

First, to begin the assignment, I ssh'd into my server.

ssh nanguiano@lion.lmu.edu

Next, I entered my folder for this class and created a directory for this week's assignment, then entered the directory.

cd biodb
mkdir week6
cd week6

In the directory, I downloaded the file from the FDA.

curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip

After the download completed, I unzipped the file.

unzip UCM054599.zip

Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.

cd ~/public_html
mkdir biodb

I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.

mv movie.sql.txt biodb
cd ~/biodb/week6

I moved the files into the prepared folder in the public_html directory.

mv application.txt ~/public_html/biodb
mv Product.txt ~/public_html/biodb

Preparing the Database

Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
- Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).
- Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).
I began by creating the database that I would be using for this week's assignment. I opened pgAdmin on my own computer, and created a new database, called "FDADrugDB". I opened the SQL editor and typed in the following commands to create my two tables:

create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, 
CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);

create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int, 
TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);

First attempt at formatting application.txt

After creating the tables, I needed to change the structure of the txt files so that I could insert them into the tables. I knew that I needed to add the insert command at the start of each line. I began with the application.txt file. I first ran the command to see whether or not I had the correct command to add the insert statements at the beginning of each line.

cat application.txt | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

The command seemed to be correct, so next I needed to remove the first line, which contained the column headers.

cat application.txt | sed "1D" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

With the first line removed and the insert added, I turned my attention towards formatting the varchar's to be surrounded with single quotes, and replacing the tabs between each piece of data with commas. First, to confirm that the tabs were being replaced properly, I created a random sed command to replace sed with the word "BIODB".

cat application.txt | sed "1D" | sed "s/\t/BIODB/g"

After confirming this command worked, I decided to replace the word "BIODB" with a comma and confirm that the results were the same.

cat application.txt | sed "1D" | sed "s/\t/,/g"

Once I saw that this worked, I knew I needed to surround the strings (varchars) and chars with commas. The strings/chars were the second, third, sixth, eigth, and ninth columns. The easiest way to do this in my mind was to insert the commas one at a time, putting quotes as needed around the necessary columns. To add the quotes around the second column character, I tested the following command.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/'&/" | sed "s/\t/,/"

After, I condensed the sed command to sed "s/\t/,'/;s/\t/'&/;s/\t/,/". After testing that, I realized that the last two commands could be merged into one sed command: sed "s/\t/','/". Adding this in, the current joined command was sed "s/\t/,'/;s/\t/','/". Next, i moved on to surrounding the next word with quotes. The next two columns didn't need to be surrounded with quotes, but the sixth one did. I tested the following command to see if it would work.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/" | sed "s/\t/',/"| sed "s/\t/,/" | sed "s/\t/,'/"

It worked exactly as expected. After condensing the sed command again to the command sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/". I used the same process to complete the parsing of the text.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/"

Concactenating the sed commands together, I got the following command.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/"

Next, I noticed that there were some blank columns. This would result in an error if I attempted to insert it into the table. In order to prevent an error, I'd need to insert null in the positions with nothing to insert. In order to do this, I initially tried the following.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/" | sed "s/,,/,null,/g"

This seemed to work, so I concactenating that sed command to the end of the exiting one.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g"

With that done, it was time to put everything together. First, I would add the first part of the command for the insert command. Then, I'd close it with );. I had already created the opening previously, and I could simply add the closing to the end of the last quote. I tested the full command to see if it worked.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g" | 
sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"

The test succeeded, so I combined the sed commands into one large sed command, then piped the result into a new file.

cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g;
s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt

Before parsing the Product.txt file, I decided to test the sql file I'd just created. Upon initially looking at it, I noticed two issues. The first was that I'd forgotten about the Orphan_Code column, and so several lines had a random V at the end. Additionally, each line without the V needed a null in the last column. So I went back to my concatenated sed command, separated it out, and began the second try to fix these issues.

Second attempt at formatting application.txt

To separate out the sed command, I used the following command:

sed "s/;/\" \| sed \"/g"

After fixing the error that was caused by the semicolon between s/\t/,'/;s/\t/');/;s/,,/,null,/g, I got my separated sed command of:

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/');/" | sed "s/,,/,null,/g" | sed "s/^/insert into application(ApplNo,ApplType,
SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,
Orphan_Code) values(/g" > application.sql.txt

I removed the piping into the file and the insertion to create just the command that formatted the original lines.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g"

With my original command back, I began thinking on how to remove the whitespace and account for that last column. I knew that a single whitespace character should be ok, but more than one should be deleted and replaced. I first began with a command to remove all whitespace in order to ensure that it worked.

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" |
sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ //g"

Seeing that that worked, I knew now that I needed to search for more two or more spaces. Trying sed "s/ *//g" resulted in all whitespace being removed, but adding another space in between worked perfectly. The final command that removed the excess whitespace was as follows:

cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g"

With the whitespace issue fixed, I now needed to work on fixing the final column error.

Questions to Answer

Provide the DDL (create table) statements that you used for your application and product tables.

create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, 
CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);

create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int, 
TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);

Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
- application.txt:
- Product.txt

Using the command line, how can you determine the number of records in each file? Provide the command.
Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
In your database, are these numbers the same or different? Explain why you think so.
For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
In what forms and dosages can the drug product named BENADRYL be administered?
Which drug products have a name ending in ESTROL?
Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

Links

Nicole Anguiano
BIOL 367, Fall 2015

Assignment Links

Individual Journals

Shared Journals

@@ Line 71: / Line 71: @@
 * Before parsing the Product.txt file, I decided to test the sql file I'd just created. Upon initially looking at it, I noticed two issues. The first was that I'd forgotten about the Orphan_Code column, and so several lines had a random V at the end. Additionally, each line without the V needed a null in the last column. So I went back to my concatenated sed command, separated it out, and began the second try to fix these issues.
-==== Second attempt at formatting applications.txt ====
+==== Second attempt at formatting application.txt ====
 * To separate out the sed command, I used the following command:
@@ Line 83: / Line 83: @@
   cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" |
   sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g"
-* With my original command back, I began thinking on how to remove the whitespace and account for that last column.
+* With my original command back, I began thinking on how to remove the whitespace and account for that last column. I knew that a single whitespace character should be ok, but more than one should be deleted and replaced. I first began with a command to remove all whitespace in order to ensure that it worked.
+ cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" |
+ sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ //g"
+* Seeing that that worked, I knew now that I needed to search for more two or more spaces. Trying <code>sed "s/  *//g"</code> resulted in all whitespace being removed, but adding another space in between worked perfectly. The final command that removed the excess whitespace was as follows:
+ cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" |
+ sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g"
+* With the whitespace issue fixed, I now needed to work on fixing the final column error.
 === Questions to Answer ===

Difference between revisions of "Nanguiano Week 6"

Revision as of 02:14, 14 October 2015

Contents

The FDA Drug Database

Direct Download/Unzipping Commands

Preparation

Preparing the Database

First attempt at formatting application.txt

Second attempt at formatting application.txt

Questions to Answer

Links

Assignment Links

Individual Journals

Shared Journals

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools