Difference between revisions of "Nanguiano Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(The FDA Drug Database: take two of processing the application.txt file)
(The FDA Drug Database: got the application.txt working. now onto product)
Line 113: Line 113:
 
  application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
 
  application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
 
  CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt  
 
  CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt  
* Again, I attempted to input this file into the database.
+
* Again, I attempted to input this file into the database. I obtained an error, stating that a value was too long for type character. As a first attempt to solve the problem, I converted all of the chars in the table to varchars.
 +
alter table application alter ApplType type varchar;
 +
alter table application alter Ther_Potential type varchar;
 +
alter table application alter Orphan_Code type varchar;
 +
* After this, I attempted the query again. This time, it completed successfully!
 +
 
 +
==== Formatting Product.txt ====
  
 
=== Questions to Answer ===
 
=== Questions to Answer ===
Line 120: Line 126:
 
  create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean,  
 
  create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean,  
 
  CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);
 
  CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);
 
 
  create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int,  
 
  create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int,  
 
  TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);
 
  TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);
 +
 +
alter table application alter ApplType type varchar;
 +
alter table application alter Ther_Potential type varchar;
 +
alter table application alter Orphan_Code type varchar;
 +
 
* Provide the <code>sed</code> command sequences that you used to convert the raw text files into sequences of SQL <code>insert</code> statements.
 
* Provide the <code>sed</code> command sequences that you used to convert the raw text files into sequences of SQL <code>insert</code> statements.
 
** application.txt:
 
** application.txt:
 +
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;
 +
s/\t/'/;s/,,/,null,/g;s/  *//g;s/V\r$/,'V');/g;s/\r$/,null);/g;s/,'',/,null,/g;s/^/insert into
 +
application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
 +
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt
 
** Product.txt
 
** Product.txt
 
# Using the command line, how can you determine the number of records in each file? Provide the command.
 
# Using the command line, how can you determine the number of records in each file? Provide the command.

Revision as of 06:29, 14 October 2015

The FDA Drug Database

Direct Download/Unzipping Commands

Preparation

  • First, to begin the assignment, I ssh'd into my server.
ssh nanguiano@lion.lmu.edu
  • Next, I entered my folder for this class and created a directory for this week's assignment, then entered the directory.
cd biodb
mkdir week6
cd week6
  • In the directory, I downloaded the file from the FDA.
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
  • After the download completed, I unzipped the file.
unzip UCM054599.zip
  • Next, I needed to move the application.txt and Product.txt files into my public_html folder. However, I first wanted to create a folder for the files. I entered my public_html folder and created a folder for this class, biodb.
cd ~/public_html
mkdir biodb
  • I noticed after making the folder that the "movie.sql.txt" file from a few classes ago was still there. I moved that file into the biodb folder, then returned to my week6 folder to begin the transfer of the files I needed to move.
mv movie.sql.txt biodb
cd ~/biodb/week6
  • I moved the files into the prepared folder in the public_html directory.
mv application.txt ~/public_html/biodb
mv Product.txt ~/public_html/biodb

Preparing the Database

  • Next, I needed to process the files to prepare them for entry into PostgreSQL. The files needed to be prepared with the insert commands so that they could be added into my database. Before doing this, i needed to ensure I knew the structure of the tables that I was going to enter the files into. To begin this process, I downloaded the two files and opened them in Excel so I could see what the header of each column was, and what type of data it contained.
    • Opening the "application.txt" file showed me that it contained 9 columns: ApplNo (int), ApplType (char), SponsorApplicant (varchar), MostRecentLabelAvailableFlag (bool), CurrentPatentFlag (bool), ActionType (varchar), ChemicalType (int), Ther_Potential (char), and Orphan_Code (char).
    • Opening the "Product.txt" file showed me that it also contained 9 columns: ApplNo (int), ProductNo (int), Form (varchar), Dosage (varchar), ProductMktStatus (int), TECode (varchar), ReferenceDrug (int), drugname (varchar), and activeingred (varchar).
  • I began by creating the database that I would be using for this week's assignment. I opened pgAdmin on my own computer, and created a new database, called "FDADrugDB". I opened the SQL editor and typed in the following commands to create my two tables:
create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, 
CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);

create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int, 
TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);

First attempt at formatting application.txt

  • After creating the tables, I needed to change the structure of the txt files so that I could insert them into the tables. I knew that I needed to add the insert command at the start of each line. I began with the application.txt file. I first ran the command to see whether or not I had the correct command to add the insert statements at the beginning of each line.
cat application.txt | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
  • The command seemed to be correct, so next I needed to remove the first line, which contained the column headers.
cat application.txt | sed "1D" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
  • With the first line removed and the insert added, I turned my attention towards formatting the varchar's to be surrounded with single quotes, and replacing the tabs between each piece of data with commas. First, to confirm that the tabs were being replaced properly, I created a random sed command to replace sed with the word "BIODB".
cat application.txt | sed "1D" | sed "s/\t/BIODB/g"
  • After confirming this command worked, I decided to replace the word "BIODB" with a comma and confirm that the results were the same.
cat application.txt | sed "1D" | sed "s/\t/,/g"
  • Once I saw that this worked, I knew I needed to surround the strings (varchars) and chars with commas. The strings/chars were the second, third, sixth, eigth, and ninth columns. The easiest way to do this in my mind was to insert the commas one at a time, putting quotes as needed around the necessary columns. To add the quotes around the second column character, I tested the following command.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/'&/" | sed "s/\t/,/"
  • After, I condensed the sed command to sed "s/\t/,'/;s/\t/'&/;s/\t/,/". After testing that, I realized that the last two commands could be merged into one sed command: sed "s/\t/','/". Adding this in, the current joined command was sed "s/\t/,'/;s/\t/','/". Next, i moved on to surrounding the next word with quotes. The next two columns didn't need to be surrounded with quotes, but the sixth one did. I tested the following command to see if it would work.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/" | sed "s/\t/',/"| sed "s/\t/,/" | sed "s/\t/,'/"
  • It worked exactly as expected. After condensing the sed command again to the command sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/". I used the same process to complete the parsing of the text.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/"
  • Concactenating the sed commands together, I got the following command.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/"
  • Next, I noticed that there were some blank columns. This would result in an error if I attempted to insert it into the table. In order to prevent an error, I'd need to insert null in the positions with nothing to insert. In order to do this, I initially tried the following.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/" | sed "s/,,/,null,/g"
  • This seemed to work, so I concactenating that sed command to the end of the exiting one.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;s/,,/,null,/g"
  • With that done, it was time to put everything together. First, I would add the first part of the command for the insert command. Then, I'd close it with );. I had already created the opening previously, and I could simply add the closing to the end of the last quote. I tested the full command to see if it worked.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g" | 
sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g"
  • The test succeeded, so I combined the sed commands into one large sed command, then piped the result into a new file.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/');/;s/,,/,null,/g;
s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,
ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt
  • Before parsing the Product.txt file, I decided to test the sql file I'd just created. Upon initially looking at it, I noticed two issues. The first was that I'd forgotten about the Orphan_Code column, and so several lines had a random V at the end. Additionally, each line without the V needed a null in the last column. So I went back to my concatenated sed command, separated it out, and began the second try to fix these issues.

Second attempt at formatting application.txt

  • To separate out the sed command, I used the following command:
sed "s/;/\" \| sed \"/g"
  • After fixing the error that was caused by the semicolon between s/\t/,'/;s/\t/');/;s/,,/,null,/g, I got my separated sed command of:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/');/" | sed "s/,,/,null,/g" | sed "s/^/insert into application(ApplNo,ApplType,
SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,
Orphan_Code) values(/g" > application.sql.txt
  • I removed the piping into the file and the insertion to create just the command that formatted the original lines.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g"
  • With my original command back, I began thinking on how to remove the whitespace and account for that last column. I knew that a single whitespace character should be ok, but more than one should be deleted and replaced. I first began with a command to remove all whitespace in order to ensure that it worked.
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" |
sed "s/\t/,'/" | sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/ //g"
  • Seeing that that worked, I knew now that I needed to search for more two or more spaces. Trying sed "s/ *//g" resulted in all whitespace being removed, but adding another space in between worked perfectly. The final command that removed the excess whitespace was as follows:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g"
  • With the whitespace issue fixed, I now needed to work on fixing the final column error. I knew that I needed to search for the presence of a V at the end of the line. If a V existed, I would need to put quotes around the V. If no V was there, I would need to add null. I began by testing just if I could find the V's at the end of the lines and surround them with quotes. I tested using the following command:
cat application.txt | sed "1D" | sed "s/\t/,'/" | sed "s/\t/','/" | sed "s/\t/',/" | sed "s/\t/,/" | sed "s/\t/,'/" | 
sed "s/\t/',/" | sed "s/\t/,'/" | sed "s/\t/'/" | sed "s/,,/,null,/g" | sed "s/   *//g" | sed "s/V\r$/,'V'/g"
  • Feeling relatively confidant with what I had up to this point, I concatenated the sed commands prior to the new one I added to shorten it a little bit.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;
s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g"
  • The command seemed to work perfectly. Next, I needed to find a way to add null to the ends of the lines. After some thinking, I deduced that just adding the null to every line, then removing it from lines with the 'V' would be the easiest option. I accomplished this with the following command:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;s/\t/'/;
s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g" | sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g"
  • Lastly, I wanted to replace the empty characters ('') with null as well. The command was as follows:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;
s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V'/g" | sed "s/\r$/,null/g" | sed "s/,'V',null/,'V'/g" | 
sed "s/,,/,null,/g"
  • After concatenating all the sed commands yet again, I re-added the commands for formatting the lines as sql commands. While attempting this, I noticed something unusual. The addition of ,null to the ends of the lines was not applying to the lines with 'V'. For one reason or another, I could no longer do any formatting to these lines using \r$. This, however, was beneficial, as it allowed me to simply add the SQL ending to the V directly, and eliminated the need for removing the null from the end of the V. After making the changes, this was the resulting command:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;
s/\t/,'/;s/\t/'/;s/,,/,null,/g;s/   *//g" | sed "s/V\r$/,'V');/g;s/\r$/,null);/g;s/,,/,null,/g"  | 
sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,
MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,ChemicalType,
Ther_Potential,Orphan_Code) values(/g"
  • After condensing all the sed commands down, I piped the result into the application.sql.txt file.
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;
s/\t/'/;s/,,/,null,/g;s/   *//g;s/V\r$/,'V');/g;s/\r$/,null);/g;s/,,/,null,/g;s/^/insert into 
application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt 
  • Again, I attempted to input this file into the database. I obtained an error, stating that a value was too long for type character. As a first attempt to solve the problem, I converted all of the chars in the table to varchars.
alter table application alter ApplType type varchar;
alter table application alter Ther_Potential type varchar;
alter table application alter Orphan_Code type varchar;
  • After this, I attempted the query again. This time, it completed successfully!

Formatting Product.txt

Questions to Answer

  • Provide the DDL (create table) statements that you used for your application and product tables.
create table application (ApplNo int primary key, ApplType char, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, 
CurrentPatentFlag boolean, ActionType varchar, ChemicalType int, Ther_Potential char, Orphan_Code char);
create table product (ApplNo int references application, ProductNo int primary key, Form varchar, Dosage varchar, ProductMktStatus int, 
TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar);

alter table application alter ApplType type varchar;
alter table application alter Ther_Potential type varchar;
alter table application alter Orphan_Code type varchar;
  • Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
    • application.txt:
cat application.txt | sed "1D" | sed "s/\t/,'/;s/\t/','/;s/\t/',/;s/\t/,/;s/\t/,'/;s/\t/',/;s/\t/,'/;
s/\t/'/;s/,,/,null,/g;s/   *//g;s/V\r$/,'V');/g;s/\r$/,null);/g;s/,,/,null,/g;s/^/insert into 
application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,
CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,Orphan_Code) values(/g" > application.sql.txt 
    • Product.txt
  1. Using the command line, how can you determine the number of records in each file? Provide the command.
  2. Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
  3. In your database, are these numbers the same or different? Explain why you think so.
    For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
  4. What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
  5. What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
  6. In what forms and dosages can the drug product named BENADRYL be administered?
  7. Which drug products have a name ending in ESTROL?
  8. Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
  9. Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
  10. What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
  11. Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

Links

Nicole Anguiano
BIOL 367, Fall 2015

Assignment Links
Individual Journals
Shared Journals