Bklein7 Week 6
From LMU BioDB 2015
Contents
The FDA Drug Database
Acquiring Data From the Drugs@FDA ZIP File
Downloading and Viewing
- To download the zip file directly from the FDA website, I typed the following into the command line:
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
- Next, I unziped the FDA file
unzip UCM054599.zip
- The resulting output included the application.txt and Product.txt files. To view the format of these files, I used more commands. The output I received appeared to have listed columns which were not clearly defined. Therefore, I visualized the data using Notepad. This made the columns easier to read.
- In Notepad, I got a better idea of how I needed to edit the txt files to import them into a table in pgAdminIII.
Editing the txt files with sed
- application.txt
- Special Characters
- To begin, I wanted to determine if there were any special characters in the text file such as semicolons or apostrophes that needed to be worked around. I did two searches in notepad for these characters and found that this was not the case.
- Removing Excess Tabs and Spaces
- Next, I began the pipe with a cat command and then wanted to clean up the file to make it easier to read.
- Data values were separated by tab delimiters on the FDA website. Therefore, I began by researching how to manipulate tabs on the command line. To figure out how to replace the tabs with commas, I did some google searching. I found a way to tackle this problem on electrictoolbox. Using this piece of information, I wrote a sed command to replace tabs with commas. This cleaned up excess spaces from the file, although not entirely.
- In reviewing the output, I found there to be multiple instances of sequences of 2 or 4 spaces. Therefore, I wrote a sed -r command to delete these sequences.
- Finally, I noticed that some values for the SponsorApplicant column included unnecessary spaces after the listed name. Therefore, I wrote a sed command to delete these extra spaces. The output given by this version of the command sequence only had spaces in the application sponsor names as was desired.
- Next, I began the pipe with a cat command and then wanted to clean up the file to make it easier to read.
- Formatting the Values
- Now that excess spaces were deleted, I needed to clean up the syntax so that the 9 data values would be read correctly by SQL.
- The most glaring issue with the cleanup up output was the preponderance of missing data values. Doing a google search, I found here that missing values are labelled as "NULL" in the command line. Thus, I had to find a way to replace all missing values with NULL.
- In every case except for the 9th data value at the end of the line, a null value was indicated by 2 commas in a row. Therefore, I wrote a sed command to insert the word NULL in between instances of 2 commas. Because occasionally there would be two missing values in a row, this command did not entirely work. Although I image there are more compact ways to address this issue, I simply repeated the command twice to clean up any skipped values.
- To handle the null values at the end of a line, I wrote a sed command to add NULL at the end of each line that ended with a comma.
- The next issue in formatting the data values for insert was placing all varchar inputs within single quotes. This presented a challenge because only some of the data values were varchars, whereas others were intergers. Additionally, some values were absent (null) and thus could not be set off by single quotes.
- To tackle this issue, I wrote a long string of linked sed commands to selectively add single quotes before/after specific commas in each line. This way, I only added single quotes to data values in varchar columns.
- To address the null values issue, I wrote a sed command to delete single quotes present around any occurrences of the word null.
- In reviewing the output of the above commands, I noticed that some SponsorApplicant entries included commas as part of the data. This rendered my selective adding of single quotes around commas technique as ineffective for these lines. To solve this problem, I added a sed command at the very beginning of the pipe to replace any commas within the raw data with tildes. Then, at the end of the pipe, I added a command to replace these tildes with commas once again. This fix in formatting the data worked properly.
- The most glaring issue with the cleanup up output was the preponderance of missing data values. Doing a google search, I found here that missing values are labelled as "NULL" in the command line. Thus, I had to find a way to replace all missing values with NULL.
- Now that excess spaces were deleted, I needed to clean up the syntax so that the 9 data values would be read correctly by SQL.
- Writing the Values as Part of SQL Commands
- The next step in formatting the application.txt file was edit each line so it would read off as an SQL command.
- Referencing the application.txt file in Notepad, I identified all the column names for the data. Then using a sed command, I added the appropriate insert into table string to the beginning of each line.
- Next, I wrote a sed command to add the appropriate SQL closing to each line.
- The next step in formatting the application.txt file was edit each line so it would read off as an SQL command.
- Exporting the Data
- Finally, the data output now appeared ready to export. To close out, I added a command to export the output to ~/public_html/application.sql.txt
- When attempting to retrieve the above file, I noticed I forgot to delete the first line in the output (column labels). Therefore, I wrote one last sed command and exported this corrected file to ~/public_html/application2.sql.txt
- Finally, the data output now appeared ready to export. To close out, I added a command to export the output to ~/public_html/application.sql.txt
- Final Product
- The final command sequence used to create the input for SQL is as follows:
- Special Characters
cat application.txt | sed "s/,/~/g" | sed "s/ /,/g" | sed -r "s/( ){2}//g" | sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/ ,/,/g" | sed "s/,/,'/1;s/,/','/2; s/,/',/3;s/False/'False'/g;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/',/8" | sed "s/,\r$/,NULL/g; s/V\r$/'V'/g" | sed "s/'NULL'/NULL/g" | sed "s/~/,/g" | sed "s/^/insert into application(ApplNo, ApplType, SponsorApplicant, MostRecentLabelAvailableFlag, CurrentPatentFlag, ActionType, Chemical_Type, Ther_Potential, Orphan_Code) values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/application2.sql.txt
- Product.txt
- In crafting this command sequence, I followed the procedure above and used many similar commands.
cat Product.txt | sed "s/'//g" | sed "s/,/<comma>/g" | sed "s/ /,/g" | sed "s/,,/,NULL,/g;s/,,/,NULL,/g" | sed "s/,\r$/,NULL/g" | sed "s/,/,'/2;s/,/','/3; s/,/',/4;s/,/,'/5;s/,/',/6;s/,/,'/7;s/,/','/8;s/\r$/'/g" | sed "s/'NULL'/NULL/g" | sed "s/<comma>/,/g" | sed "s/^/insert into Product(ApplNo, ProductNo, Form, Dosage, ProductMktStatus, TECode, ReferenceDrug, drugname, activeingred) values (/g" | sed "s/$/);/g" | sed "1D" > ~/public_html/Product.sql.txt
Questions to Answer
- Provide the DDL (
create table
) statements that you used for your application and product tables. - Provide the
sed
command sequences that you used to convert the raw text files into sequences of SQLinsert
statements. - Using the command line, how can you determine the number of records in each file? Provide the command.
- Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL
select
statement. - In your database, are these numbers the same or different? Explain why you think so.
- For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
- What are the names of the drug products that are administered in the form
INJECTABLE;INTRAVENOUS, SUBCUTANEOUS
? - What are the names of the drug products whose active ingredient (activeingred) is
ATROPINE
? - In what forms and dosages can the drug product named
BENADRYL
be administered? - Which drug products have a name ending in
ESTROL
? - Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)
- Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)
- What are the names of the drug products that are sponsored (sponsor applicant column) by
MERCK
? - Which sponsor applicant companies have the text
LABS
in their names and have products whose active ingredients (activeingred) include bothASPIRIN
andCAFFEINE
?
Links
- User Page: Brandon Klein
- Team Page: The Class Whoopers
Assignments Pages
- Week 1 Assignment
- Week 2 Assignment
- Week 3 Assignment
- Week 4 Assignment
- Week 5 Assignment
- Week 6 Assignment
- Week 7 Assignment
- Week 8 Assignment
- Week 9 Assignment
- Week 10 Assignment
- Week 11 Assignment
- Week 12 Assignment
- No Week 13 Assignment
- Week 14 Assignment
- Week 15 Assignment
Individual Journal Entries
- Week 1 Individual Journal
- Week 2 Individual Journal
- Week 3 Individual Journal
- Week 4 Individual Journal
- Week 5 Individual Journal
- Week 6 Individual Journal
- Week 7 Individual Journal
- Week 8 Individual Journal
- Week 9 Individual Journal
- Week 10 Individual Journal
- Week 11 Individual Journal
- Week 12 Individual Journal
- No Week 13 Journal
- Week 14 Individual Journal
- Week 15 Individual Journal
- Week 1 Class Journal
- Week 2 Class Journal
- Week 3 Class Journal
- Week 4 Class Journal
- Week 5 Class Journal
- Week 6 Class Journal
- Week 7 Class Journal
- Week 8 Class Journal
- Week 9 Class Journal
- Week 10 Team Journal
- Week 11 Team Journal
- Week 12 Team Journal
- No Week 13 Journal
- Week 14 Team Journal
- Week 15 Team Journal