Troque Week 6

From LMU BioDB 2015
Revision as of 05:43, 15 October 2015 by Troque (Talk | contribs) (Questions to answer: Added some line breaks)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

User Page        Bio Databases Main Page       


Creating the Table on PostgreSQL

First, login to the LMU CS server using ssh. Type in the following in a command prompt (Windows) or terminal (Mac) window:

ssh <username@my.cs.lmu.edu>

Enter your password. Note: You will not visibly see the cursor move when typing in your password so just keep typing. You will end up on your home directory.

Here, you can use the command ls in order to see the list of files in the directory. Then we can start manipulating some files.

The process for downloading and unzipping the necessary files can be found here.

On the PostgreSQL command window, type the following to create a table called "application":

create table application (ApplNo int primary key, ApplType varchar, 
SponsorApplicant varchar, MostRecentLabelFlag boolean, CurrentPatentFlag boolean, 
ActionType varchar, ChemicalType int, Ther_Potential varchar, OrphanCode varchar);

To create the table for the product data, enter the following command on the PostgreSQL window:

create table product (ApplNo int references application, ProductNo int, Form varchar,
Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int,
drugname varchar, activeingred varchar);

Now that we've set up the data table, we can move on to preparing the data we downloaded from the FDA.

Processing the Data on the Command Line

Formatting the application.txt file

Before starting this assignment, I didn't realized that I could easily remove the spaces using the regex for the tab character \t, and so I tried to take out the unnecessary blank spaces using the regex \s*. I looked up online how to easily remove blank spaces and stumbled upon both \t and \s*, but decided to use \s*. I decided to go through each comma and attach the quotes, but this proved quite ineffective, tedious, and time-consuming. The result of using this method is the following (which didn't really work out since I got frustrated from all the outliers in this txt file and I eventually trashed this set of commands):

cat application.txt | 
sed -e 's/\s\+/,/2' | 
sed -e 's/\s/,/1'| 
sed "s/,/,'/1" | 
sed "s/,/',/2" | 
sed "s/,/,'/2" | 
sed "s/\s*False/',False/1" | 
sed "s/False\s*False/FalseFalse/g" | 
sed "s/False\s*/False,/g" | 
sed "s/,/,'/5" | 
sed "s/,'../&',/3" | 
sed "s/,\s*/,/6" | 
sed "s/\t/,'/" | 
sed "s/\t/'/" | 
sed "s/P',$/&'','',''/g" | 
sed "s/TA',$/&'','',''/g" | 
sed "s/''/null/g" | 
sed "s/V\r$/,'V'/g" | 
sed "s/'S'/&,null/g" | 
sed "s/'P'/&,null/g" | 
sed "1d" | 
sed "s/\t//g" | 
sed "s/[0-9],null/&,null/" | 
sed "s/^/insert into application (ApplNo,ApplType,SponsorApplicant,MostRecentLabelFlag,CurrentPatentFlag,
                                  ActionType,ChemicalType,Ther_Potential,OrphanCode) values (/g" | 
sed "s/\r$/);/" | 
sed "s/null$/&);/g" | 
sed "s/'V'$/&);/" > ~/public_html/application.sql.txt

From doing this, I realized how badly formatted these data are so I decided to start over and so the following works. There were still outliers, but since I already knew what they were from my previous attempt, I decide to slightly hardcode the commands for the ther_potentials (P, S, P*, S*):

cat application.txt |
sed "s/\t/,/g" | 
sed "s/,/,'/1" | 
sed "s/,'[A-Z]/&'/1" | 
sed "s/,/,'/2" | 
sed "s/\s*,False/',False/1" | 
sed "s/,/,'/5 " | 
sed "s/,/',/6" | 
sed "s/,,/,null,/" |
sed "s/,,/,null,/" |
sed "s/,\r$/,null/g" |
sed "s/,S,/,'S',/g" |
sed "s/,P,/,'P',/g" |
sed "s/,V/,'V'/g" |
sed "s/P\*/'P\*'/g" |
sed "s/S\*/'S\*'/g" |
sed "s/,AP,/,'AP',/g" |
sed "s/$/);/g" |
sed "s/^/insert into application (ApplNo,ApplType,SponsorApplicant,MostRecentLabelFlag,
                                  CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,
                                  OrphanCode) values (/g" |
sed "1d"  > ~/public_html/application.sql.txt

But then I realized that there was an even easier way of doing this assignment since PostgreSQL accepts quotes around Boolean values and numbers. (I didn't notice the tips that were before the questions so I didn't realize right away that we could use less commands). This first removes the tabs, then the spaces, then attaches the closing parenthesis, then replaces the empty strings with null, then removes the first instance of a quote, then removes the header, and finally adds the necessary "insert..." commands for SQL:

cat application.txt |
sed "s/\t/','/g" |
sed "s/ *'/'/g" |
sed "s/\r$/');/g" |
sed "s/''/null/g" |
sed "s/'//1" |
sed "1d" |
sed "s/^/insert into application (ApplNo,ApplType,SponsorApplicant,MostRecentLabelFlag,
                                 CurrentPatentFlag,ActionType,ChemicalType,Ther_Potential,
                                 OrphanCode) values (/g" > ~/public_html/application.sql.txt

Note: I always forget to remove the headers first, which is why the command sed "1d" always shows up near the end.

Formatting product.txt file

Doing this file proved easier than I thought since I already knew to inspect the data first before attempting to do the processing. This data set contained some ' so I needed to remove those first before doing anything. In the same manner as the commands above, I replaced the tabs with quotes and a commas, added the closing parenthesis, replaced the empty strings, removed the first quote, removed the header, and placed the "insert..." commands for SQL processing:

cat Product.txt |
sed "s/'/\"/g" |
sed "s/\t/','/g"|
sed "s/\r$/');/g" |
sed "s/''/null/g" |
sed "s/'//1" |
sed "1d" |
sed "s/^/insert into product (ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,
                             ReferenceDrug,drugname,activeingred) values (/g" > ~/public_html/Product.sql.txt

Questions to answer

  • Provide the DDL (create table) statements that you used for your application and product tables.

The create command is needed in order to accomplish this. We look at the txt files in order to see what to name the headers and what kinds of values (ints, booleans, varchar, etc.) are required for them.
application.txt:

create table application (ApplNo int primary key, ApplType varchar, 
SponsorApplicant varchar, MostRecentLabelFlag boolean, CurrentPatentFlag boolean, 
ActionType varchar, ChemicalType int, Ther_Potential varchar, OrphanCode varchar);

Product.txt:

create table product (ApplNo int references application, ProductNo int, Form varchar,
Dosage varchar, ProductMktStatus int, TECode varchar, ReferenceDrug int,
drugname varchar, activeingred varchar);



  • Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.

See above description for more info about how I came up with these sed commands.
application.txt:

cat application.txt |
sed "s/\t/','/g" |
sed "s/ *'/'/g" |
sed "s/\r$/');/g" |
sed "s/''/null/g" |
sed "s/'//1" |
sed "1d" |
sed "s/^/insert into application (ApplNo,ApplType,SponsorApplicant,MostRecentLabelFlag,CurrentPatentFlag,ActionType,
                                 ChemicalType,Ther_Potential,OrphanCode) values (/g" > ~/public_html/application.sql.txt

Product.txt:

cat Product.txt |
sed "s/'/\"/g" |
sed "s/\t/','/g"|
sed "s/\r$/');/g" |
sed "s/''/null/g" |
sed "s/'//1" |
sed "1d" | 
sed "s/^/insert into product (ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,
                             ReferenceDrug,drugname,activeingred) values (/g" > ~/public_html/Product.sql.txt


  • Using the command line, how can you determine the number of records in each file? Provide the command.

In order to determine how many records are in each file, we use the wc since this gives us the number of lines, word count, and byte count for each file. The number of records in each file, therefore, is the number of lines minus 1 since the first line for each is actually a row of labels. application.txt = 19746 records and Product.txt = 32770 records. The commands are as follows:

application.txt:

cat application.txt | wc

Product.txt:

cat product.txt | wc



  • Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.

In SQL we have to use the keyword count. The result will be in a table with the column label count bigint. Using SQL, the commands are as follows:

application.txt

select count(*) from application;

The result is 19746.

Product.txt

select count(*) from product;

The result is 32770. So both numbers match up with what the command line gave.

  • In your database, are these numbers the same or different? Explain why you think so.

The numbers are actually the same since what wc gives is actually all the records plus the header. Since the SQL table technically gives the records + the 1 row of labels that it generated for its environment, the header matches up with the first line of the files.

  • For the remaining questions, supply (a) the answer to the question and (b) the SQL statement that you used to answer the question.
  • What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?

(a) The drugs are "LOVENOX", "VELCADE", "VIDAZA", "ENOXAPARIN SODIUM", "ACTEMRA", and "AZACITIDINE".
(b) select drugname from product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS';.

  • What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?

(a) There were only two, ATROPEN and ATROPINE with the active ingredient ATROPINE. Note: There were 4 instances of ATROPEN in the database.
(b) select drugname from product where activeingred = 'ATROPINE';

  • In what forms and dosages can the drug product named BENADRYL be administered?

(a) Note: I looked online to see how to make these tables.

form dosage
1 CAPSULE;ORAL 50MG
2 ELIXIR;ORAL 12.5MG/5ML
3 CAPSULE;ORAL 25MG
4 INJECTABLE;INJECTION 10MG/ML
5 INJECTABLE;INJECTION 50MG/ML

(b) select form, dosage from product where drugname = 'BENADRYL';

  • Which drug products have a name ending in ESTROL?

(a) "DIETHYLSTILBESTROL" (11 instances), "STILBESTROL" (12 instances), and "DIENESTROL" (1 instance).
(b) select drugname from product where drugname ~ 'ESTROL$';

  • Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each. (Side note: The therapeutic_potential codes are explained in the ReviewClass_Lookup.txt file, in case you’re interested.)

(a)

ther_potential count
1 15330
2 S* 3
3 S 3459
4 P 928
5 P* 26

(b) select ther_potential, count(*) from application group by ther_potential

  • Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each. (Side note: The chemical_type codes are explained in the ChemTypeLookup.txt file, in case you’re interested.)

(a)

ChemicalType count
1 15292
2 6 87
3 14 34
4 34 9
5 8 27
6 1 1271
7 2 139
8 3 1464
9 10 2
10 23 8
11 4 323
12 5 1046
13 9 1
14 24 8
15 7 35

(b) select chemicalType, count(*) from application group by chemicalType

  • What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?

(a) At first, I got a table with duplicates so the results totaled 140 at first. I then looked online (in this website: https://blog.udemy.com/select-distinct/) and found that I can use the word distinct to remove the duplicates. The resulting table then becomes a table with 67 unique values:
DECADRON, HUMORSOL, NEO-HYDELTRASOL, PRINIVIL, MAXALT-MLT, HYDROCORTONE, PERIACTIN, PROPECIA, PROSCAR, CLINORIL, PRINZIDE, ELSPAR, ALDOMET, ALDORIL D30, EMEND, DIUPRES-500, NEODECADRON, ALDOCLOR-250, MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER, ARAMINE, TIAMATE, BLOCADREN, CANCIDAS, CHIBROXIN, CORTONE, PEPCID, TRUSOPT, REDISOL, VIOXX, FLOROPRYL, ALPHAREDISOL, DECADRON-LA, MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER, PEPCID RPD, NOROXIN, ZOCOR, LERITINE, DECADERM, DOLOBID, MANNITOL 25%, DECADRON W/ XYLOCAINE, ALDORIL D50, TIMOLIDE 10-25, CYCLAINE, HYDROPRES 25, AMINOHIPPURATE SODIUM, MEVACOR, MODURETIC 5-50, ALDORIL 25, SINGULAIR, COLBENEMID, DIUPRES-250, HYDELTRA-TBA, PRIMAXIN, BENEMID, MAXALT, FOSAMAX PLUS D, HYDRODIURIL, HYDELTRASOL, HYDROPRES 50, ZOLINZA, ALDORIL 15, FOSAMAX, DECASPRAY, COGENTIN, ALDOCLOR-150, PEPCID PRESERVATIVE FREE
(b) select distinct drugname from product inner join application on (product.ApplNo = application.ApplNo) where sponsorapplicant = 'MERCK'

  • Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE?

(a) It turns out that there are only 2 unique Sponsor Applicants with the conditions above: "ACTAVIS LABS UT INC" and "WATSON LABS"
(b) select distinct sponsorApplicant from application inner join product on (product.ApplNo = application.ApplNo) where (sponsorApplicant like '%LABS%' and product.activeingred like '%ASPIRIN%' and activeingred like '%CAFFEINE%')

Assignment Links

Weekly Assignments

Individual Journal Entries

Shared Journal Entries