Kevin Wyllie Week 6
From LMU BioDB 2015
Contents
Questions
- Provide the DDL (create table) statements that you used for your application and product tables.
- Application:
-
create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
-
- Product
-
create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TEcode varchar, ReferenceDrug int, drugname varchar, activeingredient varchar)
-
- Application:
- Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
- Application:
-
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ ~/~/g" | sed "s/~~/~null~/g" | sed "s/~~/~null~/g" | sed "s/~\r$/~null/g" | sed "s/~/~'/1" | sed "s/~/~'/2" | sed "s/~/~'/5" | sed "s/~/~'/7" | sed "s/~/~'/8" | sed "s/~/'~/2" | sed "s/~/'~/3" | sed "s/~/'~/6" | sed "s/~/'~/8" | sed "s/$/'/g" | sed "s/'null'/null/g" | sed "s/~/,/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt
-
- Product:
-
cat Product.txt | sed "s/'/\"/g" | sed "s/\t/~/g" | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/~~/~null~/g" | sed "s/~/~'/2" | sed "s/~/~'/3" | sed "s/~/~'/5" | sed "s/~/~'/7" |sed "s/~/~'/8" | sed "s/~/'~/3" | sed "s/~/'~/4" | sed "s/~/'~/6" | sed "s/~/'~/8" | sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/~/,/g" | sed "s/^/insert into Product (ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TEcode,ReferenceDrug,drugname,activeingredient) values(/g" | sed "s/$/);/g" > ~/public_html/Product.sql.txt
-
- Application:
- Using the command line, how can you determine the number of records in each file? Provide the command.
-
cat application.txt | wc
will give the number of lines, words and characters. - Application
- The number of lines is 19747, so the number of entries is 19746 (because the first line isn't an entry).
- Product
- The number of lines is 32771, so the number of entries is 32770.
-
- Using SQL, how can you determine the number of records in the table corresponding to the file? Provide the SQL select statement.
-
select count(*) from application
- Returns 19746 (same as before, minus the previously mentioned first line)
-
select count(*) from Product
- Returns 32770 entries (same as before, minus the previously mentioned first line)
-
- In your database, are these numbers the same or different? Explain why you think so.
- They're the same, because I already took into account the fact that the first line is not an actual entry.
- What are the names of the drug products that are administered in the form INJECTABLE;INTRAVENOUS, SUBCUTANEOUS?
-
select drugname from product where form = 'INJECTABLE;INTRAVENOUS, SUBCUTANEOUS'
- LOVENOX, VELCADE, VIDAZA, ENOXAPARIN SODIUM, ACTEMRA and AZACITIDINE.
-
- What are the names of the drug products whose active ingredient (activeingred) is ATROPINE?
-
select drugname from product where activeingredient = 'ATROPINE'
- Note, the column name in my table is
activeingredient
instead ofactiveingred
.
- Note, the column name in my table is
- ATROPINE and ATROPEN.
-
- In what forms and dosages can the drug product named BENADRYL be administered?
-
select form, dosage from Product where drugname = 'BENADRYL'
-
- Which drug products have a name ending in ESTROL?
-
select drugname from Product where drugname like '%ESTROL%'
- DIETHYLSTILBESTROL, STILBESTROL, DIENESTROL, MEGESTROL and ACETATE.
-
- Produce a table listing all of the known values for the therapeutic_potential column in the application table and how many application records there are of each.
-
select ther_potential, count (*) from application where ther_potential like '%' group by ther_potential
-
- Produce a table listing all of the known values for the chemical_type column in the application table and how many application records there are of each.
-
select chemical_type, count (*) from application where chemical_type >=0 or chemical_type <0 group by chemical_type
-
- What are the names of the drug products that are sponsored (sponsor applicant column) by MERCK?
-
select drugname from product inner join application on (product.ApplNo = application.ApplNo) where SponsorApplicant = 'MERCK' group by drugname
- "DECADRON", "HUMORSOL", "NEO-HYDELTRASOL", "PRINIVIL", "MAXALT-MLT", "HYDROCORTONE", "PERIACTIN", "PROPECIA", "PROSCAR", "CLINORIL", "PRINZIDE", "ELSPAR", "ALDOMET", "ALDORIL D30", "EMEND", "DIUPRES-500", "NEODECADRON", "ALDOCLOR-250", "MEFOXIN IN DEXTROSE 5% IN PLASTIC CONTAINER", "ARAMINE", "TIAMATE", "BLOCADREN", "CANCIDAS", "CHIBROXIN", "CORTONE", "PEPCID", "TRUSOPT", "REDISOL", "VIOXX", "FLOROPRYL", "ALPHAREDISOL", "DECADRON-LA", "MEFOXIN IN SODIUM CHLORIDE 0.9% IN PLASTIC CONTAINER", "PEPCID RPD", "NOROXIN", "ZOCOR", "LERITINE", "DECADERM", "DOLOBID", "MANNITOL 25%", "DECADRON W/ XYLOCAINE", "ALDORIL D50", "TIMOLIDE 10-25", "CYCLAINE", "HYDROPRES 25", "AMINOHIPPURATE SODIUM", "MEVACOR", "MODURETIC 5-50", "ALDORIL 25", "SINGULAIR", "COLBENEMID", "DIUPRES-250", "HYDELTRA-TBA", "PRIMAXIN", "BENEMID", "MAXALT", "FOSAMAX PLUS D", "HYDRODIURIL", "HYDELTRASOL", "HYDROPRES 50", "ZOLINZA", "ALDORIL 15", "FOSAMAX", "DECASPRAY", "COGENTIN", "ALDOCLOR-150", "PEPCID PRESERVATIVE FREE"
-
- Which sponsor applicant companies have the text LABS in their names and have products whose active ingredients (activeingred) include both ASPIRIN and CAFFEINE
-
select SponsorApplicant, from application inner join product on (application.ApplNo = product.ApplNo) where SponsorApplicant like '%LABS%' and activeingredient like '%ASPIRIN%' and activeingredient like '%CAFFEINE%' group by SponsorApplicant
- ACTAVIS LABS UT INC and WATSON LABS.
-
Protocol
Protocol - application.txt
- Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**
What We Tried First...
- Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command:
sed "s/\t/, /g"
was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.
- Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows:
sed -r "s/( ){4}//g"
. What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces:sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g"
.
- But there were still remaining spaces. So we used
sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g"
to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful.
- Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with:
sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g"
. Again, the result seemed satisfactory.
- The final pipe was:
cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt
- However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.
...And Why it Didn't Work
- Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
- Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in
'APPRECIA PHARMS CO '
, visible in the last screenshot) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name. - For many other reasons, most of which were unknown, we were never able to get pgAdmin to accept these commands.
A New Command
- As Anu discovered, one viable character for temporarily separating columns is the tilde (
~
) because it does not appear elsewhere in the file.-
sed "s/\t/~/g"
replaces all tabs with tildes...
-
- Still, the consecutive spaces (between SponsorApplicant and MostRecentLabelAvailableFlag) remain. But removing all spaces is problematic because many of the SponsorApplicant entries have spaces in them, which need to stay in the file.
-
sed "sed -r "s/( ){2}//g"
will remove most of the unnecessary spaces without removing those in sponsor applicant names.
-
- But the previous command leaves a few lingering single spaces at the end of sponsor applicant names. However these will always precede a tilde.
-
sed "s/ ~/~/g"
will remove these final spaces.
-
- Blank inputs need to be designated with "null." One way to do this is to select for consecutive tildes (remember, tildes now denote the beginning of the column, so the null should succeed the tilde).
-
sed "s/~~/~null~/g"
adds "null" between double tildes. - In the case of three consecutive tildes, the first execution of this command will interrupt the second pair in the triplet, causing sed to not recognize it. For this reason, this command must be entered twice.
-
- There are also some (many) blank orphan codes, which are in the last column. For this reason, sed has not recognized these as blank entries, up to this point. However, a line ending in a tilde conclusively means that the orphan code for that entry is blank.
-
sed "s/~\r/~null/g"
selects for tildes at the very end of the line and adds the last necessary null.
-
- The next step is to add "single quotes" flanking any "varchar" inputs.
- Start with the leading single quotes:
sed "s/~/~'/1" | sed "s/~/~'/2" | sed "s/~/~'/5" | sed "s/~/~'/7" | sed "s/~/~'/8"
- Then add the following single quotes:
sed "s/~/'~/2" | sed "s/~/'~/3" | sed "s/~/'~/6" | sed "s/~/'~/8"
- When the orphan code isn't blank, it is a varchar, so add a single quote to the end of the line:
sed "s/$/'/g"
- For some reason, the
\r$
syntax (mentioned on the week 6 assignment page), used to designate the end of a line, doesn't work here, while the usual$
does.
- For some reason, the
- Start with the leading single quotes:
- Blank inputs need to be entered as
null
specifically. If'null'
is used, this will be considered a standard varchar input.-
sed "s/'null'/null/g"
solves this issue.
-
- Now that column content is flanked by single quotes, the tildes can become commas.
-
sed "s/~/,/g"
-
- Now the SQL syntax must be added to each line, and the modified text file must be uploaded to your server. (See the PostgreSQL Tutorial)
-
sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt
- Note: this command is slightly different than on the tutorial. The final single-quote in the second sed command must be removed.
-
- Launch the SQL and create a new table (see answers to questions above for command).
- Go to the webpage (my.cs.lmu.edu/~username/application.sql.txt) and copy the text. Paste it into the SQL command window, delete the firs tline, and press the "execute query" button.
Protocol - Product.txt
- A grep search of the file returned no tildes, so again, that's a candidate for (temporarily) separating lines.
-
sed "s/\t/~/g"
-
- However, it was later discovered that there are some instances of single-quotes. These should be converted to double-quotes.
-
sed "s/'/\"/g"
-
- A quick visual scan of the original file reveals that there are several different intervening notices throughout. They all appear to begin with the pattern
Federal
so a grep search for this pattern revealed two variations of these notices (shown on right). These can/should be removed.-
sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g"
-
sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g"
-
- To insert the appropriate null's, use the command from the application.txt protocol.
-
sed "s/~~/~null~/g"
-
- As with application.txt, the next step is to flank all varchar columns with single-quotes.
-
sed "s/~/~'/2" | sed "s/~/~'/3" | sed "s/~/~'/5" | sed "s/~/~'/7" | sed "s/~/~'/8"
-
sed "s/~/'~/3" | sed "s/~/'~/4" | sed "s/~/'~/6" | sed "s/~/'~/8" | sed "s/\r$/'/g"
-
- Also mirroring the application.txt protocol, there are now some instances of
'null'
in the file which must be changed tonull
.-
sed "s/'null'/null/g"
-
- As before, now that varchars are flanked by single-quotes, tildes can be replaced with commas.
-
sed "s/~/,/g"
-
- Now add the SQL syntax (plus save the file onto a webpage).
-
sed "s/^/insert into Product (ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TEcode,ReferenceDrug,drugname,activeingredient) values(/g" | sed "s/$/);/g" > ~/public_html/Product.sql.txt
- Paste the body of text into the SQL window as done before.
-
Links
- Kevin Wyllie Week 2 (See the original assignment and class journal.)
- Kevin Wyllie Week 3 (See the original assignment and class journal.)
- Kevin Wyllie Week 4 (See the original assignment and class journal.)
- Kevin Wyllie Week 5 (See the original assignment and class journal.)
- Kevin Wyllie Week 6 (See the original assignment and class journal.)
- Kevin Wyllie Week 7 (See the original assignment and class journal.)
- Kevin Wyllie Week 8 (See the original assignment and class journal.)
- Kevin Wyllie Week 9 (See the original assignment and class journal.)
- Kevin Wyllie Week 10 (See the original assignment.)
- Kevin Wyllie Week 11 (See the original assignment.)
- Kevin Wyllie Week 12 (See the original assignment.)
- Kevin Wyllie Week 14 (See the original assignment.)
- Kevin Wyllie Week 15 (See the original assignment.)