Kzebrows Week 6
Questions to Answer
For this week's assignment I worked in the lab in Seaver to use the PostgreSQL tool installed on the computers. I used
kzebrows@my.cs.lmu.edu
followed by my password to log into the LMU CMSI database.
Next, following the Direct Download/Unzipping Commands section of the Week 6 Assignment, I downloaded the FDA drug file directly to my.cs.lmu.edu and unzipped it. I opened each file using the cat application.txt
or cat Product.txt
commands to view them first.
Provide the DDL (create table
) statements that you used for your application and product tables.
This required me to open the PostgreSQL Tutorial wiki page. I then created the following commands in SQL, following the protocol we did in class with the movie database. The command began with create table [application/Product]
. Then, as Application was the primary key, the first thing I entered in parentheses after the create table command was (ApplNo int primary key
; however, because Product has a foreign key as indicated by the assignment page, I just used (ApplNo int
. Then I entered the names of each column that I viewed by opening the Product and Application files on Notepad, followed by either int for integers, varchar for varying characters, and boolean for columns with True or False values. The following commands to create tables in SQL looked like this:
Application:
create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
Product:
create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode int, ReferenceDrug int, drugname varchar, activeingred varchar)
Note that originally, the TECode column was entered as int; however, upon creating the command (see next question) to convert the files from sed
to SQL insert
statements, I realized that because not everything in this code was an integer (SQL pointed this out to me with an instance of AA in the TECode column when I tried to insert the data into the SQL table) the command needed to be changed.
create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
Provide the sed
command sequences that you used to convert the raw text files into sequences of SQL insert
statements.
This part of the assignment required the most time and effort. I worked closely with my homework partner, Anu, and also with Brandon K. It took my homework partner and I around four hours to figure out the first command.
Anu explained a lot of what she had worked out for the Application file and then from there I was able to figure out the majority of the Product file. With the Application process, she explained to me what she had done and I attempted to write the commands on my own computer to execute it. She explained to me that when looking at the file, the first step is to replace all tabs with a character and then any spaces afterwards. We couldn't just replace each tab/space combo with a comma, however, because commas and other characters exist within the data itself in the Sponsor Applicant column. Initially a star was chosen to replace these spaces/tabs; however, later it became apparent that there were stars in the data itself, so they were replaced with tildes ~.
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g"
The above sequence of commands produced the following in sed
:
207917~N~GALDERMA LABS~False~False~AP~5~S~ 207925~N~VERTEX PHARMS INC~False~False~AP~3~P~ 207926~N~AKORN INC~False~False~AP~7~S~ 207931~N~ABBVIE INC~False~False~AP~5~P~ 207946~N~JANSSEN PHARMS~False~False~AP~5~P~ 207949~N~ACCORD HLTHCARE INC~False~False~TA~5~~
It then became apparent that there were some instances where there were two or three tildes in a row that needed to be replaced with the word "null" indicating empty values, as leaving that column empty would translate into an error when applied in PostgreSQL. This was done by adding the command sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g"
.
Next it was clear that each tilde needed to be replaced with a comma. This command was found in the More Text Processing Features wiki because we needed to specify how to replace a certain instance in each line of data with a comma and/or apostrophe combo. Because certain features need to be interpreted as text they require apostrophes around them (e.g. Sponsor Applicant column), the sed command sed "s/pattern/replacement/instance to be replaced"
was used. At first I tried sed "s/~/,'/1" | sed "s/~/','/2"
. This only changed the first instance and the third instance, reminding me that it would always be the first instance as the computer reads the commands in order. Apostrophes needed to be added around the ApplType, Sponsor Applicant, Action Type, Ther Potential, and Orphan Code columns. The line then looked like this:
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" |
Which yielded this:
207946,'N','JANSSEN PHARMS',False,False,'AP',5,'P','null 207949,'N','ACCORD HLTHCARE INC',False,False,'TA',5,'null','null 207955,'A','SPEAR PHARMS INC',False,False,'AP',null,'null','null 207958,'N','APRECIA PHARMS CO',False,False,'AP',3,'P','null
Next, another apostrophe needed to be added to the end of each line, and each instance of 'null
or null'
needed to be replaced with simply the word null. This was done by adding the commands sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g"
to the end of the command line. Then the data was inserted into the application based on what we did in class as well as in the Prepare Data for Insertion section of the PostgreSQL Tutorial page. Finally, the first line needed to be deleted as indicated in the Week 6 assignment page, which was done using the command sed ID
.
Command to convert Application (written by Anu with edits from Kristin)
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType, SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential, Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt
I was able to visit my page at http://my.cs.lmu.edu/~kzebrows/application.sql.txt
, select all, and then add the data to PostgreSQL successfully.
Next, to determine the sed
command that would enable me to insert the Product data into PostgreSQL, I started with similar protocol as in the Application file. First, I needed to replace all tabs/spaces with a character. In this case it was all right to use a star (although it had to be designated as \*) because the only stars appeared in the Federal Register warnings as I will address later. It was very similar to the Application command except for the placement of some of the apostrophe/comma combos. In the case of this file, apostrophes needed to be placed around the Form, Dosage, TECode, drugname, and activeingred column. Once I figured out that the TECode column is varchar, not int, the sequence of commands that I tried went like this:
cat Product.txt | sed "s/\t/*/g" | sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" |
This gave me the following output:
207955,001,'GEL;TOPICAL','0.05%',1,'AB',0,'TRETINOIN','TRETINOIN' 207958,001,'TABLET;ORAL','250MG',1,'null',0,'SPRITAM','LEVETIRACETAM' 207958,002,'TABLET;ORAL','500MG',1,'null',0,'SPRITAM','LEVETIRACETAM' 207958,003,'TABLET;ORAL','750MG',1,'null',0,'SPRITAM','LEVETIRACETAM' 207958,004,'TABLET;ORAL','1GM',1,'null',0,'SPRITAM','LEVETIRACETAM'
It appeared as though the data was ready to be inserted into the PostgreSQL table from here. I tried it using the same protocol as before and received an error message that the star was appearing in the data set itself. I searched for these instances and found that they occurred in the following forms:
**Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons** **Federal Register determination that product was not discontinued or withdrawn for safety or efficacy** **Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons** **Federal Registration determination that product was not discontinued or withdrawn for safety or efficacy reasons**
Unfortunately this note, while effectively the same message, was not consistent. We had to remove each of them manually using sed "s/\*\*message\*\*//g"
.
I tried running it again and received another error message, this time due to another note in the database. I removed it in a similar fashion using the command sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g"
. Note that these deletions had to be done at the beginning of the sequence so they did not prove to be a problem later on in the sequence.
After deleting the first line using the command sed 1D
, we were finally able to insert this data into the system by adding sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g"
.
Final command sequence to convert Product (written by Kristin with edits from Anu)
cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy\*\*//g" | sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g" | sed "s/'//g" | sed "s/\t/*/g" | sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt
This portion of the assignment presented me with many challenges. As a biology major with no prior experience, I'm still not completely comfortable with the command line, although things that were difficult earlier in the semester are second nature to me now, which shows me that I'm making progress in learning how to use this new language. My homework partner was very helpful as she has had prior experience with coding, but without her it would have been much harder. Each file came with its own difficulties: with the Product.txt file, there was a lot of trial and error once actually trying to insert the data into PostgreSQL because of the different notes and deletions we had to make as a result. In the Application file, trying to delete spaces/tabs combo was a challenge. As Dr. Dionisio helpfully advised, we could have replaced all instances of \t using the command sed "s/\t/','/g"
. Then, however, we would have to delete the ones we didn't need manually. Instead of trying that and then working backwards we decided to add each comma/apostrophe combo manually, although we recognize that we could have done it the other way, too.