Difference between revisions of "Kzebrows Week 6"
(Lab notebook entry question 1.) |
(Lab notebook entry convert Application sed to SQL) |
||
Line 1: | Line 1: | ||
==Questions to Answer== | ==Questions to Answer== | ||
− | For this week's assignment I worked in the | + | For this week's assignment I worked in the lab in Seaver to use the PostgreSQL tool installed on the computers. I used |
kzebrows@my.cs.lmu.edu | kzebrows@my.cs.lmu.edu | ||
followed by my password to log into the LMU CMSI database. | followed by my password to log into the LMU CMSI database. | ||
Line 9: | Line 9: | ||
'''Provide the DDL (<code>create table</code>) statements that you used for your ''application'' and ''product'' tables.''' | '''Provide the DDL (<code>create table</code>) statements that you used for your ''application'' and ''product'' tables.''' | ||
− | This required me to open the [[PostgreSQL Tutorial | | + | This required me to open the [[PostgreSQL Tutorial | PostgreSQL Tutorial]] wiki page. I then created the following commands in SQL, following the protocol we did in class with the movie database. The command began with <code>create table [application/Product]</code>. Then, as Application was the primary key, the first thing I entered in parentheses after the create table command was <code>(ApplNo int primary key</code>; however, because Product has a foreign key as indicated by the assignment page, I just used <code>(ApplNo int</code>. Then I entered the names of each column that I viewed by opening the Product and Application files on Notepad, followed by either ''int'' for integers, ''varchar'' for varying characters, and ''boolean'' for columns with True or False values. The following commands to create tables in SQL looked like this: |
Application: | Application: | ||
Line 15: | Line 15: | ||
create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, | create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, | ||
MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, | MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, | ||
− | Chemical_Type int, Ther_Potential | + | Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar) |
Product: | Product: | ||
Line 29: | Line 29: | ||
'''Provide the <code>sed</code> command sequences that you used to convert the raw text files into sequences of SQL <code>insert</code> statements.''' | '''Provide the <code>sed</code> command sequences that you used to convert the raw text files into sequences of SQL <code>insert</code> statements.''' | ||
+ | |||
+ | This part of the assignment required the most time and effort. I worked closely with my homework partner, Anu, and also with Brandon K. It took my homework partner and I around four hours to figure out the first command. | ||
+ | |||
+ | Anu explained a lot of what she had worked out for the Application file and then from there I was able to figure out the majority of the Product file. With the Application process, she explained to me what she had done and I attempted to write the commands on my own computer to execute it. She explained to me that when looking at the file, the first step is to replace all tabs with a character and then any spaces afterwards. We couldn't just replace each tab/space combo with a comma, however, because commas and other characters exist within the data itself in the Sponsor Applicant column. Initially a star was chosen to replace these spaces/tabs; however, later it became apparent that there were stars in the data itself, so they were replaced with tildes ~. | ||
+ | |||
+ | cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | ||
+ | |||
+ | The above sequence of commands produced the following in <code>sed</code>: | ||
+ | |||
+ | 207917~N~GALDERMA LABS~False~False~AP~5~S~ | ||
+ | 207925~N~VERTEX PHARMS INC~False~False~AP~3~P~ | ||
+ | 207926~N~AKORN INC~False~False~AP~7~S~ | ||
+ | 207931~N~ABBVIE INC~False~False~AP~5~P~ | ||
+ | 207946~N~JANSSEN PHARMS~False~False~AP~5~P~ | ||
+ | 207949~N~ACCORD HLTHCARE INC~False~False~TA~5~~ | ||
+ | |||
+ | It then became apparent that there were some instances where there were two or three tildes in a row that needed to be replaced with the word "null" indicating empty values, as leaving that column empty would translate into an error when applied in PostgreSQL. This was done by adding the command <code> sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g"</code>. | ||
+ | |||
+ | Next it was clear that each tilde needed to be replaced with a comma. This command was found in the [[More Text Processing Features | More Text Processing Features]] wiki because we needed to specify how to replace a certain instance in each line of data with a comma and/or apostrophe combo. Because certain features need to be interpreted as text they require apostrophes around them (e.g. Sponsor Applicant column), the sed command <code>sed "s/pattern/replacement/instance to be replaced"</code> was used. At first I tried <code>sed "s/~/,'/1" | sed "s/~/','/2"</code. This only changed the first instance and the third instance, reminding me that it would always be the first instance as the computer reads the commands in order. Apostrophes needed to be added around the ApplType, Sponsor Applicant, Action Type, Ther Potential, and Orphan Code columns. The line then looked like this: | ||
+ | |||
+ | cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | | ||
+ | sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | | ||
+ | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | | ||
+ | |||
+ | Which yielded this: | ||
+ | |||
+ | 207946,'N','JANSSEN PHARMS',False,False,'AP',5,'P','null | ||
+ | 207949,'N','ACCORD HLTHCARE INC',False,False,'TA',5,'null','null | ||
+ | 207955,'A','SPEAR PHARMS INC',False,False,'AP',null,'null','null | ||
+ | 207958,'N','APRECIA PHARMS CO',False,False,'AP',3,'P','null | ||
+ | |||
+ | Next, another apostrophe needed to be added to the end of each line, and each instance of <code>'null</code> or <code>null'</code> needed to be replaced with simply the word ''null''. This was done by adding the commands <code>sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g"</code> to the end of the command line. Then the data was inserted into the application based on what we did in class as well as in the ''Prepare Data for Insertion'' section of the PostgreSQL Tutorial page.Finally, the first line needed to be deleted as indicated in the Week 6 assignment page, which was done using the command <code> sed ID </code>. | ||
Command to convert Application (written by Anu with edits from Kristin) | Command to convert Application (written by Anu with edits from Kristin) | ||
Line 38: | Line 70: | ||
SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential, | SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential, | ||
Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt | Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt | ||
+ | |||
+ | I was able to visit my | ||
Command to convert Product (written by Kristin with edits from Anu) | Command to convert Product (written by Kristin with edits from Anu) | ||
Line 53: | Line 87: | ||
product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) | product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) | ||
values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt | values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt | ||
+ | |||
+ | This portion of the assignment presented me with many challenges. As a biology major with no prior experience, I'm still not completely comfortable with the command line, although things that were difficult earlier in the semester are second nature to me now, which shows me that I'm making progress in learning how to use this new language. |
Revision as of 18:57, 13 October 2015
Questions to Answer
For this week's assignment I worked in the lab in Seaver to use the PostgreSQL tool installed on the computers. I used
kzebrows@my.cs.lmu.edu
followed by my password to log into the LMU CMSI database.
Next, following the Direct Download/Unzipping Commands section of the Week 6 Assignment, I downloaded the FDA drug file directly to my.cs.lmu.edu and unzipped it. I opened each file using the cat application.txt
or cat Product.txt
commands to view them first.
Provide the DDL (create table
) statements that you used for your application and product tables.
This required me to open the PostgreSQL Tutorial wiki page. I then created the following commands in SQL, following the protocol we did in class with the movie database. The command began with create table [application/Product]
. Then, as Application was the primary key, the first thing I entered in parentheses after the create table command was (ApplNo int primary key
; however, because Product has a foreign key as indicated by the assignment page, I just used (ApplNo int
. Then I entered the names of each column that I viewed by opening the Product and Application files on Notepad, followed by either int for integers, varchar for varying characters, and boolean for columns with True or False values. The following commands to create tables in SQL looked like this:
Application:
create table Application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
Product:
create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode int, ReferenceDrug int, drugname varchar, activeingred varchar)
Note that originally, the TECode column was entered as int; however, upon creating the command (see next question) to convert the files from sed
to SQL insert
statements, I realized that because not everything in this code was an integer (SQL pointed this out to me with an instance of AA in the TECode column when I tried to insert the data into the SQL table) the command needed to be changed.
create table Product (ApplNo int, ProductNo int, Form varchar, Dosage varchar, ProductMktStatus varchar, TECode varchar, ReferenceDrug int, drugname varchar, activeingred varchar)
Provide the sed
command sequences that you used to convert the raw text files into sequences of SQL insert
statements.
This part of the assignment required the most time and effort. I worked closely with my homework partner, Anu, and also with Brandon K. It took my homework partner and I around four hours to figure out the first command.
Anu explained a lot of what she had worked out for the Application file and then from there I was able to figure out the majority of the Product file. With the Application process, she explained to me what she had done and I attempted to write the commands on my own computer to execute it. She explained to me that when looking at the file, the first step is to replace all tabs with a character and then any spaces afterwards. We couldn't just replace each tab/space combo with a comma, however, because commas and other characters exist within the data itself in the Sponsor Applicant column. Initially a star was chosen to replace these spaces/tabs; however, later it became apparent that there were stars in the data itself, so they were replaced with tildes ~.
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g"
The above sequence of commands produced the following in sed
:
207917~N~GALDERMA LABS~False~False~AP~5~S~ 207925~N~VERTEX PHARMS INC~False~False~AP~3~P~ 207926~N~AKORN INC~False~False~AP~7~S~ 207931~N~ABBVIE INC~False~False~AP~5~P~ 207946~N~JANSSEN PHARMS~False~False~AP~5~P~ 207949~N~ACCORD HLTHCARE INC~False~False~TA~5~~
It then became apparent that there were some instances where there were two or three tildes in a row that needed to be replaced with the word "null" indicating empty values, as leaving that column empty would translate into an error when applied in PostgreSQL. This was done by adding the command sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g"
.
Next it was clear that each tilde needed to be replaced with a comma. This command was found in the More Text Processing Features wiki because we needed to specify how to replace a certain instance in each line of data with a comma and/or apostrophe combo. Because certain features need to be interpreted as text they require apostrophes around them (e.g. Sponsor Applicant column), the sed command sed "s/pattern/replacement/instance to be replaced"
was used. At first I tried sed "s/~/,'/1" | sed "s/~/','/2"</code. This only changed the first instance and the third instance, reminding me that it would always be the first instance as the computer reads the commands in order. Apostrophes needed to be added around the ApplType, Sponsor Applicant, Action Type, Ther Potential, and Orphan Code columns. The line then looked like this:
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" |
Which yielded this:
207946,'N','JANSSEN PHARMS',False,False,'AP',5,'P','null 207949,'N','ACCORD HLTHCARE INC',False,False,'TA',5,'null','null 207955,'A','SPEAR PHARMS INC',False,False,'AP',null,'null','null 207958,'N','APRECIA PHARMS CO',False,False,'AP',3,'P','null
Next, another apostrophe needed to be added to the end of each line, and each instance of <code>'null or null'
needed to be replaced with simply the word null. This was done by adding the commands sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g"
to the end of the command line. Then the data was inserted into the application based on what we did in class as well as in the Prepare Data for Insertion section of the PostgreSQL Tutorial page.Finally, the first line needed to be deleted as indicated in the Week 6 assignment page, which was done using the command sed ID
.
Command to convert Application (written by Anu with edits from Kristin)
cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ \~/~/g" | sed "s/\~ /~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\~/~null~/g" | sed "s/\~\r$/~null/g" | sed "s/~/,'/1" | sed "s/~/','/1" | sed "s/~/',/1" | sed "s/~/,/1" | sed "s/~/,'/1" | sed "s/~/',/1" | sed "s/~/,'/1" | sed "s/~/','/1" |sed "s/~/,'/1" | sed "s/\r$/'/g" | sed "s/'null'/null/g" | sed "s/'null/null/g" | sed "s/^/insert into application(ApplNo,ApplType, SponsorApplicant, MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential, Orphan_Code) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/application.sql.txt
I was able to visit my
Command to convert Product (written by Kristin with edits from Anu)
cat Product.txt | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Register determination that product was not discontinued or withdrawn for safety or efficacy\*\*//g" | sed "s/\*\*Federal Register notice determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/\*\*Federal Registration determination that product was not discontinued or withdrawn for safety or efficacy reasons\*\*//g" | sed "s/See current Annual Edition, 1.8 Description of Special Situations, Levothyroxine Sodium//g" | sed "s/'//g" | sed "s/\t/*/g" | sed -r "s/( ){4}//g" | sed -r "s/( ){2}//g" | sed "s/ \*/*/g" | sed "s/\* /*/g" | sed "s/\*\*/*null*/g" | sed "s/\*\*/*null*/g" | sed "s/\*/,/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/',/1" | sed "s/\*/,'/1" | sed "s/\*/','/1" | sed "s/\r$/'/g" | sed "s/^/insert into product(ApplNo,ProductNo,Form,Dosage,ProductMktStatus,TECode,ReferenceDrug,drugname,activeingred) values(/g" | sed "s/$/);/g" | sed 1D > ~/public_html/Product.sql.txt
This portion of the assignment presented me with many challenges. As a biology major with no prior experience, I'm still not completely comfortable with the command line, although things that were difficult earlier in the semester are second nature to me now, which shows me that I'm making progress in learning how to use this new language.