Difference between revisions of "Blitvak Week 6"
 (some changes to errors section)  | 
				 (added final pipeline of commands for application.txt)  | 
				||
| Line 57: | Line 57: | ||
*The Type 2 Error seems to be tied to the presence of a double digit ''Chemical_Type''; it might be remedied by adding <code>sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"</code> near the end of the pipe.  | *The Type 2 Error seems to be tied to the presence of a double digit ''Chemical_Type''; it might be remedied by adding <code>sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"</code> near the end of the pipe.  | ||
**The modified pipe to be tested again is:    | **The modified pipe to be tested again is:    | ||
| − | cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"  | + | <code>cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"</code>  | 
| − | + | ||
| − | + | ||
| − | + | ||
| − | + | ||
| − | + | ||
| − | + | ||
| − | + | ||
| + | *This new pipeline revealed another error that I did not notice earlier (but resulted in the fixing of the previous errors):  | ||
| + |  ERROR:  INSERT has more target columns than expressions  | ||
| + |  LINE 4864: ...atentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Cod...  | ||
| + |  insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(125057,'B','ABBVIE INC',False,False,'AP',null,'V');  | ||
| + | *It seems that every line with a null instead of ''Chemical_Type'' value, prior to an ''Orphan_Code'', is missing a null value (common pattern is <code>'AP',null,'V'</code>)  | ||
| + | **Inserting <code>sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code> near the end of the pipe might correct this issue.  | ||
| + | *The fix worked and the final pipe resulted in a successful table creation. The '''final''' pipeline of commands is:  | ||
| + | <code>cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" | sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code>                                                                | ||
> ~public_html/application.sql.txt  | > ~public_html/application.sql.txt  | ||
Revision as of 01:54, 12 October 2015
Contents
- 1 Individual Journal Assignment Week 6
- 1.1 Downloading and Decompressing Data Files, Other Assignment Preparation
 - 1.2 Working with application.txt
 - 1.3 Defining the appropriate tables for the Application and Product entities
 - 1.4 Process the data files for these entities then load them into those tables
 - 1.5 Questions Regarding Database Creation
 
 
Individual Journal Assignment Week 6
Downloading and Decompressing Data Files, Other Assignment Preparation
- Looking over the  Week 6 Assignment Page, I found that 
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zipcan be used, while I am in the /nfs/home/blitvak directory, to place a .zip containing the data files directly into my personal folder. I found thatunzip UCM054599.zipunzips the files into the personal folder. - I also downloaded and installed pgAdmin III from http://www.pgadmin.org/
 - I spent some time reviewing the PostgreSQL Tutorial
 - I booted up PuTTy and unzipped the files into my home folder
 
Working with application.txt
Opening and Reviewing the File
- I opened up and reviewed the application.txt and Product.txt files using 
more <filename.txt>; I found that the data column labels for application.txt are: 
ApplNo ApplType SponsorApplicant MostRecentLabelAvailableFlag CurrentPatentFlag ActionType Chemical_Type Ther_Potential Orphan_Code
- Reviewing the actual data, and with PostgreSQL in mind, I found that the variable type for each column should be:
 
ApplNo: int(primary key) ApplType: varchar SponsorApplicant: varchar MostRecentLabelAvailableFlag: boolean CurrentPatentFlag: boolean ActionType: varchar Chemical_Type: int Ther_Potential: varchar Orphan_Code: varchar
- I realized that any empty data spaces in application.txt will have to be turned into 
null - I realized that 
sed "1D"will have to be executed in order to remove the first row (which is column labeling) - Referencing the  Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (
\t) instead of commas 
Modifying application.txt
- I opened the file and decided to try to turn the tabs into commas using 
cat application.txt | sed "s/\t/,/g"- Adding onto that, I decided to get rid of the other spaces between the data by using 
sed -e "s/\s\{4,\}//g", which matches 4 whitespaces and removes them. This command was found in a StackExchange post. 
 - Adding onto that, I decided to get rid of the other spaces between the data by using 
 - At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
- I used 
sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g", along with what I already have, in order to turn any extra commas (missing values) into null 
 - I used 
 - My focus now turned to making sure that all data that was identified as being varchar has apostrophes wrapping around it.
- Working with the ApplType data and my previous work, I used 
sed "s/......,/&'/1" | sed "s/'./&'/1"to surround it with apostrophes. - Working with the SponsorApplicant data, I added 
sed "s/',/&'/g" | sed "s/,./'&/3"to wrap apostrophes around the third data type in each line - I used 
sed "s/,/&'/5" | sed "s/'../&'/4", to place apostrophes around the two character data under ActionType 
 - Working with the ApplType data and my previous work, I used 
 - Using 
grep "nullP",grep "nullS", andgrep "nullV"with the current pipeline of commands:cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4"- I noticed that some of the null values are not separated from adjacent data with commas; I added 
sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g"to fix this issue and I checked the result usinggrep - With the Ther_Potential data being now completely separated by commas from the other data, I then proceeded to try to surround it with apostrophes. I first added 
sed "s/,'..',.,/&'/g"to add the first apostrophe, and I noticed that this command led some null values to gain an apostrophe. I addedsed "s/'null/null/g"/<code> to clean them up. I later noticed that some Ther_potential values have asterisks tied to them, I used <code>grep "P\*"andgrep "S\*"to confirm the presence of asterisks. Finally, I addedsed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g"to the pipeline to fully surround the Ther_Potential values. - I surrounded the Orphan_Code variable with apostrophes by adding 
sed "s/,V/,'V'/g"to the pipeline (Orphan_Code is often null but when it is present, it is always a V) 
 - I finished formatting the file by adding 
sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | revto the pipeline. I decided to reverse the file in this set of commands because I could not getsedto add ); to the very end of each line; I decided upon creating a multiple choicesedcommand that worked on the very first match of a line. 
Generating the application.sql.txt file
- I generated the application.sql.txt file using 
cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev > ~/public_html/application.sql.txt - I proceeded to input 
http://my.cs.lmu.edu/~blitvak/application.sql.txtinto my browser and I copied all of the data 
Testing application.sql.txt in pgAdmin III
- I created a table in postgreSQL using 
create table applications (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar) - I pasted the data into the program and noticed these types of errors:
 
1. Type 1
ERROR: syntax error at or near "lse" LINE 17882: ... values(125422,'B','THROMBOGENICS', INC,False,'Fa'lse,AP,nul...
2. Type 2
ERROR: column "s" does not exist LINE 161: ...e) values(008319,'N','NOVARTIS',False,False,'AP',14,S,null); ERROR: column "s" does not exist LINE 360: ...10721,'N','CITRON PHARMA LLC',False,False,'AP',null,S,null); ERROR: column "p" does not exist LINE 617: ...de) values(013025,'N','HOSPIRA',False,False,'AP',14,P,null);
- The Type 1 Error, I noticed was linked to the presence of commas in the original raw data with respect to the many company names, should be fixed by placing 
sed "s/,//g"near the beginning of the pipe. - The Type 2 Error seems to be tied to the presence of a double digit Chemical_Type; it might be remedied by adding 
sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"near the end of the pipe.- The modified pipe to be tested again is:
 
 
cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"
- This new pipeline revealed another error that I did not notice earlier (but resulted in the fixing of the previous errors):
 
ERROR: INSERT has more target columns than expressions LINE 4864: ...atentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Cod... insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(125057,'B','ABBVIE INC',False,False,'AP',null,'V');
- It seems that every line with a null instead of Chemical_Type value, prior to an Orphan_Code, is missing a null value (common pattern is 
'AP',null,'V')- Inserting 
sed "s/'AP',null,'V'/'AP',null,null,'V'/g"near the end of the pipe might correct this issue. 
 - Inserting 
 - The fix worked and the final pipe resulted in a successful table creation. The final pipeline of commands is:
 
cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" | sed "s/'AP',null,'V'/'AP',null,null,'V'/g"                                                              
> ~public_html/application.sql.txt For Product.txt, the column labels were: