Difference between revisions of "Blitvak Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(adding some details about product.txt)
(added output link, cleaning)
Line 67: Line 67:
 
**Inserting <code>sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code> near the end of the pipe might correct this issue.
 
**Inserting <code>sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code> near the end of the pipe might correct this issue.
 
*The fix worked and the final pipe resulted in a successful table creation. The '''final''' pipeline of commands is:
 
*The fix worked and the final pipe resulted in a successful table creation. The '''final''' pipeline of commands is:
<code>cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" | sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code>                                                            
+
<code>cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" | sed "s/'AP',null,'V'/'AP',null,null,'V'/g"</code>
 +
*I replaced the previously made ''application.sql.txt'' with the output of the '''final''' pipeline: [http://my.cs.lmu.edu/~blitvak/application.sql.txt Output]                                                         
 
   
 
   
> ~public_html/application.sql.txt
 
For ''Product.txt'', the column labels were:
 
  
 
===Working with ''Product.txt''===
 
===Working with ''Product.txt''===

Revision as of 04:36, 12 October 2015

Individual Journal Assignment Week 6

Downloading and Decompressing Data Files, Other Assignment Preparation

Working with application.txt

Opening and Reviewing the File

  • I opened up and reviewed the application.txt and Product.txt files using more <filename.txt>; I found that the data column labels for application.txt are:
ApplNo  ApplType  SponsorApplicant  MostRecentLabelAvailableFlag  CurrentPatentFlag  ActionType  Chemical_Type  Ther_Potential  Orphan_Code
  • Reviewing the actual data, and with PostgreSQL in mind, I found that the variable type for each column should be:
ApplNo: int(primary key)  ApplType: varchar  SponsorApplicant: varchar  MostRecentLabelAvailableFlag: boolean  CurrentPatentFlag: boolean  ActionType: varchar  Chemical_Type: int 
Ther_Potential: varchar  Orphan_Code: varchar          
  • I realized that any empty data spaces in application.txt will have to be turned into null
  • I realized that sed "1D" will have to be executed in order to remove the first row (which is column labeling)
  • Referencing the Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (\t) instead of commas

Modifying application.txt

  • I opened the file and decided to try to turn the tabs into commas using cat application.txt | sed "s/\t/,/g"
    • Adding onto that, I decided to get rid of the other spaces between the data by using sed -e "s/\s\{4,\}//g", which matches 4 whitespaces and removes them. This command was found in a StackExchange post.
  • At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
    • I used sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g", along with what I already have, in order to turn any extra commas (missing values) into null
  • My focus now turned to making sure that all data that was identified as being varchar has apostrophes wrapping around it.
    • Working with the ApplType data and my previous work, I used sed "s/......,/&'/1" | sed "s/'./&'/1" to surround it with apostrophes.
    • Working with the SponsorApplicant data, I added sed "s/',/&'/g" | sed "s/,./'&/3" to wrap apostrophes around the third data type in each line
    • I used sed "s/,/&'/5" | sed "s/'../&'/4", to place apostrophes around the two character data under ActionType
  • Using grep "nullP", grep "nullS", and grep "nullV" with the current pipeline of commands:
    • cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4"
    • I noticed that some of the null values are not separated from adjacent data with commas; I added sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" to fix this issue and I checked the result using grep
    • With the Ther_Potential data being now completely separated by commas from the other data, I then proceeded to try to surround it with apostrophes. I first added sed "s/,'..',.,/&'/g" to add the first apostrophe, and I noticed that this command led some null values to gain an apostrophe. I added sed "s/'null/null/g"/<code> to clean them up. I later noticed that some Ther_potential values have asterisks tied to them, I used <code>grep "P\*" and grep "S\*" to confirm the presence of asterisks. Finally, I added sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" to the pipeline to fully surround the Ther_Potential values.
    • I surrounded the Orphan_Code variable with apostrophes by adding sed "s/,V/,'V'/g" to the pipeline (Orphan_Code is often null but when it is present, it is always a V)
  • I finished formatting the file by adding sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev to the pipeline. I decided to reverse the file in this set of commands because I could not get sed to add ); to the very end of each line; I decided upon creating a multiple choice sed command that worked on the very first match of a line.

Generating the application.sql.txt file

  • I generated the application.sql.txt file using cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev > ~/public_html/application.sql.txt
  • I proceeded to input http://my.cs.lmu.edu/~blitvak/application.sql.txt into my browser and I copied all of the data

Testing application.sql.txt in pgAdmin III

  • I created a table in postgreSQL using create table applications (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
  • I pasted the data into the program and noticed these types of errors:

1. Type 1

ERROR:  syntax error at or near "lse"
LINE 17882: ... values(125422,'B','THROMBOGENICS', INC,False,'Fa'lse,AP,nul...

2. Type 2

ERROR:  column "s" does not exist
LINE 161: ...e) values(008319,'N','NOVARTIS',False,False,'AP',14,S,null);
ERROR:  column "s" does not exist
LINE 360: ...10721,'N','CITRON PHARMA LLC',False,False,'AP',null,S,null);
ERROR:  column "p" does not exist
LINE 617: ...de) values(013025,'N','HOSPIRA',False,False,'AP',14,P,null);
  • The Type 1 Error, I noticed was linked to the presence of commas in the original raw data with respect to the many company names, should be fixed by placing sed "s/,//g" near the beginning of the pipe.
  • The Type 2 Error seems to be tied to the presence of a double digit Chemical_Type; it might be remedied by adding sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" near the end of the pipe.
    • The modified pipe to be tested again is:

cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g"


  • This new pipeline revealed another error that I did not notice earlier (but resulted in the fixing of the previous errors):
ERROR:  INSERT has more target columns than expressions
LINE 4864: ...atentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Cod...
insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(125057,'B','ABBVIE INC',False,False,'AP',null,'V');
  • It seems that every line with a null instead of Chemical_Type value, prior to an Orphan_Code, is missing a null value (common pattern is 'AP',null,'V')
    • Inserting sed "s/'AP',null,'V'/'AP',null,null,'V'/g" near the end of the pipe might correct this issue.
  • The fix worked and the final pipe resulted in a successful table creation. The final pipeline of commands is:

cat application.txt | sed "s/,//g" | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" | sed "s/nullS/null,S/g" | sed "s/nullV/null,V/g" | sed "s/nullP/null,P/g" | sed "s/,'..',.,/&'/g" | sed "s/'null/null/g" | sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" | sed "s/,V/,'V'/g" | sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev | sed "s/,S,/,'S',/;s/,P,/,'P',/;s/,S\*,/,'S\*',/;s/,P\*,/,'P\*',/g" | sed "s/'AP',null,'V'/'AP',null,null,'V'/g"

  • I replaced the previously made application.sql.txt with the output of the final pipeline: Output


Working with Product.txt

Opening and Reviewing Product.txt

  • Product.txt was examined using more Product.txt and the column labels and variable type were found to be:
ApplNo(int)    ProductNo(int)    Form(varchar)    Dosage(varchar)    ProductMktStatus(int)    TECode(varchar)    ReferenceDrug(int)    drugname(varchar)    activeingred(varchar)

Modifying Product.txt

  • I removed the first row (column labels) and replaced all of the tabs with commas; I also turned the extra commas into null values: cat Product.txt | sed "1D" | sed "s/\t/,/g" | sed "s/,,/,null,/g"
  • I began placing the apostrophes around the varchar data with the Form, Dosage, and TECode data. I placed apostrophes around this data by adding several sed replacements to the pipeline made earlier: sed "s/,/&'/2" | sed "s/,/'&/3" | sed "s/,/&'/3" | sed "s/,/'&/4" | sed "s/,/&'/5" | sed "s/,/'&/6"

Defining the appropriate tables for the Application and Product entities

Process the data files for these entities then load them into those tables

Questions Regarding Database Creation