Difference between revisions of "Blitvak Week 6"
From LMU BioDB 2015
(added large pipeline to first sect) |
(middle segment of application.txt) |
||
Line 8: | Line 8: | ||
===Working with ''application.txt''=== | ===Working with ''application.txt''=== | ||
+ | |||
+ | ====Opening and Reviewing the File==== | ||
*I opened up and reviewed the ''application.txt'' and ''Product.txt'' files using <code>more <filename.txt></code>; I found that the data column labels for ''application.txt'' are: | *I opened up and reviewed the ''application.txt'' and ''Product.txt'' files using <code>more <filename.txt></code>; I found that the data column labels for ''application.txt'' are: | ||
ApplNo ApplType SponsorApplicant MostRecentLabelAvailableFlag CurrentPatentFlag ActionType Chemical_Type Ther_Potential Orphan_Code | ApplNo ApplType SponsorApplicant MostRecentLabelAvailableFlag CurrentPatentFlag ActionType Chemical_Type Ther_Potential Orphan_Code | ||
Line 16: | Line 18: | ||
*I realized that <code>sed "1D"</code> will have to be executed in order to remove the first row (which is column labeling) | *I realized that <code>sed "1D"</code> will have to be executed in order to remove the first row (which is column labeling) | ||
*Referencing the [[Week 6| Week 6 Assignment Page]], I learned that the data within these text files is separated by ''tabs'' (<code>\t</code>) instead of commas | *Referencing the [[Week 6| Week 6 Assignment Page]], I learned that the data within these text files is separated by ''tabs'' (<code>\t</code>) instead of commas | ||
+ | ====Modifying ''application.txt''==== | ||
*I opened the file and decided to try to turn the tabs into commas using <code>cat application.txt | sed "s/\t/,/g"</code> | *I opened the file and decided to try to turn the tabs into commas using <code>cat application.txt | sed "s/\t/,/g"</code> | ||
**Adding onto that, I decided to get rid of the other spaces between the data by using <code>sed -e "s/\s\{4,\}//g"</code>, which matches 4 whitespaces and removes them. This command was found in a [http://superuser.com/questions/112834/how-to-match-whitespace-in-sed StackExchange] post. | **Adding onto that, I decided to get rid of the other spaces between the data by using <code>sed -e "s/\s\{4,\}//g"</code>, which matches 4 whitespaces and removes them. This command was found in a [http://superuser.com/questions/112834/how-to-match-whitespace-in-sed StackExchange] post. | ||
Line 24: | Line 27: | ||
**Working with the ''SponsorApplicant'' data, I added <code>sed "s/',/&'/g" | sed "s/,./'&/3"</code> to wrap apostrophes around the third data type in each line | **Working with the ''SponsorApplicant'' data, I added <code>sed "s/',/&'/g" | sed "s/,./'&/3"</code> to wrap apostrophes around the third data type in each line | ||
**I used <code>sed "s/,/&'/5" | sed "s/'../&'/4"</code>, to place apostrophes around the two character data under ''ActionType'' | **I used <code>sed "s/,/&'/5" | sed "s/'../&'/4"</code>, to place apostrophes around the two character data under ''ActionType'' | ||
− | + | *Using <code>grep "nullP"</code> and <code>grep "nullS"</code> with the current pipeline of commands: | |
− | <code>cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" </code> | + | **<code>cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4" </code> |
+ | **I noticed that some of the null values are not separated from adjacent data with commas; I added <code>sed "s/nullS/null,S/g" | sed "s/nullP/null,P/g"</code> to fix this issue and I checked the result using <code>grep</code> | ||
+ | **With the ''Ther_Potential'' data being now completely separated by commas from the other data, I then proceeded to try to surround it with apostrophes. I first added <code>sed "s/,'..',.,/&'/g"</code> to add the first apostrophe, and I noticed that this command led some null values to gain an apostrophe. I added <code>sed "s/'null/null/g"/<code> to clean them up. I later noticed that some ''Ther_potential'' values have asterisks tied to them, I used <code>grep "P\*"</code> and <code>grep "S\*"</code> to confirm the presence of asterisks. Finally, I added <code>sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g"</code> to the pipeline to fully surround the ''Ther_Potential'' values. | ||
+ | **I surrounded the ''Orphan_Code'' variable with apostrophes by adding <code>sed "s/,V/,'V'/g"</code> to the pipeline (''Orphan_Code'' is often null but when it is present, it is always a ''V'') | ||
+ | *I finished formatting the file by adding <code>sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev</code> to the pipeline. I decided to reverse the file in this set of commands because I could not get <code>sed</code> to add '');'' to the very end of each line; I decided upon creating a multiple choice <code>sed</code> command that worked on the very first match of a line. | ||
+ | |||
+ | ====Generating the ''application.sql.txt'' file==== | ||
+ | *I generated the ''application.sql.txt'' file | ||
+ | |||
+ | |||
+ | |||
+ | |||
− | + | ||
+ | > ~public_html/application.sql.txt | ||
For ''Product.txt'', the column labels were: | For ''Product.txt'', the column labels were: | ||
Revision as of 00:12, 12 October 2015
Contents
- 1 Individual Journal Assignment Week 6
- 1.1 Downloading and Decompressing Data Files, Other Assignment Preparation
- 1.2 Working with application.txt
- 1.3 Defining the appropriate tables for the Application and Product entities
- 1.4 Process the data files for these entities then load them into those tables
- 1.5 Questions Regarding Database Creation
Individual Journal Assignment Week 6
Downloading and Decompressing Data Files, Other Assignment Preparation
- Looking over the Week 6 Assignment Page, I found that
curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip
can be used, while I am in the /nfs/home/blitvak directory, to place a .zip containing the data files directly into my personal folder. I found thatunzip UCM054599.zip
unzips the files into the personal folder. - I also downloaded and installed pgAdmin III from http://www.pgadmin.org/
- I spent some time reviewing the PostgreSQL Tutorial
- I booted up PuTTy and unzipped the files into my home folder
Working with application.txt
Opening and Reviewing the File
- I opened up and reviewed the application.txt and Product.txt files using
more <filename.txt>
; I found that the data column labels for application.txt are:
ApplNo ApplType SponsorApplicant MostRecentLabelAvailableFlag CurrentPatentFlag ActionType Chemical_Type Ther_Potential Orphan_Code
- Reviewing the actual data, and with PostgreSQL in mind, I found that the variable type for each column should be:
ApplNo: int(primary key) ApplType: varchar SponsorApplicant: varchar MostRecentLabelAvailableFlag: boolean CurrentPatentFlag: boolean ActionType: varchar Chemical_Type: int Ther_Potential: varchar Orphan_Code: varchar
- I realized that any empty data spaces in application.txt will have to be turned into
null
- I realized that
sed "1D"
will have to be executed in order to remove the first row (which is column labeling) - Referencing the Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (
\t
) instead of commas
Modifying application.txt
- I opened the file and decided to try to turn the tabs into commas using
cat application.txt | sed "s/\t/,/g"
- Adding onto that, I decided to get rid of the other spaces between the data by using
sed -e "s/\s\{4,\}//g"
, which matches 4 whitespaces and removes them. This command was found in a StackExchange post.
- Adding onto that, I decided to get rid of the other spaces between the data by using
- At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
- I used
sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g"
, along with what I already have, in order to turn any extra commas (missing values) into null
- I used
- My focus now turned to making sure that all data that was identified as being varchar has apostrophes wrapping around it.
- Working with the ApplType data and my previous work, I used
sed "s/......,/&'/1" | sed "s/'./&'/1"
to surround it with apostrophes. - Working with the SponsorApplicant data, I added
sed "s/',/&'/g" | sed "s/,./'&/3"
to wrap apostrophes around the third data type in each line - I used
sed "s/,/&'/5" | sed "s/'../&'/4"
, to place apostrophes around the two character data under ActionType
- Working with the ApplType data and my previous work, I used
- Using
grep "nullP"
andgrep "nullS"
with the current pipeline of commands:cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4"
- I noticed that some of the null values are not separated from adjacent data with commas; I added
sed "s/nullS/null,S/g" | sed "s/nullP/null,P/g"
to fix this issue and I checked the result usinggrep
- With the Ther_Potential data being now completely separated by commas from the other data, I then proceeded to try to surround it with apostrophes. I first added
sed "s/,'..',.,/&'/g"
to add the first apostrophe, and I noticed that this command led some null values to gain an apostrophe. I addedsed "s/'null/null/g"/<code> to clean them up. I later noticed that some Ther_potential values have asterisks tied to them, I used <code>grep "P\*"
andgrep "S\*"
to confirm the presence of asterisks. Finally, I addedsed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g"
to the pipeline to fully surround the Ther_Potential values. - I surrounded the Orphan_Code variable with apostrophes by adding
sed "s/,V/,'V'/g"
to the pipeline (Orphan_Code is often null but when it is present, it is always a V)
- I finished formatting the file by adding
sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev
to the pipeline. I decided to reverse the file in this set of commands because I could not getsed
to add ); to the very end of each line; I decided upon creating a multiple choicesed
command that worked on the very first match of a line.
Generating the application.sql.txt file
- I generated the application.sql.txt file
> ~public_html/application.sql.txt
For Product.txt, the column labels were: