Blitvak Week 6

Individual Journal Assignment Week 6

Downloading and Decompressing Data Files, Other Assignment Preparation

Looking over the Week 6 Assignment Page, I found that curl -O http://www.fda.gov/downloads/Drugs/InformationOnDrugs/UCM054599.zip can be used, while I am in the /nfs/home/blitvak directory, to place a .zip containing the data files directly into my personal folder. I found that unzip UCM054599.zip unzips the files into the personal folder.
I also downloaded and installed pgAdmin III from http://www.pgadmin.org/
I spent some time reviewing the PostgreSQL Tutorial
I booted up PuTTy and unzipped the files into my home folder

Working with application.txt

Opening and Reviewing the File

I opened up and reviewed the application.txt and Product.txt files using more <filename.txt>; I found that the data column labels for application.txt are:

ApplNo  ApplType  SponsorApplicant  MostRecentLabelAvailableFlag  CurrentPatentFlag  ActionType  Chemical_Type  Ther_Potential  Orphan_Code

Reviewing the actual data, and with PostgreSQL in mind, I found that the variable type for each column should be:

ApplNo: int(primary key)  ApplType: varchar  SponsorApplicant: varchar  MostRecentLabelAvailableFlag: boolean  CurrentPatentFlag: boolean  ActionType: varchar  Chemical_Type: int 
Ther_Potential: varchar  Orphan_Code: varchar

I realized that any empty data spaces in application.txt will have to be turned into null
I realized that sed "1D" will have to be executed in order to remove the first row (which is column labeling)
Referencing the Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (\t) instead of commas

Modifying application.txt

I opened the file and decided to try to turn the tabs into commas using cat application.txt | sed "s/\t/,/g"
- Adding onto that, I decided to get rid of the other spaces between the data by using sed -e "s/\s\{4,\}//g", which matches 4 whitespaces and removes them. This command was found in a StackExchange post.
At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
- I used sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g", along with what I already have, in order to turn any extra commas (missing values) into null
My focus now turned to making sure that all data that was identified as being varchar has apostrophes wrapping around it.
- Working with the ApplType data and my previous work, I used sed "s/......,/&'/1" | sed "s/'./&'/1" to surround it with apostrophes.
- Working with the SponsorApplicant data, I added sed "s/',/&'/g" | sed "s/,./'&/3" to wrap apostrophes around the third data type in each line
- I used sed "s/,/&'/5" | sed "s/'../&'/4", to place apostrophes around the two character data under ActionType
Using grep "nullP" and grep "nullS" with the current pipeline of commands:
- cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4"
- I noticed that some of the null values are not separated from adjacent data with commas; I added sed "s/nullS/null,S/g" | sed "s/nullP/null,P/g" to fix this issue and I checked the result using grep
- With the Ther_Potential data being now completely separated by commas from the other data, I then proceeded to try to surround it with apostrophes. I first added sed "s/,'..',.,/&'/g" to add the first apostrophe, and I noticed that this command led some null values to gain an apostrophe. I added sed "s/'null/null/g"/<code> to clean them up. I later noticed that some Ther_potential values have asterisks tied to them, I used <code>grep "P\*" and grep "S\*" to confirm the presence of asterisks. Finally, I added sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" to the pipeline to fully surround the Ther_Potential values.
- I surrounded the Orphan_Code variable with apostrophes by adding sed "s/,V/,'V'/g" to the pipeline (Orphan_Code is often null but when it is present, it is always a V)
I finished formatting the file by adding sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev to the pipeline. I decided to reverse the file in this set of commands because I could not get sed to add ); to the very end of each line; I decided upon creating a multiple choice sed command that worked on the very first match of a line.

Blitvak Week 6

Contents

Individual Journal Assignment Week 6

Downloading and Decompressing Data Files, Other Assignment Preparation

Working with application.txt

Opening and Reviewing the File

Modifying application.txt

Generating the application.sql.txt file

Defining the appropriate tables for the Application and Product entities

Process the data files for these entities then load them into those tables

Questions Regarding Database Creation

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools