Blitvak Week 6

From LMU BioDB 2015
Revision as of 00:12, 12 October 2015 by Blitvak (Talk | contribs) (middle segment of application.txt)

Individual Journal Assignment Week 6

Downloading and Decompressing Data Files, Other Assignment Preparation

Working with application.txt

Opening and Reviewing the File

  • I opened and reviewed the application.txt and Product.txt files using more <filename.txt>. The column labels for application.txt are:
ApplNo  ApplType  SponsorApplicant  MostRecentLabelAvailableFlag  CurrentPatentFlag  ActionType  Chemical_Type  Ther_Potential  Orphan_Code
  • After reviewing the actual data with PostgreSQL in mind, I decided that the type for each column should be:
ApplNo: int (primary key)  ApplType: varchar  SponsorApplicant: varchar  MostRecentLabelAvailableFlag: boolean  CurrentPatentFlag: boolean  ActionType: varchar  Chemical_Type: int  Ther_Potential: varchar  Orphan_Code: varchar
  • I realized that any empty fields in application.txt will have to be turned into null
  • I realized that sed "1D" will have to be run in order to remove the first row (which holds the column labels)
  • Referencing the Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (\t) instead of commas
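
The review steps above can be sketched on a small made-up sample. The two-column sample rows are hypothetical and only imitate the tab-separated layout described above, and list_columns, drop_header, and sample are illustrative helper names rather than anything from the assignment. The sketch uses the lowercase 1d, which behaves the same as 1D when sed holds a single line at a time.

```shell
#!/bin/sh
# Hypothetical helpers; each reads tab-separated text on standard input.

# Print the column labels from the first row, one per line.
list_columns() {
  head -n 1 | tr -d '\r' | tr '\t' '\n'
}

# Remove the first row (the column labels), leaving only the data rows.
drop_header() {
  sed '1d'
}

# Made-up two-column sample; the real application.txt has nine columns.
sample() {
  printf 'ApplNo\tApplType\n000004\tN\n000159\tN\n'
}

sample | list_columns
sample | drop_header
```

Running the same helpers on the real file would look like list_columns < application.txt, assuming the file is in the current directory.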

Modifying application.txt

  • I opened the file and started by turning the tabs into commas using cat application.txt | sed "s/\t/,/g"
    • Adding onto that, I removed the remaining padding between the data by using sed -e "s/\s\{4,\}//g", which matches any run of four or more whitespace characters and deletes it. This command was found in a StackExchange post.
  • At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
    • In addition to the commands above, I used sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" to turn any extra commas (missing values) into null
  • My focus then turned to making sure that all data identified as varchar is wrapped in apostrophes.
    • Working with the ApplType data and my previous work, I used sed "s/......,/&'/1" | sed "s/'./&'/1" to surround it with apostrophes.
    • Working with the SponsorApplicant data, I added sed "s/',/&'/g" | sed "s/,./'&/3" to wrap apostrophes around the third field in each line
    • I used sed "s/,/&'/5" | sed "s/'../&'/4" to place apostrophes around the two-character data under ActionType
  • Using grep "nullP" and grep "nullS" with the current pipeline of commands:
    • cat application.txt | sed "1D" | sed "s/\t/,/g" | sed -e "s/\s\{4,\}//g" | sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g" | sed "s/......,/&'/1" | sed "s/'./&'/1" | sed "s/',/&'/g" | sed "s/,./'&/3" | sed "s/,/&'/5" | sed "s/'../&'/4"
    • I noticed that some of the null values were not separated from the adjacent data by a comma (the s/,,/,null/ substitution consumes the second comma, so a following S or P ends up attached to null); I added sed "s/nullS/null,S/g" | sed "s/nullP/null,P/g" to fix this issue and I checked the result using grep
    • With the Ther_Potential data now completely separated from the other data by commas, I proceeded to surround it with apostrophes. I first added sed "s/,'..',.,/&'/g" to add the opening apostrophe, and I noticed that this command also caused some null values to gain an apostrophe, so I added sed "s/'null/null/g" to clean them up. I later noticed that some Ther_Potential values have asterisks attached to them; I used grep "P\*" and grep "S\*" to confirm the presence of asterisks. Finally, I added sed "s/'S,null/'S',null/g" | sed "s/'P,null/'P',null/g" | sed "s/'S,V/'S',V/g" | sed "s/'P,V/'P',V/g" | sed "s/'P\*/'P\*'/g" | sed "s/'S\*/'S\*'/g" to the pipeline to fully surround the Ther_Potential values.
    • I surrounded the Orphan_Code variable with apostrophes by adding sed "s/,V/,'V'/g" to the pipeline (Orphan_Code is often null but when it is present, it is always a V)
  • I finished formatting the file by adding sed "s/^/insert into applications(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | rev | sed -r "s/llun|'V'/;)&/1" | rev to the pipeline. I reversed each line in this set of commands because I could not get sed to add ); to the very end of each line. On a reversed line the end of the original line comes first, so I wrote a sed command with an alternation (llun, which is null reversed, or 'V') that acts only on the first match in the line; the inserted ;) becomes ); once the line is reversed back.
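
For comparison, the null filling and statement wrapping can also be sketched in a more compact form. This is not the pipeline used above but an alternative under stated assumptions: lines end in a carriage return (suggested by the \r$ patterns above), fields are already comma-separated, no field contains a comma of its own, and the first field is never empty. The helper names, the sample rows, and the shortened column list are hypothetical. Keeping the trailing comma in the replacement avoids attaching null to the next field, and the loop repeats until no empty field remains. A leftover carriage return is one possible reason an end-of-line append can look wrong (an assumption; the source does not say), so the sketch strips carriage returns before appending ); with s/$/);/.

```shell
#!/bin/sh
# Hypothetical helpers; both read comma-separated lines on standard input.

# Strip carriage returns, then turn every empty field into null.
# The :a ... ta loop repeats the substitution until no ",," is left,
# and the last expression handles an empty final field.
fill_nulls() {
  tr -d '\r' |
    sed -e ':a' -e 's/,,/,null,/' -e 'ta' -e 's/,$/,null/'
}

# Wrap each line in an insert statement.
# The column list is shortened to two columns for this illustration.
wrap_insert() {
  sed -e 's/^/insert into applications(ApplNo,ApplType) values(/' -e 's/$/);/'
}

# Made-up row with two empty middle fields and an empty last field.
printf '000004,,,S,\r\n' | fill_nulls

# Made-up two-field row, carriage return removed before wrapping.
printf "000004,'N'\r\n" | tr -d '\r' | wrap_insert
```

The design choice here is to normalize line endings once at the start, so every later expression can anchor on a plain $ instead of \r$.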

Generating the application.sql.txt file

  • I generated the application.sql.txt file by redirecting the output of the pipeline with > ~public_html/application.sql.txt

For Product.txt, the column labels were:


Defining the appropriate tables for the Application and Product entities

Process the data files for these entities then load them into those tables

Questions Regarding Database Creation