Difference between revisions of "Blitvak Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(some more details into section involving application.txt)
(more details (commands) regarding application.txt)
Line 15: Line 15:
 
*I realized that any empty data spaces in ''application.txt'' will have to be turned into <code>null</code>
 
*I realized that any empty data spaces in ''application.txt'' will have to be turned into <code>null</code>
 
*I realized that <code>sed "1D"</code> will have to be executed in order to remove the first row (which is column labeling)
 
*I realized that <code>sed "1D"</code> will have to be executed in order to remove the first row (which is column labeling)
 +
*Referencing the [[Week 6| Week 6 Assignment Page]], I learned that the data within these text files is separated by ''tabs'' (<code>\t</code>) instead of commas
 +
*I opened the file and decided to try to turn the tabs into commas using <code>cat application.txt | sed "s/\t/,/g"</code>
 +
**Adding onto that, I decided to get rid of the other spaces between the data by using <code>sed -e "s/\s\{4,\}//g"</code>, which matches 4 whitespaces and removes them. This command was found in a [http://superuser.com/questions/112834/how-to-match-whitespace-in-sed StackExchange] post.
 +
*At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
 +
**I used <code>sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g"</code>, along with what I already have, in order to turn any extra commas (missing values) into ''null''
 +
*My focus now turned to making sure that all data that was identified as being ''varchar'' has apostrophes wrapping around it.
 +
**Working with the ''ApplType'' data and my previous work, I used <code>sed "s/......,/&'/1" | sed "s/'./&'/1"</code> to surround it with apostrophes.
 +
**Working with the ''SponsorApplicant'' data, I added <code>sed "s/',/&'/g" | sed "s/,./'&/3"</code> to wrap apostrophes around the third data type in each line
 +
 +
 +
  
  

Revision as of 21:30, 11 October 2015

Individual Journal Assignment Week 6

Downloading and Decompressing Data Files, Other Assignment Preparation

Working with application.txt

  • I opened up and reviewed the application.txt and Product.txt files using more <filename.txt>; I found that the data column labels for application.txt are:
ApplNo  ApplType  SponsorApplicant  MostRecentLabelAvailableFlag  CurrentPatentFlag  ActionType  Chemical_Type  Ther_Potential  Orphan_Code
  • Reviewing the actual data, and with PostgreSQL in mind, I found that the variable type for each column should be:
ApplNo: int(primary key)  ApplType: varchar  SponsorApplicant: varchar  MostRecentLabelAvailableFlag: boolean  CurrentPatentFlag: boolean  ActionType: varchar  Chemical_Type: int 
Ther_Potential: varchar  Orphan_Code: varchar          
  • I realized that any empty data spaces in application.txt will have to be turned into null
  • I realized that sed "1D" will have to be executed in order to remove the first row (which is column labeling)
  • Referencing the Week 6 Assignment Page, I learned that the data within these text files is separated by tabs (\t) instead of commas
  • I opened the file and decided to try to turn the tabs into commas using cat application.txt | sed "s/\t/,/g"
    • Adding onto that, I decided to get rid of the other spaces between the data by using sed -e "s/\s\{4,\}//g", which matches 4 whitespaces and removes them. This command was found in a StackExchange post.
  • At this point, I noticed that many lines had extra commas either at the ends of the lines or in the middle, indicating missing or nonexistent values
    • I used sed "s/,,,\r$/,null,null,null/;s/,,\r$/,null,null/;s/,\r$/,null/1" | sed "s/,,/,null/;s/,,,/,null,null/g", along with what I already have, in order to turn any extra commas (missing values) into null
  • My focus now turned to making sure that all data that was identified as being varchar has apostrophes wrapping around it.
    • Working with the ApplType data and my previous work, I used sed "s/......,/&'/1" | sed "s/'./&'/1" to surround it with apostrophes.
    • Working with the SponsorApplicant data, I added sed "s/',/&'/g" | sed "s/,./'&/3" to wrap apostrophes around the third data type in each line



For Product.txt, the column labels were:


Defining the appropriate tables for the Application and Product entities

Process the data files for these entities then load them into those tables

Questions Regarding Database Creation