Difference between revisions of "Kevin Wyllie Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(Questions)
(Added answers to some of the questions.)
Line 3: Line 3:
 
# '''Provide the DDL (create table) statements that you used for your application and product tables.'''
 
# '''Provide the DDL (create table) statements that you used for your application and product tables.'''
 
#* Application:
 
#* Application:
#**
+
#** <code>create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)</code>
 
# '''Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.'''
 
# '''Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.'''
 
#* Application:
 
#* Application:

Revision as of 02:10, 14 October 2015

Questions

  1. Provide the DDL (create table) statements that you used for your application and product tables.
    • Application:
      • create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
  2. Provide the sed command sequences that you used to convert the raw text files into sequences of SQL insert statements.
    • Application:
      • cat application.txt | sed "s/\t/~/g" | sed -r "s/( ){2}//g" | sed "s/ ~/~/g" | sed "s/~~/~null~/g" | sed "s/~~/~null~/g" | sed "s/~\r$/~null/g" | sed "s/~/~'/1" | sed "s/~/~'/2" | sed "s/~/~'/5" | sed "s/~/~'/7" | sed "s/~/~'/8" | sed "s/~/'~/2" | sed "s/~/'~/3" | sed "s/~/'~/6" | sed "s/~/'~/8" | sed "s/$/'/g" | sed "s/'null'/null/g" | sed "s/~/,/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt

Protocol

Protocol - application.txt

  • Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**

What We Tried First...

  • Kw week6 screenshot1.png
    Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: sed "s/\t/, /g" was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.





  • Kw week6 Screenshot2.png
    Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows: sed -r "s/( ){4}//g". What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces: sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g".





  • Kw week6 Screenshot3.png
    But there were still remaining spaces. So we used sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful.





  • Kw week6 Screenshot4.png
    Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with: sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g". Again, the result seemed satisfactory.






  • The final pipe was:

cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt


  • However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.

...And Why it Didn't Work

  • Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
  • Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in 'APPRECIA PHARMS CO ', visible in the last screenshot) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.
  • For many other reasons, most of which were unknown, we were never able to get pgAdmin to accept these commands.

A New Command

  1. As Anu discovered, one viable character for separating columns is the tilde (~) because it does not appear elsewhere in the file.
    • sed "s/\t/~/g" replaces all tabs with tildes...
  2. Still, the consecutive spaces (between SponsorApplicant and MostRecentLabelAvailableFlag) remain. But removing all spaces is problematic because many of the SponsorApplicant entries have spaces in them, which need to stay in the file.
    • sed "sed -r "s/( ){2}//g" will remove most of the unnecessary spaces without removing those in sponsor applicant names.
  3. But the previous command leaves a few lingering single spaces at the end of sponsor applicant names. However these will always precede a tilde.
    • sed "s/ ~/~/g" will remove these final spaces.
  4. Blank inputs need to be designated with "null." One way to do this is to select for consecutive tildes (remember, tildes now denote the beginning of the column).
    • sed "s/~~/~null~/g" adds "null" between double tildes.
    • In the case of three consecutive tildes, the first execution of this command will interrupt the second pair in the triplet, causing sed to not recognize it. For this reason, this command must be entered twice.
  5. There are also some (many) blank orphan codes, which are in the last column. For this reason, sed has not recognized these as blank entries, up to this point. However, a line ending in a tilde conclusively means that the orphan code for that entry is blank.
    • sed "s/~\r/~null/g" selects for tildes at the very end of the line and adds the last necessary null.
  6. The next step is to add "single quotes" flanking any "varchar" inputs.
    • Start with the leading single quotes: sed "s/~/~'/1" | sed "s/~/~'/2" | sed "s/~/~'/5" | sed "s/~/~'/7" | sed "s/~/~'/8"
    • Then add the following single quotes: sed "s/~/'~/2" | sed "s/~/'~/3" | sed "s/~/'~/6" | sed "s/~/'~/8"
    • When the orphan code isn't blank, it is a varchar, so add a single quote to the end of the line: sed "s/$/'/g"
      • For some reason, the \r$ syntax (mentioned on the week 6 assignment page), used to designate the end of a line, doesn't work here, while the usual $ does.
  7. Blank inputs need to be entered as null specifically. If 'null' is used, this will be considered a standard varchar input.
    • sed "s/'null'/null/g" solves this issue.
  8. Now that column content is flanked by single quotes, the tildes can become commas.
    • sed "s/~/,/g"
  9. Now the SQL syntax must be added to each line, and the modified text file must be uploaded to your server. (See the PostgreSQL Tutorial)
    • sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values(/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt
      • Note: this command is slightly different than on the tutorial. The final single-quote in the second sed command must be removed.
  10. Launch the SQL and create a new table.
    • create table application (ApplNo int primary key, ApplType varchar, SponsorApplicant varchar, MostRecentLabelAvailableFlag boolean, CurrentPatentFlag boolean, ActionType varchar, Chemical_Type int, Ther_Potential varchar, Orphan_Code varchar)
  11. Go to the webpage (my.cs.lmu.edu/~username/application.sql.txt) and copy the text. Paste it into the SQL command window, delete the firs tline, and press the "execute query" button.