Kevin Wyllie Week 6

From LMU BioDB 2015
Revision as of 04:13, 13 October 2015 by Kwyllie (Talk | contribs) (Just saving my progress in case of timeout.)

Jump to: navigation, search

Questions

Electronic Journal Assignment

  • Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**

What We Tried First...

  • Kw week6 screenshot1.png
    Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: sed "s/\t/, /g" was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.





  • Kw week6 Screenshot2.png
    Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows: sed -r "s/( ){4}//g". What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces: sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g".





  • Kw week6 Screenshot3.png
    But there were still remaining spaces. So we used sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful.





  • Kw week6 Screenshot4.png
    Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with: sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g". Again, the result seemed satisfactory.
  • The final pipe was:

cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt

  • However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.

...And Why it Didn't Work

  • Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
  • Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in 'APPRECIA PHARMS CO ') which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.