Difference between revisions of "Kevin Wyllie Week 6"

From LMU BioDB 2015
Jump to: navigation, search
(More formatting fixes.)
(Just saving my progress in case of timeout.)
Line 2: Line 2:
  
  
==Electronic Journal Assignment==
+
==Electronic Journal==
 +
 
 +
===application.txt===
  
 
*Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**
 
*Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**
  
===What We Tried First...===
+
====What We Tried First...====
  
 
* [[file:kw_week6_screenshot1.png|right|thumb]]Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: <code>sed "s/\t/, /g"</code> was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.  
 
* [[file:kw_week6_screenshot1.png|right|thumb]]Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: <code>sed "s/\t/, /g"</code> was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.  
Line 52: Line 54:
 
* However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.
 
* However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.
  
===...And Why it Didn't Work===
+
====...And Why it Didn't Work====
  
 
* Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
 
* Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
 
* Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in <code>'APPRECIA PHARMS CO '</code>, visible in the last screenshot) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.
 
* Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in <code>'APPRECIA PHARMS CO '</code>, visible in the last screenshot) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.
 +
* For many other reasons, most of which were unknown, we were never able to get pgAdmin to accept these commands.
 +
 +
====A New Command====
 +
 +
* As Anu discovered, one viable character for separating columns is the tilde (<code>~</code>) because it does not appear elsewhere in the file.
 +
** <code>sed "s/\t/~/g"</code> replaces all tabs with tildes...
 +
* Still, the consecutive spaces (between SponsorApplicant and MostRecentLabelAvailableFlag) remain. But removing all spaces is problematic because many of the SponsorApplicant entries have spaces in them, which need to stay in the file.
 +
** <code>sed "sed -r "s/( ){2}//g"</code> will remove most of the unnecessary spaces without removing those in sponsor applicant names.
 +
* But the previous command leaves a few lingering single spaces at the end of sponsor applicant names. However these will ''always'' precede a tilde.
 +
** <code>sed "s/ ~/~/g"</code> will remove these final spaces.

Revision as of 04:51, 13 October 2015

Questions

Electronic Journal

application.txt

  • Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**

What We Tried First...

  • Kw week6 screenshot1.png
    Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: sed "s/\t/, /g" was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.





  • Kw week6 Screenshot2.png
    Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows: sed -r "s/( ){4}//g". What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces: sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g".





  • Kw week6 Screenshot3.png
    But there were still remaining spaces. So we used sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful.





  • Kw week6 Screenshot4.png
    Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with: sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g". Again, the result seemed satisfactory.






  • The final pipe was:

cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt


  • However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.

...And Why it Didn't Work

  • Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
  • Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in 'APPRECIA PHARMS CO ', visible in the last screenshot) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.
  • For many other reasons, most of which were unknown, we were never able to get pgAdmin to accept these commands.

A New Command

  • As Anu discovered, one viable character for separating columns is the tilde (~) because it does not appear elsewhere in the file.
    • sed "s/\t/~/g" replaces all tabs with tildes...
  • Still, the consecutive spaces (between SponsorApplicant and MostRecentLabelAvailableFlag) remain. But removing all spaces is problematic because many of the SponsorApplicant entries have spaces in them, which need to stay in the file.
    • sed "sed -r "s/( ){2}//g" will remove most of the unnecessary spaces without removing those in sponsor applicant names.
  • But the previous command leaves a few lingering single spaces at the end of sponsor applicant names. However these will always precede a tilde.
    • sed "s/ ~/~/g" will remove these final spaces.