Difference between revisions of "Kevin Wyllie Week 6"
From LMU BioDB 2015
(Image didn't work for some reason, fixing that.) |
(Just saving my progress in case of timeout.) |
||
Line 4: | Line 4: | ||
==Electronic Journal Assignment== | ==Electronic Journal Assignment== | ||
− | + | *Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).** | |
===What We Tried First...=== | ===What We Tried First...=== | ||
− | * [[file:kw_week6_screenshot1.png|right|thumb]]Although columns | + | * [[file:kw_week6_screenshot1.png|right|thumb]]Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command: <code>sed "s/\t/, /g"</code> was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too. |
Line 16: | Line 16: | ||
− | |||
+ | * [[file:kw_week6_Screenshot2.png|right|thumb]]Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows: <code>sed -r "s/( ){4}//g"</code>. What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces: <code>sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g"</code>. | ||
Line 24: | Line 24: | ||
− | * [[file:Kw week6 Screenshot3.png|right|thumb]]But there were still remaining spaces. So we used <code>sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g"</code> to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). | + | |
+ | |||
+ | * [[file:Kw week6 Screenshot3.png|right|thumb]]But there were still remaining spaces. So we used <code>sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g"</code> to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | * [[file:kw_week6_Screenshot4.png|right|thumb]]Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with: <code>sed "s/''/null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g"</code>. Again, the result seemed satisfactory. | ||
+ | |||
+ | * The final pipe was: | ||
+ | <code>cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s/''/null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt</code> | ||
+ | |||
+ | * However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work. | ||
+ | |||
+ | ===...And Why it Didn't Work=== | ||
+ | |||
+ | * Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers. | ||
+ | * Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in <code>'APPRECIA PHARMS CO '</code>) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name. |
Revision as of 04:13, 13 October 2015
Contents
Questions
Electronic Journal Assignment
- Veronica and I worked quite a bit with Anu, initially. To no avail, we spent approximately 3.5 hours attempting to convert application.txt into an SQL-friendly file (Sunday, 10/11/15).**
What We Tried First...
- Although columns were separated by tabs, in many cases they were also separated by a variable number of spaces (depending on the length of the Sponsor Applicant name). At the time, we thought this was our main obstacle in rewriting the lines for pgAdmin. The command:
sed "s/\t/, /g"
was an attempt to replace all tabs between columns with a comma followed by a space (we chose a comma since it is what SQL uses). But of course, this was when we discovered the extra spaces. You'll also notice that the formatting of the rows is inconsistent (a minority of the rows actually did what we expected them to do). This stressed us out too.
- Opting to address the parts of the file that I did know how to address, my next command was to remove the four spaces between the ApplType and SponsorApplicant columns, something that was consistent across all of the rows:
sed -r "s/( ){4}//g"
. What I had failed to realize was that this would still remove groups of four spaces from the aforementioned gaps between the SponsorApplicant and MostRecentLabelAvailableFlag columns, leaving the gaps as either doublet or triplet spaces. Seeing this, I added two more commands to remove these remaining spaces:sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g"
.
- But there were still remaining spaces. So we used
sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g"
to eliminate these spaces and add "single-quotes" around each entry (since those that don't **need** single-quotes can still function properly with them). This returned something that looked hopeful.
- Through prior experimentation with pgAdmin, we'd discovered that "null" was required for blank spaces in entries. So the final step was to add these nulls, with:
sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g"
. Again, the result seemed satisfactory.
- The final pipe was:
cat application.txt | sed "s/\t/, /g" | sed -r "s/( ){4}//g" | sed -r "s/( ){3}//g" | sed -r "s/( ){2}//g" | sed "s/ , False/, False/g" | sed "s/ , True/, True/g" | sed "s/, /,/g" | sed "s/,/','/g" | sed "s/^/'/g" | sed "s//null/g" | sed "s/,'\r$/,null/g" | sed "s/,'.\r$/&'/g" | sed "s/^/insert into application(ApplNo,ApplType,SponsorApplicant,MostRecentLabelAvailableFlag,CurrentPatentFlag,ActionType,Chemical_Type,Ther_Potential,Orphan_Code) values('/g" | sed "s/$/);/g" > ~/public_html/application.sql.txt
- However, after several rounds of adding the SQL syntax, attempting to past the result into pgAdmin, and receiving various error messages, we realized this somewhat convoluted pipe was not going to work.
...And Why it Didn't Work
- Most importantly, a few of the sponsor applicant names contained commas, which is a problem for this pipe, which assumes commas are used only as column dividers.
- Second, for reasons I can't explain, this pipe still leaves spaces after some sponsor applicant names (like in
'APPRECIA PHARMS CO '
) which may not seem like a serious issue, but would make searching for these names in SQL difficult because the space would be considered part of the name.