Difference between revisions of "Blitvak Week 4"

From LMU BioDB 2015
(second edit of the -10/-35 section)
(finished the -35/-10 section)
Line 3: Line 3:
 
===Finding the -35 and -10 boxes of the promoter===
 
===Finding the -35 and -10 boxes of the promoter===
  
*I first found ''infA-E.coli-K12.txt'' by entering the correct directory, <code>cd ~dondi/xmlpipedb/data</code>
+
*I first found ''infA-E.coli-K12.txt'' by entering the correct directory, <code>cd ~dondi/xmlpipedb/data</code>.
*I copied the sequence kept in that file on an external space for future reference/checking
+
*I copied the sequence kept in that file on an external space for future reference/checking.
*I assumed that the sequence is the mRNA-like strand and that it runs from 5'- 3'
+
*I assumed that the sequence is the mRNA-like strand and that it runs from 5'- 3'.
 
*By reading the [[Week 4 | Week 4 Assignment Page]], I found that the -10 box is generally <code>[ct]at[at]at</code>, and that the -35 box is generally <code>tt[gt]ac[at]</code>
 
*By reading the [[Week 4 | Week 4 Assignment Page]], I found that the -10 box is generally <code>[ct]at[at]at</code>, and that the -35 box is generally <code>tt[gt]ac[at]</code>
*I skimmed over the [[More Text Processing Features |More Text Processing Features page]], and I found that <code><nowiki>sed "s/Title/<h1>&<\/h1>/g"</nowiki></code> results in an output of <nowiki><h1>Title</h1></nowiki>; this command would be useful in tagging the sequence with its various parts
+
*I skimmed over the [[More Text Processing Features |More Text Processing Features page]], and I found that <code><nowiki>sed "s/Title/<h1>&<\/h1>/g"</nowiki></code> results in an output of <nowiki><h1>Title</h1></nowiki>; this command would be useful in tagging the sequence with its various parts.
*I then tried <code>cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g"| sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"</code>
+
*I then tried <code>cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g"| sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"</code>.
 
**Gave me the output:                           
 
**Gave me the output:                           
 
  ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
 
  ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
Line 14: Line 14:
 
  aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa
 
  aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa
 
  gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
 
  gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
*I realized that there are two possibilities for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first instance of a "minus10box" is to be ignored)
+
*I realized that there are two possibilities for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first instance of a "minus10box" is to be ignored).
*I looked at the [[Week 4 | Week 4 Assignment Page]], and I found that there is an ideal  number of 17 base pairs between the -35 and -10 box. Only <code><nowiki><minus35box>tttact</minus35box></nowiki></code> fits this criteria (is 17 bp away from <code><nowiki><minus10box>cattat</minus10box></nowiki></code>)
+
*I looked at the [[Week 4 | Week 4 Assignment Page]], and I found that there is an ideal  number of 17 base pairs between the -35 and -10 box. Only <code><nowiki><minus35box>tttact</minus35box></nowiki></code> fits this criteria (is 17 bp away from <code><nowiki><minus10box>cattat</minus10box></nowiki></code>).
 +
*In the [[More Text Processing Features |More Text Processing Features page]], I found that <code>sed "s/paragraph/&\n/g"</code> results in a line-break right after the pattern gets matched/replaced (this would be useful in making the text more manageable for ''sed'', as it is line centric). From referencing that same page, I found that replacing the ''g'' in the ''s///g'' format with a number, ''n'', will result in ''sed'' replacing only the ''n''th match on that line. Some of these sequences, that correspond to specific parts of the DNA strand, result in multiple matches; it would be fairly useful to limit sed to one replacement by making it only work on the ''n''th match.
 +
*I realized that adding a number before the ''s'' in the ''s///g'' format will limit ''sed'' to that line (ex. executing ''2s///g'', results in a replacement being made on the 2nd line). It should be useful to make several line-breaks in order to make any matching easier; later, these line-breaks will have to be removed.
 +
*In an in-class work session, I learned that <code>sed -r "<line#>s/^.{n}/<replacement>/g"</code> limits ''sed'' to the first ''n'' characters of a line.
 +
*I executed <code>cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g"</code> (took me some time to realize that the ''-r'' was necessary for this command to work!). In making up this command, I decided to add a break after finding the ''minus35box'' and I limited ''sed'' to the first match (since I realized that the first match for the ''minus35box'' is, in fact, the correct one). Since the ''minus10box'' is 17 bp away from the ''minus35box'', I decided to add the ''<minus10box>'' part of the tag after those 17 characters of the second line. I then decided to create another line-break after placing the first part of the ''minus10box'' tag;  on the third line, I exploited the fact that the ''minus10box'' is 6 bp long in order to create a ''sed'' command that would add the last part of the tag, ''</minus10box>'', after the actual ''minus10box''.
 +
**[[Media:bl_output_1.png|Output so far]]

Revision as of 22:14, 27 September 2015

Individual Journal Assignment Week 4

Finding the -35 and -10 boxes of the promoter

  • I first found infA-E.coli-K12.txt by entering the correct directory, cd ~dondi/xmlpipedb/data.
  • I copied the sequence stored in that file to an external location for future reference/checking.
  • I assumed that the sequence is the mRNA-like strand and that it runs from 5' to 3'.
  • By reading the Week 4 Assignment Page, I found that the -10 box is generally [ct]at[at]at, and that the -35 box is generally tt[gt]ac[at].
  • I skimmed over the More Text Processing Features page, and I found that sed "s/Title/<h1>&<\/h1>/g" results in an output of <h1>Title</h1> (the & in the replacement stands for the matched text); this command would be useful for tagging the sequence with its various parts.
  • I then tried cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g"| sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g".
    • This gave me the output:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>ta<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa
gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
  • I realized that there are two candidates for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first "minus10box" match, tataat, is to be ignored).
  • I looked at the Week 4 Assignment Page, and I found that the ideal spacing between the -35 and -10 boxes is 17 base pairs. Only <minus35box>tttact</minus35box> fits this criterion: it is 17 bp away from <minus10box>cattat</minus10box>, whereas the second candidate, tttaca, is only 9 bp away.
  • In the More Text Processing Features page, I found that sed "s/paragraph/&\n/g" inserts a line break right after each match (this would be useful in making the text more manageable for sed, as it is line-oriented). From that same page, I found that replacing the g in the s///g format with a number, n, makes sed replace only the nth match on each line. Some of the patterns that correspond to specific parts of the DNA strand produce multiple matches, so it is useful to limit sed to a single replacement by making it act only on the nth match.
  • I realized that adding a number before the s in the s///g format limits sed to that line (e.g., 2s///g makes replacements only on the 2nd line). It should be useful to insert several line breaks in order to make matching easier; later, these line breaks will have to be removed.
  • In an in-class work session, I learned that sed -r "<line#>s/^.{n}/<replacement>/g" matches only the first n characters of the given line, since ^ anchors the pattern at the start of the line and .{n} matches exactly n characters.
  • I executed cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" (it took me some time to realize that the -r was necessary for this command to work: without it, sed uses basic regular expressions, in which the unescaped braces in .{17} are not treated as a repetition count). In building this command, I added a line break after the minus35box and limited sed to the first match (since the first match for the minus35box is, in fact, the correct one). Since the minus10box starts 17 bp after the minus35box, I added the opening <minus10box> tag after the first 17 characters of the second line, followed by another line break. On the third line, I exploited the fact that the minus10box is 6 bp long to write a sed command that adds the closing tag, </minus10box>, right after the actual minus10box.
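
The tagging steps above can be tried on a small input. The following is a minimal sketch, not the journal's exact session. It assumes GNU sed (which accepts \n in the replacement and the -r flag; -E is an equivalent spelling there). It uses a short excerpt of the promoter region copied from the output above instead of the full infA-E.coli-K12.txt file. The variable names seq, tagged, and spacer are illustrative only. Unlike the command above, it also removes the temporary line breaks with tr and drops the g flag on the anchored substitutions, since a pattern starting with ^ can match at most once per line.

```shell
#!/bin/sh
# Excerpt of the region around the -35/-10 boxes, copied from the output above
# (shortened for illustration; not the full infA-E.coli-K12.txt file).
seq="gcgcgcaaatctttacttatttacagaacttcggcattatcttgccgg"

# Step 1: tag only the first -35 match (flag 1) and break the line after it.
# Step 2: on line 2, keep the 17-character spacer, open the -10 tag, break again.
# Step 3: on line 3, close the tag after the 6-character -10 box.
# Step 4: tr deletes the temporary line breaks.
tagged=$(printf '%s\n' "$seq" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/" \
  | tr -d '\n')

echo "$tagged"

# Sanity check: extract the text between the two boxes and count its length.
spacer=$(printf '%s\n' "$tagged" \
  | sed -E 's/.*<\/minus35box>(.*)<minus10box>.*/\1/')
echo "spacer length: ${#spacer}"
```

On this excerpt the spacer check should report 17, matching the ideal spacing described above. Counting the spacer this way is a quick guard against picking the wrong -35 candidate.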