Difference between revisions of "Bklein7 Week 4"

From LMU BioDB 2015
(added minor grammar edits)
(added tss tagging methods)
Revision as of 22:14, 28 September 2015

Transcription and Translation "Taken to the Next Level"

This assignment centers on text manipulation of the nucleotide sequence in the file infA-E.coli-K12.txt. I therefore began by accessing the directory ~dondi/xmlpipedb/data and copying this file to my personal directory with the command cp infA-E.coli-K12.txt bklein7. From there, I used cd to return to the directory bklein7 and begin the assignment.

Using Code to Tag the Nucleotide Sequence (Part #1)

  • Minus 35 Box
    • I began by identifying the minus 35 box. Following the tips from the assignment, I ran the command grep "tt[gt]ac[at]" infA-E.coli-K12.txt to find sequences matching the minus 35 box. There were two matches.
    • To determine which match was the minus 35 box, I ran the command grep "[ct]at[at]at" infA-E.coli-K12.txt to find possible minus 10 box matches. Because the minus 10 box must occur 17 nucleotides after the minus 35 box, only one pair of matches for the minus 35 and minus 10 boxes was possible. Although this process requires some subjective analysis, we discussed in class that this was permissible here to initiate the tagging process.
    • Having identified the minus 35 box as the first of the two grep matches, I used a sed command to tag this sequence.
      1. The sed command was in the form s///1 to manipulate the first match to the sequence tt[gt]ac[at].
      2. The "\" symbol was used to escape the forward slash in the tag.
      3. A new line was started after the tag to make it simpler to perform subsequent operations on the sequence.
      4. The command is as follows:
 sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt
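The pairing check and the first-match substitution described above can be tried on a made-up sequence. This is only a sketch under assumptions: the toy string is hypothetical (it is not the infA sequence), the combined grep -oE pattern is a shortcut that was not part of the workflow above, and GNU sed is assumed so that \n in the replacement produces a line break.

```shell
# Hypothetical toy sequence, NOT the real infA sequence.
spacer="ccccc""ccccc""ccccc""cc"   # 17 spacer characters
toy="aaattgaca${spacer}tataatgggttgactaaa"

# Count every minus 35 box candidate (one match per output line).
candidates=$(printf '%s\n' "$toy" | grep -o "tt[gt]ac[at]" | wc -l)

# Keep only a candidate that is followed, 17 characters later,
# by a minus 10 box candidate.
pair=$(printf '%s\n' "$toy" | grep -oE "tt[gt]ac[at].{17}[ct]at[at]at")

# Tag the first candidate and break the line after the closing tag.
tagged=$(printf '%s\n' "$toy" |
  sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1")

printf 'candidates: %s\n' "$candidates"
printf 'pair: %s\n' "$pair"
printf '%s\n' "$tagged"
```

In this toy input only the first candidate has a minus 10 box candidate at the right distance, which is the situation in which the 1 flag is enough to tag the correct match.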
  • Minus 10 Box
    • After identifying the minus 35 box, I broke the nucleotide sequence into two lines as detailed above. Because the minus 10 box lies 17 nucleotides after the minus 35 box and is 6 nucleotides long, the minus 10 box should always consist of characters 18-23 in line 2 of the output. I verified this for the sequence in infA-E.coli-K12.txt by checking that the sequence 17 nucleotides after the tagged minus 35 box did indeed match the minus 10 box pattern.
    • To automate the tagging of the minus 10 box, I added two separate sed commands to the pipe that locate this sequence by its position in line 2 rather than by pattern matching.
      1. I wrote the first sed command to add the tag <minus10box> after the 17th character of the second line, taking advantage of the sed -r repetition shortcut.
      2. I wrote the second sed command to add the tag </minus10box> 6 characters after the first minus 10 box tag.
      3. With the addition of these two commands, my command sequence was as follows:
 sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
 sed -r "2s/^.{17}/&<minus10box>/g" | sed "s/<minus10box>....../&<\/minus10box>/g"
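The position-based tagging above can be checked on a small made-up input. This is a minimal sketch, assuming GNU sed (for -r); the two-line toy input is hypothetical and only mimics the shape of the output after the minus 35 box step.

```shell
# Hypothetical input: line 1 ends with the tagged minus 35 box,
# line 2 holds 17 spacer characters followed by a minus 10 box candidate.
spacer="ccccc""ccccc""ccccc""cc"   # 17 spacer characters
toy=$(printf '%s\n' "aaa<minus35box>ttgaca</minus35box>" "${spacer}tataatgggg")

# Step 1: on line 2 only, insert <minus10box> after the first 17 characters.
# Step 2: insert </minus10box> six characters after the start tag.
tagged=$(printf '%s\n' "$toy" |
  sed -r "2s/^.{17}/&<minus10box>/" |
  sed "s/<minus10box>....../&<\/minus10box>/")

printf '%s\n' "$tagged"
```

The 2 address restricts the first substitution to the second line, so line 1 passes through unchanged even though it also has at least 17 characters.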
  • tss
    • Because the transcription start site is located 12 nucleotides from the beginning of the 6-nucleotide-long minus 10 box, I reasoned that it would be simplest to calculate its location as the 6th character after the </minus10box> tag. This means that 5 characters lie between that tag and the point where the <tss> tag should be inserted. The </tss> end tag should then be placed one character after the <tss> start tag.
      1. To accomplish this, I joined two sed substitutions with a semicolon so that a single command inserts both the start and end tags at the right locations.
      2. I piped this command onto the growing sequence:
 sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
 sed -r "2s/^.{17}/&<minus10box>/g" | sed "s/<minus10box>....../&<\/minus10box>/g" |
 sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>/g"
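The offset arithmetic above (12 - 6 = 6, so the start site is the 6th character after the closing tag and 5 characters are skipped) can be verified on a made-up line. This is a sketch assuming GNU sed; the toy line is hypothetical and contains invented bases.

```shell
# Hypothetical line: a tagged minus 10 box followed by made-up bases.
toy="<minus10box>tataat</minus10box>cccccaggg"

# Expression 1 skips 5 characters after the closing tag and inserts <tss>.
# Expression 2 inserts </tss> one character after the <tss> tag.
# The ; separates two substitutions inside a single sed invocation.
tagged=$(printf '%s\n' "$toy" |
  sed -r "s/<\/minus10box>.{5}/&<tss>/;s/<tss>./&<\/tss>/")

printf '%s\n' "$tagged"
```

Note that the first expression only matches if the closing tag in the input is written exactly as </minus10box>, including the final >.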
  • rbs
  1. Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
    • ribosome binding site
      ... <rbs>...</rbs> ...
    • start codon
      ... <start_codon>...</start_codon> ...
    • stop codon
      ... <stop_codon>...</stop_codon> ...
    • terminator
      ... <terminator>...</terminator> ...
  2. What is the exact mRNA sequence that is transcribed from this gene?
  3. What is the amino acid sequence that is translated from this mRNA?

Preliminary Code

 sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
 sed -r "2s/^.{17}/&<minus10box>/g" | sed "s/<minus10box>....../&<\/minus10box>/g" |
 sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>/g" | sed "s/gagg/<rbs>&<\/rbs>\n/g" |
 sed "3s/atg/<start_codon>&<\/start_codon>\n/1" | sed "s/aaaaggt/\n<terminator>&/g" |
 sed -r "s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g" | sed "4s/.../& /g" |
 sed -r "4s/taa|tag|tga/<stop_codon>&<\/stop_codon>/g" | sed "4s/ //g" | sed ':a;N;$!ba;s/\n//g'
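One step in the pipeline above that may not be obvious is why line 4 is split into groups of three before the stop codon is tagged. The sketch below illustrates the idea on a hypothetical coding line (made-up bases, not the infA sequence), assuming GNU sed; the 4 line address is dropped because the toy input has only one line.

```shell
# Hypothetical coding line. Read in frame it is gct aaa tag ccc,
# but an out-of-frame "taa" also occurs at characters 3-5.
toy="gctaaatagccc"

# Without codon spacing, the first match is the out-of-frame taa.
naive=$(printf '%s\n' "$toy" |
  sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/")

# With a space after every three characters, a match cannot span a
# codon boundary, so only the in-frame tag is found. The spaces are
# removed again afterwards.
framed=$(printf '%s\n' "$toy" |
  sed "s/.../& /g" |
  sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/g" |
  sed "s/ //g")

printf 'naive:  %s\n' "$naive"
printf 'framed: %s\n' "$framed"
```

This relies on the line starting exactly at the first base of the reading frame, which is what the line break after the start codon tag provides in the pipeline.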
  • String together redundant sed commands with ; or find a way to make them more compact
  • Verify when the terminator sequence starts
  • Is there a way to count backwards in a line with sed, or to replace only the last instance that matches? Previous experiments:
    • sed "s/x.*$/y/g" - does not work; the wildcard matches everything from the first x to the end of the line
    • sed "s/x/y/1$" - does not work
    • answer: use the rev command!
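The rev idea noted above can be sketched as follows. The toy string and the uppercase replacement are hypothetical and chosen only to make the changed match visible; the second variant, which uses a greedy capture instead of rev, is an alternative that does not appear in the notes above. GNU sed is assumed for -r.

```shell
# Hypothetical line with two occurrences of atg.
toy="atgcccatgccc"

# rev approach: reverse the line, replace the FIRST match of the
# reversed pattern (gta) with the reversed replacement (GTA), reverse back.
with_rev=$(printf '%s\n' "$toy" | rev | sed "s/gta/GTA/" | rev)

# Alternative without rev: a greedy .* captured in front of the pattern
# consumes as much as possible, so the pattern matches its last occurrence.
with_greedy=$(printf '%s\n' "$toy" | sed -r "s/(.*)atg/\1ATG/")

printf '%s\n' "$with_rev" "$with_greedy"
```

With rev, both the pattern and the replacement have to be written backwards, which gets awkward for tags such as <stop_codon>; the greedy capture avoids that at the cost of a slightly less obvious expression.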
