Jwoodlee Week 4

From LMU BioDB 2015
==== Supplementary Information ====

As a sample answer for the first question, [[Week 2]]’s paper handout sequence would have been marked as follows (line breaks are included only for clarity):

 agtgta <minus35box>ttgaca</minus35box> tgatagaagcactctac <minus10box>tatatt</minus10box> tcaat
 <tss>a</tss> ttcctag <rbs>gagg</rbs> tttgacct <start_codon>atg</start_codon> attgaacttgaa...aataccatggta
 <stop_codon>taa</stop_codon> ccca <terminator>gccgccagttccgctggcggcatttt</terminator> aac

'''Note:''' The commands needed to generate the output above will be similar, but ''not'' exactly the same as the ones needed for ''infA''.

Base your commands on the following hints/guidelines about the gene, plus your own knowledge learned from the past few weeks:
* The consensus sequence for the -10 site is <code>[ct]at[at]at</code>.
* The consensus sequence for the -35 site is <code>tt[gt]ac[at]</code>.
* The ideal number of base pairs between the -35 and -10 box is 17, counting from the first nucleotide after the end of the -35 sequence up to the last nucleotide before the -10 sequence.
* The transcription start site is located at the 12th nucleotide ''after'' the first nucleotide of the -10 box.
* A consensus sequence for the ribosome binding site is <code>gagg</code>.
* The first half of the terminator “hairpin” is <code>aaaaggt</code>, where the <code>u</code> in the mRNA binds with a <code>g</code> instead of the usual <code>a</code>.
* The terminator includes 4 more nucleotides after the hairpin completes.

==== Computer Tips ====

* Remember that <code>sed</code> is line-based, and that you can add and count lines to get certain things done, say strictly before or after a certain point.
* Don’t forget how you enforced reading frames in [[Week 3]].
* If you do add lines or spaces to get the job done, make sure to ''clean up after yourself'' by removing them from the final answer.
* This exercise is difficult enough that you might be thinking to yourself, “I’d rather do this by hand!”  This sentiment is understandable, but when you find yourself feeling this way, consider the following:
** Part of the difficulty is learning these things for the first time.  Once you’ve gotten the hang of it, there’s no way that doing things by hand will be faster.
** Consider trying to do this over and over, for multiple genes, with lots of potential variations.  Doing this by hand not only takes longer at this point, but risks errors that a computer won’t make (once the correct commands have been determined).
* Form your commands so that they can be strung together into a single pipeline of processing directives in the end.  In other words, once you’ve figured out how to do each step, no human intervention should be needed to perform everything from beginning to end.
* You will need the [[More Text Processing Features]] wiki page to complete this assignment.  The [[How to Read XML Files]] wiki page gives you an idea for why the requested output was formatted the way it was.

Latest revision as of 02:48, 29 September 2015

Transcription and Translation “Taken to the Next Level”

This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge in both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.

The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc
tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt
gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc
aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa
cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga
ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg
ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

I opened a terminal and used the ssh command to log in, then moved into dondi's directory, ~dondi/xmlpipedb/data. There I got access to infA-E.coli-K12.txt, the nucleotide sequence I will be using for this assignment.

  • Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
  • -35 box of the promoter
    • As shown in class, I used the sed command to tag the first occurrence of the -35 box in the sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1"

  • -10 box of the promoter
    • To tag the -10 box correctly, I made a new line after the -35 box, counted 17 characters into the new line, and then made another new line. I then tagged the first match of the -10 box sequence on the 3rd line. This finds the correct sequence, but it breaks the sequence up into 3 lines, so I will have to remedy this at the end.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>/1"
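The line-counting trick described above can be tried on a short excerpt. This is only a minimal sketch: it assumes GNU sed (for `\n` in the replacement and the `-r` flag), the excerpt is the promoter region copied from the infA sequence, and the variable names `seq` and `tagged` are made up for illustration.

```shell
# Promoter region copied from the infA sequence (illustrative excerpt).
seq="gcaaatctttacttatttacagaacttcggcattatcttgccggttc"

# 1st sed: tag the first -35 match and break the line right after it.
# 2nd sed: on line 2, break again after the 17-character spacer.
# 3rd sed: line 3 now starts where the -10 box is expected, so tag it there.
# tr: join the pieces back into a single line.
tagged=$(printf '%s\n' "$seq" \
  | sed -r 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/' \
  | sed -r '2s/^.{17}/&\n/' \
  | sed -r '3s/[ct]at[at]at/<minus10box>&<\/minus10box>/' \
  | tr -d '\n')

echo "$tagged"
```

Each sed in the pipeline is a separate process, so the line breaks inserted by one command become real line numbers (`2s`, `3s`) for the next.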

  • transcription start site
    • Knowing that the TSS is the 12th character counting from the first character of the -10 box, I counted 5 characters after the last -10 character and made a new line there, so the first character on that new line is the TSS. How I did that can be seen in the commands below.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>/g"
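The counting behind the TSS step (6 characters of -10 box, then 5 more, then the TSS as the 12th) can be checked in isolation. The sketch below assumes GNU sed and starts from an excerpt that begins at the -10 box; the variable names are hypothetical, and the single-regex variant is an alternative for comparison, not the method used in this journal.

```shell
# Excerpt starting at the -10 box of infA (illustrative).
region="cattatcttgccggttc"

# Line-based version: split after the 6-character box, then after 5 more
# characters, and tag the first character of the resulting third line.
by_lines=$(printf '%s\n' "$region" \
  | sed -r 's/^.{6}/&\n/' \
  | sed -r '2s/^.{5}/&\n/' \
  | sed '3s/^./<tss>&<\/tss>/' \
  | tr -d '\n')

# Single-regex version: skip 11 characters (6 + 5) and tag the 12th.
by_regex=$(printf '%s\n' "$region" \
  | sed -r 's/^(.{11})(.)/\1<tss>\2<\/tss>/')

echo "$by_lines"
echo "$by_regex"
```

Both variants should print the same tagged string, which is a quick way to confirm the 6 + 5 + 1 arithmetic.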

  • ribosome binding site
    • To find the ribosome binding site, I made a new line after the TSS and found the next occurrence of "gagg" as per the assignment specifications:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>/1"

  • start codon
    • I found the start codon in much the same way as I found the rest of the sequences. I made a new line after the previous sequence, and then found the first occurrence of "atg", and then marked it up.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>/1"

  • stop codon
    ... <stop_codon>...</stop_codon> ...
    • The stop codon requires that I find one of three possible three-character sequences. At first I tried using brackets: "t[ag][ag]", but I soon found out that this yielded too many results: there are only three stop codons, and the brackets match 4 unique codons (the extra one is tgg). So into the wiki I went, and realized I could use a vertical bar to separate the three codons and search for them. The problem, however, was that this did not work on its own. After being stumped for a while, I realized that before piping to that command I needed to break up the line into sets of 3, just like I did in the week 3 assignment, so that only codons in the reading frame can match. As a result I got this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1"
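Both problems described above (the bracket pattern matching a fourth codon, and the alternation matching out of frame) can be reproduced on toy input. This is a sketch under assumptions: GNU sed and grep are available, and `orf` is a made-up 12-base sequence, not part of infA, chosen so that an out-of-frame tga comes before the real stop codon.

```shell
# The bracket pattern accepts tgg as well as the three real stop codons.
bracket_hits=$(printf '%s\n' taa tag tga tgg | grep -c '^t[ag][ag]$')
alt_hits=$(printf '%s\n' taa tag tga tgg | grep -E -c '^(tag|tga|taa)$')
echo "bracket: $bracket_hits, alternation: $alt_hits"

# Made-up reading frame: gct gat cgc tga. Bases 3-5 also spell "tga".
orf="gctgatcgctga"

# Without enforcing the frame, the leftmost match is the out-of-frame one.
unframed=$(printf '%s\n' "$orf" \
  | sed -r 's/tag|tga|taa/<stop_codon>&<\/stop_codon>/')

# Splitting into triplets first means only in-frame codons can match;
# the helper spaces are removed again afterwards.
framed=$(printf '%s\n' "$orf" \
  | sed -r 's/.../& /g' \
  | sed -r 's/tag|tga|taa/<stop_codon>&<\/stop_codon>/' \
  | sed 's/ //g')

echo "$unframed"
echo "$framed"
```

The spaces work as frame markers because none of the three-letter patterns can match across a space.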
  • terminator
    • The first part of the terminator hairpin is aaaaggt, which means, following the terminator rules provided to us, that the first half bonds with gcctttt. So the trick is to grab the correct terminator sequence. I ended up breaking the terminator command into two commands: one to insert the opening tag and one to insert the closing tag. I did this because I wasn't sure how long the sequence between the two hairpin halves would be. This is what I got to capture the terminator sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g"
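The two-step terminator tagging can be tried on just the tail of the gene. A minimal sketch, assuming GNU sed; `tail_seq` is the end of the infA sequence and the variable names are illustrative. The one-command variant is an alternative I am adding for comparison; it relies on `.*` to absorb a spacer of unknown length, and because `.*` is greedy it would pick the last gcctttt if there were several.

```shell
# Last 37 bases of the infA sequence (illustrative excerpt).
tail_seq="gagtaaaaggtcggtttaaccggcctttttattttat"

# Two steps: open the tag at the first hairpin half, then close it
# 4 bases after the second half, whatever lies in between.
two_step=$(printf '%s\n' "$tail_seq" \
  | sed 's/aaaaggt/<terminator>&/' \
  | sed -r 's/gcctttt..../&<\/terminator>/')

# One step: let .* cover the unknown-length spacer between the halves.
one_step=$(printf '%s\n' "$tail_seq" \
  | sed -r 's/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/')

echo "$two_step"
echo "$one_step"
```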
  • And so, finally, it is all marked up. However, I'm not quite done yet: I need to get rid of all the new lines I created. To do this I used the command sed ':a;N;$!ba;s/\n//g' (from the wiki), so the final output is as follows.
  • (a)
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcg
gagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttcaaattacggtagtgatacccca
<rbs>gagg</rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttac
tgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgc<stop_codon>tga</stop_codon>
ttgttttaccgcctgatgggcgaagagaaagaacgagt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat
  • (b) And the final command is as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g'
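The newline-removal idiom at the end of this pipeline is easier to read once taken apart. The sketch below assumes GNU sed and uses made-up three-line input; the `tr` variant is an equivalent shortcut offered for comparison, not something taken from the wiki.

```shell
# :a      define a label named "a"
# N       append the next input line to the pattern space
# $!ba    if this is not the last line, branch back to label "a"
# s/\n//g once everything is collected, delete every embedded newline
joined_sed=$(printf 'abc\ndef\nghi\n' | sed ':a;N;$!ba;s/\n//g')

# tr deletes every newline character directly.
joined_tr=$(printf 'abc\ndef\nghi\n' | tr -d '\n')

echo "$joined_sed"
echo "$joined_tr"
```

One difference worth knowing: sed keeps the final newline of the output while `tr -d '\n'` removes it too, which does not matter inside `$(...)` but can matter when writing to a file.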


  • What is the exact mRNA sequence that is transcribed from this gene?
    • To get the mRNA sequence I need the sequence from the transcription start site to the end of the terminator. I found it easiest to make new lines based on the markup tags already there, and then pick and choose which lines to transcribe. Using sed, I can delete lines by number, for example: sed "2,4D". Using this trick, I deleted all the unnecessary lines; every nucleotide left over should be transcribed into mRNA. I was going to make the new lines by typing out a separate sed command for each tag, but two commands are enough to put each tag on its own line: sed "s/>/&\n/g" | sed "s/</\n&/g". Then I delete the tags and the unneeded sequences, remove the extra line breaks, and transcribe. Here is the sequence followed by the command.
    • (a)
cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuug
aaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgca
aaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugu
cuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
    • (b)

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' | sed "s/>/&\n/g" | sed "s/</\n&/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28D;29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"
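The split-then-delete approach can be followed line by line on a small tagged string. This is a sketch assuming GNU sed; `marked` is a made-up fragment in the same tag style, and it uses lowercase `d`, which behaves the same as `D` here because each line holds no embedded newline.

```shell
# Made-up tagged fragment: 2 leading bases, a TSS, a spacer, an RBS, a tail.
marked="aa<tss>c</tss>ggtt<rbs>gagg</rbs>at"

# After the two splitting commands the lines are:
#   1 aa   2 <tss>   3 c   4 </tss>   5 ggtt   6 <rbs>   7 gagg   8 </rbs>   9 at
# Deleting lines 1,2,4,6,8 drops the upstream bases and every tag line,
# keeping everything from the TSS onward; then join and transcribe t -> u.
mrna=$(printf '%s\n' "$marked" \
  | sed 's/>/&\n/g' \
  | sed 's/</\n&/g' \
  | sed '1,2d;4d;6d;8d' \
  | tr -d '\n' \
  | sed 's/t/u/g')

echo "$mrna"
```

Because the line numbers depend on exactly how many tags were inserted, the delete list has to be recounted whenever the markup changes.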

  • What is the amino acid sequence that is translated from this mRNA?
    • Now I do the same thing as for the mRNA, except that the lines I delete are different. My goal was to keep the nucleotides from the start codon up to the stop codon and get rid of everything else. I used the same commands as for the mRNA, but this time I deleted all the lines except 19 and 21. This left me with just the coding nucleotides. I broke them up into groups of three and called the genetic-code.sed file to translate the codons. The result was this amino acid sequence.
    • (a) M A K E D N I E M Q G T V L E T L P N T M F R V E L E N G H V V T A H I S G K M R K N Y I R I L T G D K V T V E L T P Y D L S K G R I V F R S R
    • (b)

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' | sed "s/>/&\n/g" | sed "s/</\n&/g" | sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed
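The last three stages (group into codons, transcribe, translate) can be sketched without the course's genetic-code.sed file, whose contents are not shown here. The four `-e` rules below are a hypothetical stand-in covering only the codons in this example; the input is the first 12 bases of the infA coding region.

```shell
# First four codons of the infA coding region: atg gcc aaa gaa.
cds="atggccaaagaa"

# 1. Put a space after every three bases to fix the reading frame.
# 2. Transcribe t -> u.
# 3. Translate with a tiny stand-in codon table (assumed, not the real
#    genetic-code.sed): aug=M, gcc=A, aaa=K, gaa=E.
# 4. Trim the trailing space left by step 1.
protein=$(printf '%s\n' "$cds" \
  | sed 's/.../& /g' \
  | sed 's/t/u/g' \
  | sed -e 's/aug/M/g' -e 's/gcc/A/g' -e 's/aaa/K/g' -e 's/gaa/E/g' \
  | sed 's/ *$//')

echo "$protein"
```

The result agrees with the first four residues of the full answer above (M A K E).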


BIOL 367, Fall 2015, User Page, Team Page
