Msaeedi23 Week 4
Transcription and Translation “Taken to the Next Level”
This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge in both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.
The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.
- Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
- -35 box of the promoter
... <minus35box>...</minus35box> ...
- -10 box of the promoter
... <minus10box>...</minus10box> ...
- transcription start site
... <tss>...</tss> ...
- ribosome binding site
... <rbs>...</rbs> ...
- start codon
... <start_codon>...</start_codon> ...
- stop codon
... <stop_codon>...</stop_codon> ...
- terminator
... <terminator>...</terminator> ...
- What is the exact mRNA sequence that is transcribed from this gene?
- What is the amino acid sequence that is translated from this mRNA?
-35 Box
- Using the sed command and the given consensus sequence, I was able to target the -35 box. I put the pattern
tt[gt]ac[at]
into sed and used the flag 1 in place of g so that only the first occurrence of the sequence is replaced. I ran the code
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /1"
which isolated the desired sequence; however, I still needed to add the starting and ending tags. As we discussed in class, preceding a "/" with a "\" keeps the "/" as a literal character when using sed. Using this technique, I came up with the code:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
and it produced my desired output. The sequence came out as:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaa attacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgt agagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactg accccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggttta accggcctttttattttat
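The tagging step above can be tried on a short fragment. This is a minimal sketch, assuming GNU sed; the variable names and the 24-base fragment (cut from the region around the -35 box) are only for illustration. It also shows that a different delimiter such as "|" avoids the need to escape the slash.

```shell
# Toy fragment taken from the region around the -35 box (not the whole gene).
fragment="gcgcaaatctttacttatttacag"

# "&" stands for whatever the pattern matched; "\/" keeps the slash in the
# closing tag from ending the s/// command. With no "g" flag, sed replaces
# only the first match on the line (this fragment contains two matches).
tagged=$(printf '%s\n' "$fragment" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /")

# Same substitution with "|" as the delimiter, so no escaping is needed.
tagged_alt=$(printf '%s\n' "$fragment" | sed "s|tt[gt]ac[at]| <minus35box>&</minus35box> |")

printf '%s\n' "$tagged"
```

The second match in the fragment (tttaca) is left alone, which is why the first-occurrence behavior matters here.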
-10 box
- Using a similar sequence of commands as for the -35 box and the given consensus sequence, I was able to mark the -10 box. First I used the command
cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g"
to locate the desired sequence. It turned out that there were multiple occurrences of this sequence. I needed to target the -10 box as the first occurrence after the -35 box. To do this, I added a newline after the closing -35 box tag (s/<\/minus35box>/&\n/g), so that everything downstream falls on line 2, and then restricted the substitution to that line:
sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
Once this was done, the newline needed to be removed using the
sed ':a;N;$!ba;s/\n//g'
command. The resulting sequence came out to be:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttga aacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctg acgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaag agaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
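The newline technique can be sketched on a toy line that has one match upstream of the -35 box tag and one downstream. This assumes GNU sed (which turns \n in the replacement into a real newline); the toy sequence and variable names are made up for illustration.

```shell
# Toy line: "tataat" sits upstream of the tag, "cattat" downstream of it.
line="ggtataatcc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgcc"

result=$(printf '%s\n' "$line" \
  | sed "s/<\/minus35box>/&\n/" \
  | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /" \
  | sed ':a;N;$!ba;s/\n//g')
# Step 1 breaks the line after the closing tag, step 2 only touches line 2,
# and step 3 glues the lines back together.

printf '%s\n' "$result"
```

The upstream tataat is never tagged because it stays on line 1, which the "2" address skips.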
transcription start site
- The -10 box is six nucleotides long, so the sixth nucleotide directly after it marks the transcription start site. I can isolate this nucleotide by placing a newline after the five nucleotides that follow the -10 box, so that it becomes the first character of line 3, and then inserting the desired tag. Once this is completed I can remove the newlines. Because (.){5} is extended regular expression syntax, the sed that uses it needs the -r option. The added commands I used for the transcription start site came out to:
2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g'
The comprehensive command put together was
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g'
to produce:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaag gtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgc ctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
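The (.){5} pattern behaves differently with and without extended regular expressions. The following is a small sketch, assuming GNU sed, that shows both cases on a toy fragment. The "|" marker and the capture-group variant are illustrative alternatives, not the commands used in this journal.

```shell
# Toy fragment: end of the -10 box tag plus the bases that follow it.
s="</minus10box> cttgccggttc"

# Basic regex (plain sed): ( ) { } are literal characters, so nothing matches
# and the line comes back unchanged.
bre=$(printf '%s\n' "$s" | sed "s/> (.){5}/&|/")

# Extended regex (-E, or -r on older GNU sed): (.){5} means "any five characters".
# The "|" is only a visible stand-in for the newline used in the journal.
ere=$(printf '%s\n' "$s" | sed -E "s/> (.){5}/&|/")

# Alternative without a newline: capture the five skipped bases and the next
# base, then wrap only that base in the tss tag.
tss=$(printf '%s\n' "$s" | sed -E "s/(> .{5})(.)/\1 <tss>\2<\/tss> /")

printf '%s\n%s\n%s\n' "$bre" "$ere" "$tss"
```

The capture-group form does the skip and the tag in one substitution, at the cost of a denser pattern.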
ribosome binding site
- Given that the rbs is indicated by the consensus sequence gagg and that it comes after the tss, it was not too difficult to locate and tag. The tss begins the third line, so I searched this line for the target sequence by adding this code
3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g'
directly after tagging the tss. The complete command came out to be:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g'
and this produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaat attgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaa tgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctg attgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
start codon
- The start codon, atg, will appear after the rbs. To target this sequence, I added a newline after the rbs tag and searched the fourth line for the start codon. The command comes out to:
sed "4s/atg/ <start_codon>&<\/start_codon> /1"
The entire command came out to:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g'
and produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc aaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttatttta t
stop codon
- Targeting the stop codon is different from targeting the start codon, because the codons must now be read in groups of three, in the same reading frame as the start codon. A newline was again used (after the start codon tag), along with a 5 before the "s/" in the sed command to target the fifth line. To account for any of the three possible stop codons, tga, tag, or taa, it was necessary to search for all three at once, which requires sed -r for the | alternation. This was done using
sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1
After the search I needed to eliminate the spaces between the codons while putting spaces back around the tags:
5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | sed ':a;N;$!ba;s/\n//g'
Combining everything, the entire command comes out to:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g" | sed ':a;N;$!ba;s/\n//g'
and this produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaaag gtcggtttaaccggcctttttattttat
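The reading-frame point is easier to see on a toy coding region where out-of-frame stop codons come before the in-frame one. This is a minimal sketch, assuming GNU sed with -E; the 17-base sequence is invented for the example.

```shell
# Toy coding region (bases after the start codon); not the real gene.
# Read in frame it is gct gac gct aag tag cc, but "tga" and "taa" also
# appear out of frame before the in-frame "tag".
coding="gctgacgctaagtagcc"

# Naive search: the leftmost match wins, even though it straddles two codons.
naive=$(printf '%s\n' "$coding" | sed -E "s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /")

# Framed search: split into triplets first so a match cannot cross a codon
# boundary, tag the first stop, then remove the helper spaces and restore the
# spaces around the tag.
framed=$(printf '%s\n' "$coding" | sed -E "s/.../& /g; s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /; s/ //g; s/<stop_codon>/ &/; s/<\/stop_codon>/& /")

printf '%s\n%s\n' "$naive" "$framed"
```

Only the framed version tags the tag codon at the end; the naive version stops at the out-of-frame tga.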
Terminator
- The hairpin proved to be a challenge, even with the hint that the terminator includes four nucleotides following the hairpin. The hairpin essentially bends around and pairs with itself, meaning that the second half is the reverse complement of the first half, aaaaggt. Because the u in the mRNA binds with a g instead of the usual a, the second half begins with a "g" instead of an "a." This means that "gcctttt" closes the hairpin and will be included in the terminator. To mark both ends of the hairpin, I first added a newline after the stop codon tag (5s/<\/stop_codon> /&\n/g) so that the rest of the gene falls on line 6, and then used the command:
sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g'
The completed entire command came out to:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g;5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g'
to produce:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtca cgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagc aaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
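The hairpin reasoning can be checked mechanically. The sketch below, assuming GNU sed plus standard awk and tr, derives the second half of the hairpin from the first half and then tags the terminator on the tail of the gene in a single substitution. The variable names and the one-step regex are illustrative, not the commands used above.

```shell
# Tail of the gene, starting a few bases before the hairpin.
tail_seq="gaacgagtaaaaggtcggtttaaccggcctttttattttat"
first_half="aaaaggt"

# Reverse the first half with awk, then complement it with tr.
revcomp=$(printf '%s\n' "$first_half" \
  | awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' \
  | tr acgt tgca)

# The u at the end of the first half pairs with g, so the leading a becomes g.
second_half=$(printf '%s\n' "$revcomp" | sed 's/^a/g/')

# Tag from the first half through the second half plus four more bases.
tagged=$(printf '%s\n' "$tail_seq" \
  | sed -E "s/${first_half}.*${second_half}.{4}/ <terminator>&<\/terminator> /")

printf '%s\n%s\n%s\n' "$revcomp" "$second_half" "$tagged"
```

Because .* is greedy, this one-step form assumes the second half occurs only once downstream of the first half, which holds for this fragment.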
Base your commands on the following hints/guidelines about the gene, plus your own knowledge learned from the past few weeks:
- The consensus sequence for the -10 site is [ct]at[at]at.
- The consensus sequence for the -35 site is tt[gt]ac[at].
- The ideal number of base pairs between the -35 and -10 box is 17, counting from the first nucleotide after the end of the -35 sequence up to the last nucleotide before the -10 sequence.
- The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box.
- A consensus sequence for the ribosome binding site is gagg.
- The first half of the terminator “hairpin” is aaaaggt, where the u in the mRNA binds with a g instead of the usual a.
- The terminator includes 4 more nucleotides after the hairpin completes.
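The spacing hint can be turned into a single pattern that requires exactly 17 bases between the two boxes. This is a sketch assuming GNU grep and GNU sed with -E, run on a fragment of the promoter region; it is an alternative to the line-based approach in this journal, not the method used there.

```shell
# Fragment of the gene around the promoter.
region="gcgcaaatctttacttatttacagaacttcggcattatcttgcc"

# Print only the part that looks like: -35 box, 17 bases, -10 box.
promoter=$(printf '%s\n' "$region" | grep -oE 'tt[gt]ac[at].{17}[ct]at[at]at')

# Tag both boxes in one pass using capture groups \1, \2 and \3.
tagged=$(printf '%s\n' "$region" \
  | sed -E 's/(tt[gt]ac[at])(.{17})([ct]at[at]at)/ <minus35box>\1<\/minus35box> \2 <minus10box>\3<\/minus10box> /')

printf '%s\n%s\n' "$promoter" "$tagged"
```

Since 17 is described as the ideal spacing, a real promoter search might relax .{17} to a small range such as .{16,18}; that range is an assumption, not part of the hints.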
What is the exact mRNA sequence that is transcribed from this gene?
- This is where I got completely lost. I know the mRNA sequence will be transcribed from the transcription start site up to the end of the terminator. It became too confusing to work backwards. The only approach that made sense was to put newlines before and after every tag so that the unwanted pieces could be removed easily, but this seemed like a large amount of work. Because I was completely stumped, I glanced at another student's work [blitvak] to help me get going. I realized that I could use sed several times in order to target and delete specific sequences. Building on the commands used earlier in the assignment, and combining these with deletion commands, the command I first wrote down on paper and then entered came out to:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g; 5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"
- Once the spaces had been removed and each tag and sequence piece was on its own line, I deleted all of the unnecessary lines and joined the remaining pieces back together. Finally, I changed all of the t's to u's. The mRNA strand came out to be:
cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuug
aaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgca aaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugu cuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
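A shorter route to the same kind of result is to cut at the tss and terminator tags instead of counting lines. The following is a sketch under that assumption, using GNU sed on an abbreviated, made-up tagged string (most of the gene is left out to keep it readable).

```shell
# Abbreviated tagged sequence; the middle of the gene is omitted for brevity.
tagged="aaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgc <tss>c</tss> ggttca <terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat"

# 1. drop everything up to and including <tss>
# 2. drop </terminator> and everything after it
# 3. strip the remaining tags, 4. strip spaces, 5. transcribe t to u
mrna=$(printf '%s\n' "$tagged" \
  | sed -E 's/^.*<tss>//; s/<\/terminator>.*$//; s/<[^>]*>//g; s/ //g; y/t/u/')

printf '%s\n' "$mrna"
```

The y/t/u/ step runs after the tags are stripped, so the t's inside tag names are never touched.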
What is the amino acid sequence that is translated from this mRNA?
- In order to produce the amino acid sequence we must use the genetic-code.sed file and make sure the sequence is separated into three-base codons, starting at the start codon, for proper translation. The command to produce the amino acid sequence came out to:
sed -f genetic-code.sed | sed "s/ //g"
The entire command was as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1;s/<\/minus35box>/&\n/g" | sed -r "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1;2s/> (.){5}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g;3s/gagg/ <rbs>&<\/rbs> /1;3s/<\/rbs> /&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1;4s/<\/start_codon> /&\n/g" | sed -r "5s/.../& /g;5s/tag|tga|taa/ <stop_codon>&<\/stop_codon> /1;5s/ //g;5s/<stop_codon>/ &/g; 5s/<\/stop_codon>/& /g;5s/<\/stop_codon> /&\n/g" | sed "6s/aaaaggt/ <terminator>&\n/g" | sed "7s/gcctttt..../&<\/terminator> /g" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g" | sed -r "s/<|>/\n/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g"
The sequence that was produced came out to be:
MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR
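Translation has to begin at the start codon and proceed one codon at a time until a stop codon. Below is a minimal sketch of that framing in awk. The inline codon table is a hypothetical stand-in for genetic-code.sed (whose contents are not shown here), the function name translate is made up, and the test mRNA is a shortened version of the real one.

```shell
translate() {
  awk '
    BEGIN {
      # Standard genetic code with bases ordered u, c, a, g at each position.
      # "*" marks a stop codon.
      aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
      split("u c a g", b, " ")
      n = 0
      for (i = 1; i <= 4; i++) for (j = 1; j <= 4; j++) for (k = 1; k <= 4; k++) {
        n++
        code[b[i] b[j] b[k]] = substr(aa, n, 1)
      }
    }
    {
      # Assumption: the start codon is the first aug after the first gagg (rbs).
      rbs = index($0, "gagg")
      if (rbs == 0) { print ""; next }
      rest = substr($0, rbs + 4)
      start = index(rest, "aug")
      if (start == 0) { print ""; next }
      protein = ""
      for (p = start; p + 2 <= length(rest); p += 3) {
        residue = code[substr(rest, p, 3)]
        if (residue == "*") break      # a stop codon ends translation
        protein = protein residue
      }
      print protein
    }'
}

# Shortened mRNA: real 5-prime end, then aug gcc aaa gaa gac uga and a short tail.
short_mrna="cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacugauuguuu"
peptide=$(printf '%s\n' "$short_mrna" | translate)
printf '%s\n' "$peptide"
```

Stepping by three from the start codon is what keeps the untranslated bases upstream of it, which are not a multiple of three here, from shifting the frame.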
Computer Tips
- Remember that sed is line-based, and that you can add and count lines to get certain things done, say strictly before or after a certain point.
- Don’t forget how you enforced reading frames in Week 3.
- If you do add lines or spaces to get the job done, make sure to clean up after yourself by removing them from the final answer.
- This exercise is difficult enough that you might be thinking to yourself, “I’d rather do this by hand!” This sentiment is understandable, but when you find yourself feeling this way, consider the following:
- Part of the difficulty is learning these things for the first time. Once you’ve gotten the hang of it, there’s no way that doing things by hand will be faster.
- Consider trying to do this over and over, for multiple genes, with lots of potential variations. Doing this by hand not only takes longer at this point, but risks errors that a computer won’t make (once the correct commands have been determined).
- Form your commands so that they can be strung together into a single pipeline of processing directives in the end. In other words, once you’ve figured out how to do each step, no human intervention should be needed to perform everything from beginning to end.
- You will need the More Text Processing Features wiki page to complete this assignment. The How to Read XML Files wiki page gives you an idea for why the requested output was formatted the way it was.
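One way to make the steps reusable across several genes is to wrap each one in a small shell function and chain them. This is only a sketch: the function names and the two toy sequences are made up, and the rbs step is simplified (it does not restrict the search to the region after the tss as the full pipeline does).

```shell
# Each step reads from standard input and writes to standard output,
# so the steps can be strung together with pipes.
tag_minus35() { sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /"; }
tag_rbs()     { sed "s/gagg/ <rbs>&<\/rbs> /"; }
annotate()    { tag_minus35 | tag_rbs; }

# Run the same pipeline over several toy sequences with no manual steps.
for seq in "aaatctttacttccagaggatt" "ccttgacaggagggc"; do
  printf '%s\n' "$seq" | annotate
done
```

With real data, the loop could read one file per gene instead of inline strings, for example `annotate < infA-E.coli-K12.txt`.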
Msaeedi23 (talk) 23:12, 28 September 2015 (PDT)