Malverso Week 4

Transcription and Translation "Taken to the Next Level"

I completed this assignment using Putty.exe by accessing infA-E.coli-K12.txt in ~dondi/xmlpipedb/data. I added line breaks in the sequences and commands where necessary to enhance readability. I also reread the directions halfway through this assignment and decided to "clean up" all of my answers: appending the sed command ':a;N;$!ba;s/\n//g' removes the line breaks that would otherwise have shown up in the command-line output.
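
As a minimal sketch of what that clean-up command does, here it is on a made-up three-line input. The helper name join_lines is only for illustration, and the behavior assumes GNU sed:

```shell
# join_lines: read stdin and print it with every newline removed.
#   :a        defines a label named "a"
#   N         appends the next input line to the pattern space
#   $!ba      jumps back to "a" unless the last line has been read
#   s/\n//g   deletes every newline collected in the pattern space
join_lines() {
  sed ':a;N;$!ba;s/\n//g'
}

# Three short lines go in, one joined line comes out: ttacgcatttgc
printf 'ttac\ngcat\nttgc\n' | join_lines
```

Because the whole input is collected before the substitution runs, this works no matter how many line breaks the earlier sed commands added.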

I also fixed an error on 9/29: I had not noticed that there was supposed to be a space before and after the tags.

#1

-35 box of the promoter

Looking over my notes from when I first attempted this assignment in class, I used the sed command to find all the places where the pattern tt[gt]ac[at] occurred and attached a <minus35box> tag to the beginning of that sequence and a </minus35box> tag to the end. This resulted in two possible locations for the -35 box being tagged.
Since it was given that there are 17 base pairs between the -35 box and the -10 box, I used that clue, along with a bit of guess and check, to identify which -35 box tag was correct. I assumed the first -35 box tag was correct and modified my sed command to tag only the first instance of the -35 box pattern by replacing the g at the end with a 1. I also added a \n to the end of the box tags so that a new line would start after the closing -35 box tag. Referring back to my in-class notes, I added a sed command to search for the base pair possibilities for the -10 box, as well as a command that would show me the point 17 characters after the end of my -35 box:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |
sed -r "2s/^(.){17}/&<here?>/g" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"

This confirmed that the first instance of the -35 box pattern match was the correct one. To calculate just where the -35 box is, I can now use the code:

cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"

Which produces:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttac
gctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatac
gttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgc
aaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaaatt
acggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt
tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgca
aaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgca
ttgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaacc
ggcctttttattttat
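
The two pieces of sed syntax doing the work in this step are & (the text that matched) and the trailing 1 (replace only the first match). A small sketch on a short fragment of the sequence, using a hypothetical helper name tag_first_minus35:

```shell
# tag_first_minus35: wrap only the first tt[gt]ac[at] match in -35 box tags.
# "&" in the replacement stands for whatever the pattern matched.
tag_first_minus35() {
  sed 's/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1'
}

# This fragment contains two candidates (tttact and tttaca); only the first is tagged.
# prints: gcaaatc <minus35box>tttact</minus35box> tatttacagaa
printf 'gcaaatctttacttatttacagaa\n' | tag_first_minus35
```

Swapping the 1 back to g would tag both candidates, which is the situation the 17 base pair clue was used to resolve.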

-10 box of the promoter

By figuring out which -35 box tag was correct, I also figured out which -10 box tag was correct. I added more line breaks to make this clear, and used the command below to produce the sequence with the correct -10 box and -35 box tags:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> /g" |
sed ':a;N;$!ba;s/\n//g'

Which returned:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg
ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg
ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa
atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus
10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcaca
catctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtac
gacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgag
taaaaggtcggtttaaccggcctttttattttat
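
The 17 base pair spacing can also be written into a single pattern. This is a different route from the line-break approach used above, sketched here under the assumption that sed -E (extended regular expressions) is available; the helper name tag_promoter is made up for illustration:

```shell
# tag_promoter: tag a -35 box and a -10 box only when exactly 17 characters
# separate them. \1, \2 and \3 refer back to the three parenthesized groups.
tag_promoter() {
  sed -E 's/(tt[gt]ac[at])(.{17})([ct]at[at]at)/ <minus35box>\1<\/minus35box> \2 <minus10box>\3<\/minus10box> /'
}

# prints: gcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
printf 'gcaaatctttacttatttacagaacttcggcattatcttgcc\n' | tag_promoter
```

A -35 candidate that is not followed by a -10 box at the right distance simply fails to match, so no guess and check is needed.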

transcription start site

It was given that the tss is at the 12th nucleotide after the first nucleotide of the -10 box. I put a line break after the -10 box and then used the sed command: sed -r "4s/^(.){7}/&<tss>/g" to tag the beginning of the tss. Since the -10 box is 6 nucleotides long, the tss had to be either the 6th or the 7th character after the line break. I wasn't sure which, so I asked Mahrad. I decided the 7th makes more sense because the tss is the 12th nucleotide after the first nucleotide of the -10 box sequence, rather than after the first nucleotide following the -10 box tag.

This sed command, however, is not as helpful for tagging the end of the tss. In order to tag both the beginning and the end of the location, I changed the command to:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g'

This way, the nucleotide that was at the tss location would be at the beginning of the line, and I could easily surround it with the appropriate tags, as shown below:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg
ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg
ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa
atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus
10box> cttgcc <tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaa
tattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtg
gttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaac
tgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaaga
gaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
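
The counting behind the tss position can be checked on its own. In this sketch (tss_of is a hypothetical helper), the input starts at the first nucleotide of the -10 box, so the box is characters 1 to 6 and the nucleotide 12 positions after the first one is character 13:

```shell
# tss_of: given text that begins with the -10 box, print the nucleotide that
# sits 12 positions after the first nucleotide of the box. That is character 13
# overall, which is the 7th character after the 6-nucleotide box.
tss_of() {
  printf '%s\n' "$1" | cut -c13
}

# cattat is the -10 box; the 7th character after it is g.
# prints: g
tss_of 'cattatcttgccggttcaaatt'
```

This agrees with the tagged output above, where the six characters cttgcc sit between the -10 box and the tss.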

ribosome binding site

It was given that the ribosome binding site is gagg, so I just used a sed command to find an occurrence of that pattern after the tss.

I modified the sed command so that only the first instance of the pattern would be tagged, in case the pattern occurred in the sequence more than once:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> /1" |
sed ':a;N;$!ba;s/\n//g'

Which produced the sequence:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgct
cattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgc
gcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <mi
nus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> ct
tgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagaca
atattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtgg
ttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactga
ccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaag
aacgagtaaaaggtcggtttaaccggcctttttattttat
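
The trick that keeps the search downstream of the tss is the line address in front of the s command. A small sketch on a made-up input (the upstream gagg is invented to show the effect, the helper name tag_rbs_downstream is hypothetical, and \n in the replacement assumes GNU sed):

```shell
# tag_rbs_downstream: break the line after </tss>, tag the first gagg on
# line 2 only (the downstream part), then join the lines again.
tag_rbs_downstream() {
  sed 's/<\/tss>/&\n/' | sed '2s/gagg/ <rbs>&<\/rbs> /1' | sed ':a;N;$!ba;s/\n//g'
}

# The gagg before the tss is left alone; the one after it is tagged.
# prints: ccgaggtt<tss>g</tss>gttca <rbs>gagg</rbs> att
printf 'ccgaggtt<tss>g</tss>gttcagaggatt\n' | tag_rbs_downstream
```

Each sed in the pipeline is a separate process, which is why a newline written by one command counts as a real line boundary for the next.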

start codon

At this point I was pretty comfortable repeating the same techniques, so I added a line break after the rbs so that I could find only the nearest occurrence of the pattern "atg" and insert the start_codon tags. I needed to refer to my notes to remember that "atg" is the start codon. To "clean it up", I added sed ':a;N;$!ba;s/\n//g' to the end to remove all of the line breaks I had added, and the final command sequence is as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" |
sed "7s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g'

Which produced:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgct
gccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgt
gttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttaca
gaacttcgg <minus10box>cattat</minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>g
agg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt
tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgca
tcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttacc
gcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

stop codon

I used the !172 command to rerun the command from my history, since I did not complete this homework in one sitting.
In order to see where the stop codon was, I added a line break after the start codon tag and then separated the nucleotides into sets of three with the sed command "8s/.../& /g". Then I added another sed command that would search for the possible stop codons (taa, tag, tga). I had to look in my notes to see what the possible stop codons were.

I searched for all three with the same command, making sure to tag only the first one the program would find, and also made sure to remove extra line breaks and spaces:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" |
sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" |
sed -r "8s/tag |taa |tga / <stop_codon>&<\/stop_codon> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g"

Which produced:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg
ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg
ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa
atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus
10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start
_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgccta
ataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta
catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttc
cgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagta
aaaggtcggtttaaccggcctttttattttat
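
Splitting into sets of three matters because a stop codon only counts when it is in the reading frame. A minimal sketch on a made-up 12-nucleotide input, with a hypothetical helper name tag_first_stop:

```shell
# tag_first_stop: split the input into codons, tag the first codon that is
# tag, taa or tga, then remove the spaces that marked the codon boundaries.
# Requiring a space after the codon keeps the match aligned to the frame.
tag_first_stop() {
  sed 's/.../& /g' |
    sed -E 's/(tag|taa|tga) /<stop_codon>\1<\/stop_codon>/' |
    sed 's/ //g'
}

# Codons are gta gcc taa ggg. The "tag" spanning positions 2-4 is out of
# frame and is skipped; the in-frame taa is tagged.
# prints: gtagcc<stop_codon>taa</stop_codon>ggg
printf 'gtagcctaaggg\n' | tag_first_stop
```

The input is assumed to start at the first codon after the start codon, which is what the line break after the start codon tag arranges in the full pipeline.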

terminator

I remembered from class both how to work out the nucleotides that begin and end the hairpin and the piece of the sed command (.*) that lets the characters in between all be wildcards, and even lets the number of them be a wildcard. The nucleotides that match up with the given aaaaggt are gcctttt, because that is the reverse complement, except that the directions called for the t to be paired with a g instead of an a.

This is the code:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" |
sed "7s/atg/ <start_codon>&<\/start_codon> \n/1" | sed "8s/.../& /g" |
sed -r "8s/tag|taa|tga/ <stop_codon>&<\/stop_codon> \n/1" | sed "s/ //g" |
sed -r "9s/aaaaggt.*gcctttt..../ <terminator>&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'

And this is the final result:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg
ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg
ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa
atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus
10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start
_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgccta
ataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta
catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttc
cgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <
terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
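
The pairing partner of aaaaggt can be worked out mechanically. This sketch assumes the rev and tr utilities are installed, and the final swap of the leading a for a g is specific to this sequence, where the last t of aaaaggt is the one paired with g. The helper name hairpin_partner is made up:

```shell
# hairpin_partner: reverse the stretch, complement each base, then replace
# the leading a with g to model the single t-g pair from the directions.
hairpin_partner() {
  printf '%s\n' "$1" | rev | tr 'acgt' 'tgca' | sed 's/^a/g/'
}

# aaaaggt reversed is tggaaaa, complemented is acctttt, and with the
# t-g pair it becomes gcctttt.
# prints: gcctttt
hairpin_partner aaaaggt
```

The result is the same gcctttt that closes the hairpin in the terminator pattern.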

#2

I looked way back in my notes to see that mRNA transcription begins at the tss and ends at the terminator. At first I tried to use a simple sed command to take out everything before the </tss> tag, but then I realized that would leave in a lot of other tags that I did not want. So I added more sed commands, combining many of them, and then deleted the remainder of the line by replacing </terminator> and everything after it with nothing.

Lastly, I replaced the t's with u's. Here is the command sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" |
sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" |
sed -r "8s/tag|taa|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" |
sed -r "9s/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/1" | sed ':a;N;$!ba;s/\n//g' |
sed -r "s/^.*<tss>//1" |
sed -r "s/<\/tss>|<rbs>|<\/rbs>|<start_codon>|<\/start_codon>|<stop_codon>|<\/stop_codon>|<terminator>//g" |
sed "s/<\/terminator>.*//g" | sed "s/t/u/g"

And here is the exact mRNA strand:

gguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaaacguugcc
uaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgcaaaaacuacauccgcaucc
ugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugucuuccguagucgcugauuguuuuaccgc
cugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
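
The same extraction can be sketched on a short made-up tagged string. This version assumes the transcript should include the tss nucleotide itself, strips every remaining tag with one generic pattern instead of listing each tag by name, and uses a hypothetical helper name to_mrna:

```shell
# to_mrna: keep the text from <tss> through the end of the terminator,
# drop all tags, and write every t as u.
to_mrna() {
  sed -E 's/^.*<tss>//; s/<\/terminator>.*$//; s/<[^>]*>//g' | tr 't' 'u'
}

# prints: gguugaggauaaaggu
printf 'cc<tss>g</tss>gtt<rbs>gagg</rbs>at<terminator>aaaggt</terminator>ttat\n' | to_mrna
```

The tags have to be removed before the t to u step, otherwise the letters inside tag names such as terminator would be rewritten too.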

#3

I looked at my notes again to see that the translation to amino acids begins at the start codon and ends at the stop codon. I used a bunch of sed commands to get rid of all of the unnecessary nucleotides as well as the tags.

Then I used some of the code I wrote for last week's assignment in order to figure out the correct amino acid sequence.

Here is the command sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" |
sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g" |
sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" |
sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" |
sed -r "8s/tag|taa|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" |
sed -r "9s/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/1" | sed ':a;N;$!ba;s/\n//g' |
sed -r "s/^.*<start_codon>//g" | sed -r "s/<\/start_codon>|<stop_codon>//g" | sed -r "s/<\/stop_codon>.*//g" |
sed "s/.../& /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g"

And here is the amino acid sequence, which I checked with the ExPASy tool:

MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR-
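
The translation step can be sketched with a handful of substitution rules. The contents of genetic-code.sed are not shown here, so the rules below are only an assumed stand-in covering the six codons this demo needs, and translate_demo is a hypothetical name:

```shell
# translate_demo: split mRNA into codons, then replace each listed codon
# (together with its trailing space) by a one-letter amino acid code.
# A complete rule file would carry one rule for each of the 64 codons.
translate_demo() {
  sed 's/.../& /g' |
    sed -e 's/aug /M/g' -e 's/gcc /A/g' -e 's/aaa /K/g' \
        -e 's/gaa /E/g' -e 's/gac /D/g' -e 's/uga /-/g'
}

# The first five codons of the coding region followed by the stop codon.
# prints: MAKED-
printf 'auggccaaagaagacuga\n' | translate_demo
```

Matching the trailing space along with each codon keeps every rule aligned to the reading frame, and the uppercase output letters cannot be mistaken for lowercase nucleotides by later rules.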

Team Page

Heavy Metal HaterZ

Assignments

Individual Journal Entries

Shared Journal Entries
