Malverso Week 4
Transcription and Translation "Taken to the Next Level"
I completed this assignment using Putty.exe by accessing infA-E.coli-K12.txt in ~dondi/xmlpipedb/data. I added line breaks in the sequences and commands when necessary to enhance readability. I also reread the directions halfway through this assignment and decided to "clean up" all of my answers by removing the line breaks that would have shown up in the command line, which I did by adding the sed command ':a;N;$!ba;s/\n//g'.
I also fixed an error on 9/29: I had not noticed that there was supposed to be a space before and after the tags.
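The line-joining command can be tried on its own. This is a minimal sketch, assuming GNU sed; the function name join_lines and the three-line toy input are made up for illustration.

```shell
# :a        define a label named a
# N         append the next input line to the pattern space
# $!ba      if this is not the last line, branch back to label a
# s/\n//g   once every line is loaded, delete the embedded newlines
join_lines() {
  sed ':a;N;$!ba;s/\n//g'
}

printf 'aaa\nccc\nggg\n' | join_lines   # prints aaacccggg
```

For plain newline removal, tr -d '\n' would also work, though it drops the final newline as well.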
#1
-35 box of the promoter
- Looking over my notes from when I first attempted this assignment in class, I used the sed command to find every place where the pattern tt[gt]ac[at] occurred and attached a <minus35box> tag to the beginning of that sequence and a </minus35box> tag to the end. This resulted in two possible locations for the -35 box being tagged.
- Since it was given that there were 17 base pairs between the -35 box and the -10 box, I used that clue, along with a bit of guess and check, to identify which -35 box tags were correct. I assumed the first -35 box tag was correct and modified my sed command to tag only the first instance of the -35 box pattern by replacing the g at the end with a 1. I also added a \n to the end of the box tags so that a new line would start after the -35 box tag. Referring back to my in-class notes, I added a sed command to search for the base pair possibilities for the -10 box, as well as a command that would show me the point 17 characters after the end of my -35 box:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^(.){17}/&<here?>/g" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"
This confirmed that the first instance of the -35 box pattern match was the correct one. To show just where the -35 box is, I can now use the command:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
Which produces:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttac gctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatac gttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgc aaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaaatt acggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgca aaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgca ttgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaacc ggcctttttattttat
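The difference between the g flag and the 1 flag is easier to see on a short string. This is a minimal sketch, assuming GNU sed; the toy sequence and the two function names are made up for illustration and are not part of the assignment.

```shell
# Tag every match of the -35 box pattern (g flag).
tag_all_minus35() {
  sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g"
}

# Tag only the first match (numeric flag 1, which is also what
# sed does when no flag is given).
tag_first_minus35() {
  sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1"
}

toy="ccttgacaggttgactcc"   # toy sequence with two pattern matches
printf '%s\n' "$toy" | tag_all_minus35
# prints cc<minus35box>ttgaca</minus35box>gg<minus35box>ttgact</minus35box>cc
printf '%s\n' "$toy" | tag_first_minus35
# prints cc<minus35box>ttgaca</minus35box>ggttgactcc
```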
-10 box of the promoter
- By figuring out which -35 box tags were correct, I also figured out which -10 box tags were correct. I added more line breaks to make this clear, and used the command below to produce the sequence with the correct -10 box and -35 box tags:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> /g" | sed ':a;N;$!ba;s/\n//g'
Which returned:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcaca catctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtac gacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgag taaaaggtcggtttaaccggcctttttattttat
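The 17 base pair spacing can also be checked in a single expression instead of with line breaks. This is only a sketch of an alternative, not the command used above: it assumes GNU sed (where -E is equivalent to -r), and the function name tag_promoter and the short excerpt of the sequence are chosen for illustration.

```shell
# Tag a -35 box and a -10 box only when exactly 17 characters
# separate them; \1, \2 and \3 are the three captured groups.
tag_promoter() {
  sed -E 's/(tt[gt]ac[at])(.{17})([ct]at[at]at)/<minus35box>\1<\/minus35box>\2<minus10box>\3<\/minus10box>/'
}

# Excerpt around the promoter, copied from the tagged output.
excerpt="aaatctttacttatttacagaacttcggcattatcttgcc"
printf '%s\n' "$excerpt" | tag_promoter
# prints aaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgcc
```

Because both boxes sit in one pattern, a -35 candidate that has no -10 box 17 characters downstream is simply left untagged.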
transcription start site
- It was given that the tss is at the 12th nucleotide after the first nucleotide of the -10 box. I put a line break after the -10 box and then used the sed command: sed -r "4s/^(.){7}/&<tss>/g" to tag the beginning of the tss. Since the -10 box is 6 nucleotides long, I put the tss tag before the 6th character after the line break. I wasn't sure whether it should be at the 6th or 7th character, so I asked Mahrad. I decided the 7th makes more sense because the count starts after the first nucleotide of the -10 box: the remaining 5 nucleotides of the box plus 7 more make 12.
This sed command, however, is not as helpful for tagging the end of the tss. In order to tag both the beginning and the end of the location, I changed the command to:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> /g" | sed ':a;N;$!ba;s/\n//g'
This way, the nucleotide that was at the tss location would be at the beginning of the line, and I could easily surround it with the appropriate tags, as shown below:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaa tattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtg gttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaac tgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaaga gaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
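The offset can be double-checked on a short excerpt. This is a sketch of a one-expression alternative, assuming GNU sed (-E is equivalent to -r); the function name tag_tss is hypothetical and the excerpt is copied from the tagged output.

```shell
# Skip the 6 characters that follow the closing -10 box tag and
# wrap the 7th one, which is the 12th nucleotide after the first
# nucleotide of the box (5 left in the box + 7 after it).
tag_tss() {
  sed -E 's/(<\/minus10box>)(.{6})(.)/\1\2<tss>\3<\/tss>/'
}

printf '<minus10box>cattat</minus10box>cttgccggttcaaatt\n' | tag_tss
# prints <minus10box>cattat</minus10box>cttgcc<tss>g</tss>gttcaaatt
```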
ribosome binding site
- It was given that the ribosome binding site was gagg, so I just used a sed command to find an occurrence of that pattern after the tss.
I modified the sed command so that only the first instance of the pattern would be tagged, in case the pattern occurred in the sequence more than once:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> /1" | sed ':a;N;$!ba;s/\n//g'
Which produced the sequence:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgct cattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgc gcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> ct tgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagaca atattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtgg ttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactga ccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaag aacgagtaaaaggtcggtttaaccggcctttttattttat
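The line-break trick that restricts a search to the region after a tag can be seen on a toy string. This is a minimal sketch, assuming GNU sed (for \n in the replacement and the one-line join script); the function name tag_rbs_after_tss and the toy sequence, which deliberately has a gagg on each side of the tss, are made up for illustration.

```shell
tag_rbs_after_tss() {
  # 1. break the line right after the closing tss tag
  # 2. tag the first gagg on line 2 only, so anything upstream is ignored
  # 3. join the lines back together
  sed 's/<\/tss>/&\n/' |
    sed '2s/gagg/<rbs>&<\/rbs>/' |
    sed ':a;N;$!ba;s/\n//g'
}

printf 'gaggaa<tss>g</tss>ttgaggcc\n' | tag_rbs_after_tss
# prints gaggaa<tss>g</tss>tt<rbs>gagg</rbs>cc
```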
start codon
- At this point I was pretty comfortable repeating the same techniques, so I added a line break after the rbs so that I could find only the nearest occurrence of the pattern "atg" and insert the start_codon tags. I needed to refer to my notes to remember that "atg" = start codon. To "clean it up", I added sed ':a;N;$!ba;s/\n//g' to the end to remove all of the line breaks I had added, and the final command sequence is as follows:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" | sed "7s/atg/ <start_codon>&<\/start_codon> /1" | sed ':a;N;$!ba;s/\n//g'
Which produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgct gccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgt gttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttaca gaacttcgg <minus10box>cattat</minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgca tcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttacc gcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
stop codon
- I used the !172 command to recall my earlier command from my shell history, since I did not complete this homework in one sitting.
- In order to see where the stop codon was, I added a line break after the start codon tag and then separated the nucleotides into sets of three with the sed command "8s/.../& /g". Then I added another sed command that would search for the possible stop codons (taa, tag, tga). I had to look in my notes to see what the possible stop codons were.
I searched for all three with the same command, making sure to tag only the first one the program would find, and made sure to remove the extra line breaks and spaces:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" | sed -r "8s/tag |taa |tga / <stop_codon>&<\/stop_codon> /1" | sed ':a;N;$!ba;s/\n//g' | sed "s/ //g"
Which produced:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgccta ataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttc cgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagta aaaggtcggtttaaccggcctttttattttat
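Splitting into sets of three matters because a stop codon pattern can also appear out of frame. The sketch below assumes GNU sed (-E is equivalent to -r); the function names and the toy coding sequence, which has an out-of-frame tga before the in-frame taa, are made up for illustration.

```shell
# Naive search: tags the first tga/taa/tag anywhere in the text.
find_stop_any_frame() {
  sed -E 's/tag|taa|tga/<stop_codon>&<\/stop_codon>/'
}

# Frame-aware search: split into codons, match only a whole codon
# (three letters followed by the separating space), then remove
# the spaces again.
find_stop_in_frame() {
  sed 's/.../& /g' |
    sed -E 's/(tag|taa|tga) /<stop_codon>\1<\/stop_codon> /' |
    sed 's/ //g'
}

toy="gctgacaaataagg"   # codons: gct gac aaa taa gg
printf '%s\n' "$toy" | find_stop_any_frame
# prints gc<stop_codon>tga</stop_codon>caaataagg
printf '%s\n' "$toy" | find_stop_in_frame
# prints gctgacaaa<stop_codon>taa</stop_codon>gg
```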
terminator
- I remembered from class both how to work out the nucleotides that begin and end the hairpin and the piece of the sed command (.*) that lets the characters in between all be wildcards, with even the number of them left open. The nucleotides that match up with the given aaaaggt are gcctttt: that is the reverse complement, except that the directions called for the t to be paired with a g instead of an a.
This is the code:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/ <minus10box>&<\/minus10box> \n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./ <tss>&<\/tss> \n/g" | sed "6s/gagg/ <rbs>&<\/rbs> \n/1" | sed "7s/atg/ <start_codon>&<\/start_codon> \n/1" | sed "8s/.../& /g" | sed -r "8s/tag|taa|tga/ <stop_codon>&<\/stop_codon> \n/1" | sed "s/ //g" | sed -r "9s/aaaaggt.*gcctttt..../ <terminator>&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'
And this is the final result:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacg ctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacg ttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaa atc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgccta ataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaacta catccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttc cgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggtcggtttaaccggcctttttatt</terminator> ttat
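The pairing for the hairpin can be worked out mechanically. This is a minimal sketch, assuming GNU sed plus the rev and tr utilities; the function names revcomp and tag_terminator are hypothetical, and the excerpt is the tail end of the sequence above.

```shell
# Strict reverse complement: reverse the text, then swap a<->t and c<->g.
revcomp() {
  rev | tr 'acgt' 'tgca'
}

printf 'aaaaggt\n' | revcomp   # prints acctttt
# The pattern used above is gcctttt rather than acctttt because the
# directions pair the t with a g instead of an a.

# Hairpin start, any run of characters, hairpin end, then 4 more characters.
tag_terminator() {
  sed -E 's/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/'
}

printf 'gagtaaaaggtcggtttaaccggcctttttattttat\n' | tag_terminator
# prints gagt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat
```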
#2
- I looked way back in my notes to see that mRNA transcription begins at the tss and ends at the terminator. At first I tried to use a simple sed command to take out everything before the tss, but then I realized that would leave in a lot of other tags that I did not want. So I added more sed commands, combining many of them, and then I deleted the remainder of the line by replacing </terminator> and everything after it with nothing.
Lastly, I replaced the t's with u's. Here is the command sequence:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" | sed -r "8s/tag|taa|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" | sed -r "9s/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/1" | sed ':a;N;$!ba;s/\n//g' | sed -r "s/^.*<tss>//1" | sed -r "s/<\/tss>|<rbs>|<\/rbs>|<start_codon>|<\/start_codon>|<stop_codon>|<\/stop_codon>|<terminator>//g" | sed "s/<\/terminator>.*//g" | sed "s/t/u/g"
And here is the exact mRNA strand:
gguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaaacguugcc uaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgcaaaaacuacauccgcaucc ugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugucuuccguagucgcugauuguuuuaccgc cugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
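The same extraction can be sketched more compactly by stripping every tag with one generic pattern. This is an alternative under stated assumptions, not the command used above: it assumes GNU sed (-E is equivalent to -r), and the function name to_mrna and the heavily shortened tagged input are made up for illustration.

```shell
to_mrna() {
  # 1. keep only the text between <tss> and </terminator>
  # 2. delete whatever tags are left inside that region
  # 3. transcribe by turning every t into u (done last, so the
  #    t's inside tag names are never touched)
  sed -E 's/.*<tss>(.*)<\/terminator>.*/\1/' |
    sed -E 's/<[^>]*>//g' |
    sed 'y/t/u/'
}

printf 'aa<tss>g</tss>gt<rbs>gagg</rbs>at<terminator>aaaa</terminator>tt\n' | to_mrna
# prints ggugaggauaaaa
```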
#3
- I looked at my notes again to see that the translation to amino acids begins at the start codon and ends at the stop codon. I used a series of sed commands to get rid of all of the unnecessary nucleotides as well as the tags.
- Then I used some of the code I wrote for last week's assignment in order to figure out the correct amino acid sequence.
Here is the command sequence:
cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^(.){17}/&\n/g" | sed "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/g" | sed -r "4s/^(.){6}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g" | sed -r "8s/tag|taa|tga/<stop_codon>&<\/stop_codon>\n/1" | sed "s/ //g" | sed -r "9s/aaaaggt.*gcctttt..../<terminator>&<\/terminator>/1" | sed ':a;N;$!ba;s/\n//g' | sed -r "s/^.*<start_codon>//g" | sed -r "s/<\/start_codon>|<stop_codon>//g" | sed -r "s/<\/stop_codon>.*//g" | sed "s/.../& /g"| sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g"
And here is the amino acid sequence, which I checked with the ExPASy tool:
MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR-
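The contents of genetic-code.sed are not shown here, so the following is only a sketch of how the last few steps fit together. It assumes GNU sed, and it uses a hypothetical six-codon stand-in for the full table, just enough to translate the first five codons of the protein plus a stop codon appended to make a toy input.

```shell
translate_toy() {
  # split into codons, transcribe t -> u, look up each codon in a
  # partial table (stand-in for genetic-code.sed), then drop the spaces
  sed 's/.../& /g' |
    sed 'y/t/u/' |
    sed -e 's/aug/M/g' -e 's/gcc/A/g' -e 's/aaa/K/g' \
        -e 's/gaa/E/g' -e 's/gac/D/g' -e 's/uga/-/g' |
    sed 's/ //g'
}

printf 'atggccaaagaagactga\n' | translate_toy   # prints MAKED-
```

Keeping the spaces in place until the final step means each three-letter pattern can only match a whole codon, never letters that straddle two codons.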