Bklein7 Week 4

Transcription and Translation "Taken to the Next Level"

This assignment centers on text manipulation of the nucleotide sequence in the file infA-E.coli-K12.txt. I therefore began by accessing the directory ~dondi/xmlpipedb/data and copying this file to my personal directory using the command cp infA-E.coli-K12.txt bklein7. From there, I used cd to return to the directory bklein7 and begin the assignment.

Using Code to Tag the Nucleotide Sequence (Part #1)

Minus 35 Box
- I began by identifying the minus 35 box. Using the tips from the assignment, I used the command grep "tt[gt]ac[at]" infA-E.coli-K12.txt to identify sequence matches for the minus 35 box. There were two matches.
- To identify which match was the minus 35 box, I ran the command grep "[ct]at[at]at" infA-E.coli-K12.txt to identify possible minus 10 box matches. Because we know the minus 10 box must begin 17 nucleotides after the minus 35 box, only one pair of matches for the minus 35 and minus 10 boxes was possible. Although this process does require some subjective analysis, we discussed in class that this was permissible in this case to initiate the tagging process.
- Having identified the minus 35 box as the first of the two grep matches, I used a sed command to tag this sequence.
  1. The sed command was in the form s///1 to manipulate the first match to the sequence tt[gt]ac[at].
  2. The "\" symbol was used to escape the forward slash in the tag.
  3. A new line was started after the tag to make it simpler to perform subsequent operations on the sequence.
  4. The command is as follows:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt
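
The 17-nucleotide spacing rule used above to pick the right pair of matches can also be checked mechanically. The following is a minimal sketch, assuming GNU grep (for the -o and -b options) and a made-up toy sequence rather than the real infA data; the variable names seq, spacer, m35, m10, and gap are my own.

```shell
# Toy sequence (hypothetical, not the infA data): 3 nt of padding,
# a -35 box, a 17-nt spacer of g's, a -10 box, and 3 nt of padding.
spacer=$(printf '%017d' 0 | tr 0 g)
seq="cccttgaca${spacer}tataatggg"

# -o prints only the match; with -b it is prefixed by its 0-based byte offset.
m35=$(printf '%s\n' "$seq" | grep -ob "tt[gt]ac[at]")
m10=$(printf '%s\n' "$seq" | grep -ob "[ct]at[at]at")

# Spacer length = start of -10 box - start of -35 box - 6 (length of the -35 box).
gap=$(( ${m10%%:*} - ${m35%%:*} - 6 ))
echo "$m35 $m10 spacer=$gap"
```

On a real sequence each grep may report several offsets, one per line, and this sketch only compares the first of each; every candidate pair would have to be checked the same way.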

Minus 10 Box
- After identifying the minus 35 box, I broke the nucleotide sequence up into two lines as detailed above. Because we know that the minus 10 box begins 17 nucleotides after the minus 35 box and is 6 nucleotides long, the minus 10 box should always consist of characters 18-23 in line 2 of the output. I verified this for the sequence in infA-E.coli-K12.txt by checking that the sequence 17 nucleotides after the tagged minus 35 box did indeed match the minus 10 box pattern.
- To automate the tagging of the minus 10 box, I added two separate sed commands to the pipe that locate this sequence purely by its position in line 2.
  1. I wrote the first sed command to add the tag <minus10box> after the 17th character of the second line, taking advantage of the sed -r repetition shortcut.
  2. I wrote the second sed command to add the tag </minus10box> 6 characters after the first minus 10 box tag.
  3. With the addition of these two commands, my command sequence was as follows:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g"
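
To show the position-based tagging in isolation, here is a small sketch on a hypothetical two-line input, assuming GNU sed. Line 1 ("upstream") merely stands in for everything up to the closing minus 35 box tag, and the spacer of 17 g's is invented for illustration.

```shell
# Hypothetical input: line 2 is a 17-nt spacer, a 6-nt box, then 3 nt of padding.
spacer=$(printf '%017d' 0 | tr 0 g)
tagged=$(printf 'upstream\n%stataatccc\n' "$spacer" |
  sed -r "2s/^.{17}/&<minus10box>/" |
  sed "s/<minus10box>....../&<\/minus10box>/")
echo "$tagged"
```

The first command only touches line 2 because of the 2 address; the second needs no address, since the opening tag it searches for exists only on that line.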

tss
- Because the transcription start site is the 12th nucleotide counting from the beginning of the 6-nucleotide minus 10 box, I reasoned that it would be simplest to locate it as the 6th character after the </minus10box> tag. This means that 5 characters lie between that tag and the point where the <tss> tag should be inserted. The </tss> end tag should then be placed one character after the <tss> start tag.
  1. To accomplish this, I strung together two sed commands using a semicolon to both insert the start and end tags at the right locations.
  2. I piped this command to the growing sequence:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g"

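The counting for the tss tags can be tried on a short made-up line. This is only a sketch, assuming GNU sed; the input string is hypothetical, with five a's standing in for the nucleotides between the minus 10 box and the start site.

```shell
# Hypothetical line: a tagged -10 box, 5 nt, the start-site nucleotide "c", then "ggtt".
out=$(printf 'gg<minus10box>tataat</minus10box>aaaaacggtt\n' |
  sed -r "s/<\/minus10box>.{5}/&<tss>/;s/<tss>./&<\/tss>\n/")
echo "$out"
```

The first substitution skips the 5 characters after the closing tag before inserting <tss>; the second wraps the single following character and starts a new line after </tss>.
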
rbs
- The consensus sequence for the ribosome binding site is gagg. Although a quick grep search shows there is only one instance of this sequence in infA-E.coli-K12.txt, I chose to write code that would label only the first instance of gagg after the tss (as there could be many instances of this sequence in a long gene). To accomplish this, I wrote a sed command to insert the rbs labels before and after the first occurrence of gagg in the third line (this line begins after the tss). In addition, I decided to start a new line after the rbs. I piped this command to the sequence:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1"

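To illustrate why the line address and the 1 flag matter, here is a sketch on a hypothetical three-line input in which every line contains gagg (assuming GNU sed; the sequences are made up).

```shell
# Hypothetical input: gagg occurs on all three lines, and twice on line 3.
out=$(printf 'aagaggaa\nccgaggcc\nttgaggttgaggtt\n' |
  sed "3s/gagg/<rbs>&<\/rbs>\n/1")
echo "$out"
```

Only the first gagg on line 3 is tagged; lines 1 and 2 and the second gagg on line 3 are left alone, and the text after the tag moves to a new fourth line.
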
start codon
- The method I used for labelling the start codon was nearly identical to that used to label the rbs above. The only difference was that the sed command I wrote was specified to work on the 4th line (beginning after the rbs) instead of the third line. With the addition of this command, the new command sequence labelled the first instance of the sequence atg after the rbs as the start codon. The updated version of the code was as follows:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1"

terminator
- I chose to tag the terminator region prior to tagging the stop codon. This is because I found it most sensible to separate the region of the sequence between the start codon and terminator into its own line prior to writing code to find and tag the stop codon.
- The terminator region in this sequence begins with the sequence aaaaggt. Additionally, after going through the hairpin base pairing exercise for this sequence in class, I knew the terminator hairpin ended with the sequence gcctttt. However, there are 4 more nucleotides after this before the actual end of the terminator region.
- Thus, I set out to tag the terminator knowing its end points.
  1. First, I wrote a command to identify the terminator start sequence and insert a line break as well as the <terminator> tag prior to it. The line break would function by partitioning off the region of the sequence between the start codon and the terminator.
  2. Next, I wrote a command using the wildcard function to identify the entire terminator hairpin sequence, account for the remaining 4 nucleotides in the terminator, and place the </terminator> tag at its end. It wasn't necessary to introduce a new line break after this tag.
  3. The updated code is as follows:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "5s/aaaaggt/\n<terminator>&/g" |
sed -r "6s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g"
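
The two terminator commands can be tried on their own with a shortened input. In this sketch (assuming GNU sed) the terminator bases are the ones named above, while the leading gccgcc is invented filler standing in for the coding region, so the line numbers are 1 and 2 rather than 5 and 6.

```shell
# Hypothetical line: 6 nt of filler followed by the terminator region and 4 trailing nt.
out=$(printf 'gccgccaaaaggtcggtttaaccggcctttttattttat\n' |
  sed "1s/aaaaggt/\n<terminator>&/" |
  sed -r "2s/aaaaggt.*gcctttt.{4}/&<\/terminator>/")
echo "$out"
```

The .* spans the hairpin between its two known end sequences, and .{4} accounts for the 4 nucleotides that follow the hairpin before the closing tag.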

stop codon
- Having isolated the region of the sequence between the start codon and the terminator region in line 5, I set out to identify the stop codon within this line. This presented two main challenges. First, it was necessary to search for all three possible stop codons at once in order to identify the first occurrence of a stop codon within the line. Second, it was also necessary to account for reading frames when finding the stop codon.
- I addressed both problems as follows.
  1. To force sed to search for the stop codon in the same reading frame as the start codon, I added a sed command to add a space every three characters in the 5th line.
  2. Next, I read [here] that sed -r supports alternation, written as a vertical bar between the different search options. Combining this with the ability to search by line and to replace only the first occurrence of the search term, I wrote a sed command to wrap the first instance of any of the three stop codons in the 5th line in the stop codon tags and added it to the pipe.
  3. Finally, I added a command to the sequence to delete spaces present in the 5th line that were no longer necessary.
  4. Piping these commands yielded the following sequence:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "5s/aaaaggt/\n<terminator>&/g" |
sed -r "6s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g" |
sed "5s/.../& /g" |
sed -r "5s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1" |
sed "5s/ //g"

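The effect of the spacing trick is easiest to see on a short made-up coding line that contains an out-of-frame stop codon before the in-frame one. This is a sketch assuming GNU sed; the sequence and the variable names line, naive, and framed are hypothetical.

```shell
# Toy coding line (hypothetical): codons gct gac caa taa ccc.
# An out-of-frame "tga" spans codons 1 and 2; the in-frame stop is "taa".
line="gctgaccaataaccc"

# Without spacing, sed tags the leftmost match, which is out of frame.
naive=$(printf '%s\n' "$line" |
  sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/")

# Space out the codons, tag the first whole-codon stop, then remove the spaces.
framed=$(printf '%s\n' "$line" |
  sed "s/.../& /g" |
  sed -r "s/taa|tag|tga/<stop_codon>&<\/stop_codon>/" |
  sed "s/ //g")

echo "$naive"
echo "$framed"
```

Because a space follows every third character, a three-letter pattern can only match a complete codon, so the search cannot land on a stop sequence that straddles two codons.
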
Combining Lines
- To conclude the tagging command sequence, I added one final command to combine the 6 lines that I created to unify the sequence. The final command sequence and its output are listed below:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "5s/aaaaggt/\n<terminator>&/g" |
sed -r "6s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g" |
sed "5s/.../& /g" |
sed -r "5s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1" |
sed "5s/ //g" |
sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccg
ctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagcc
gtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>t
atttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttcaaattacggtagtga
tacccca<rbs>gagg</rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatattgaaatgca
aggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggta
aaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggcc
gcattgtcttccgtagtcgc<stop_codon>tga</stop_codon>ttgttttaccgcctgatgggcgaagagaaagaacg
agt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat
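
The line-joining command at the end of the pipe is fairly cryptic, so here is a small sketch of what it does on a hypothetical three-line input (assuming GNU sed).

```shell
# :a defines a label, N appends the next input line to the pattern space,
# $!ba jumps back to the label unless the last line has been read,
# and the final substitution deletes the accumulated newlines.
joined=$(printf 'aaa\nccc\nggg\n' | sed ':a;N;$!ba;s/\n//g')
echo "$joined"
```

The whole input ends up on a single line, which is what lets the later commands treat the tagged gene as one sequence again.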

Determining the Transcription Product (Part #2)

Transcription begins at the transcription start site and concludes at the end of the terminator region of a gene. In order to determine the transcription product of the gene tagged in the output above, it is necessary to first isolate the sequence starting after <tss> and concluding at </terminator>. After that, it is only necessary to replace t's with u's, as the given sequence represents the "mRNA-like" strand. The main challenge presented here is in deleting the superfluous text in the output above, particularly the tags embedded in the sequence to be transcribed. To accomplish this goal, I took the following steps:

I wrote a command to separate the sequence that is transcribed (tss to end of terminator) from the rest of the gene.
- To do this, I started with the command sequence from part #1. I then added a sed command to insert line breaks after <tss> as well as before </terminator>. The specific command was sed "s/<tss>/&\n/g;s/<\/terminator>/\n&/g".
- Next, I wrote a command to delete the first and third lines: sed "1D;3D".
- Having isolated the sequence that gets transcribed, it was then necessary to edit out all of the tags from the sequence.
  - I first tried to achieve this using the command: sed "s/<.*>//g". However, the wildcard function is greedy and did not work as I intended.
  - Instead, I used a less compact line of code to delete the tags explicitly, regardless of when they occurred in any sequence: sed -r "s/\/|<|>//g;s/tss|rbs|start_codon|stop_codon|terminator//g".
- Finally, I added a command to convert the t's in the mRNA-like strand to u's, representing the process of transcription: sed "s/t/u/g".
- The final command sequence including the above steps and its output was as follows:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "5s/aaaaggt/\n<terminator>&/g" |
sed -r "6s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g" |
sed "5s/.../& /g" |
sed -r "5s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1" |
sed "5s/ //g" |
sed ':a;N;$!ba;s/\n//g' |
sed "s/<tss>/&\n/g;s/<\/terminator>/\n&/g" |
sed "1D;3D" |
sed -r "s/\/|<|>//g;s/tss|rbs|start_codon|stop_codon|terminator//g" |
sed "s/t/u/g"

cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuugaa
acguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgcaaaa
acuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugucuuccg
uagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
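
The greedy-match problem from the tag-removal step can be reproduced on a short made-up line. The sketch below (assuming GNU sed) also shows one alternative I did not use above: the bracket expression [^>]* stops at the first closing bracket, so each tag is removed separately. The input string and variable names are hypothetical.

```shell
# Hypothetical tagged line: 3 nt upstream, the start site, 2 nt, a 4-nt "terminator", 4 nt downstream.
tagged="aaa<tss>c</tss>gg<terminator>aatt</terminator>ttat"

# Isolate the transcribed region: break after <tss> and before </terminator>, keep line 2.
region=$(printf '%s\n' "$tagged" |
  sed "s/<tss>/&\n/g;s/<\/terminator>/\n&/g" |
  sed "1D;3D")

# Greedy: .* runs from the first < to the last >, swallowing the bases in between.
greedy=$(printf '%s\n' "$region" | sed "s/<.*>//g")

# Alternative: [^>]* cannot cross a >, so only the tags themselves are deleted.
mrna=$(printf '%s\n' "$region" | sed "s/<[^>]*>//g" | sed "s/t/u/g")

echo "$region"
echo "$greedy"
echo "$mrna"
```

Compared with listing every tag name explicitly, the bracket-expression form does not need to be updated if a new tag is added, though both approaches give the same result on this input.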

Determining the Translation Product (Part #3)

Translation begins at the start codon "aug" and ends at the stop codon, in this case "uga". The file genetic-code.sed can be used to apply the genetic code to an mRNA sequence as in week 3, but the above sequence must first be trimmed to begin with the start codon and end with the stop codon. There are many ways to do this. I chose to go back into the command sequence from part #2 and section off the sequence from the start codon to the stop codon before deleting the tags. After this, I went back to the individual assignment from week 3 to get the command sequence for translation using genetic-code.sed. The specific alterations I made to the command sequence in part #2 are listed below:

1. I deleted the last two commands in the sequence, which consisted of the following: sed -r "s/\/|<|>//g;s/tss|rbs|start_codon|stop_codon|terminator//g" | sed "s/t/u/g".
2. I added a command to insert line breaks after the <start_codon> tag and before the </stop_codon> tag: sed "s/<start_codon>/&\n/g;s/<\/stop_codon>/\n&/g".
3. I added a command identical to one used earlier in the sequence to delete the first and third lines, thereby isolating the sequence from start to stop codon: sed "1D;3D".
4. I added a command to delete the remaining tags: sed "s/<\/start_codon>//g;s/<stop_codon>//g".
5. Finally, I added a slightly more condensed version of the command sequence used in week 3 to carry out translation, also re-adding the command to convert t's to u's that I had deleted in step 1: sed "s/t/u/g;s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g;s/ //g".
The final command sequence for translation and its corresponding output are listed below:

sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" infA-E.coli-K12.txt |
sed -r "2s/^.{17}/&<minus10box>/g" |
sed "s/<minus10box>....../&<\/minus10box>/g" |
sed -r "s/<\/minus10box>.{5}/&<tss>/g;s/<tss>./&<\/tss>\n/g" |
sed "3s/gagg/<rbs>&<\/rbs>\n/1" |
sed "4s/atg/<start_codon>&<\/start_codon>\n/1" |
sed "5s/aaaaggt/\n<terminator>&/g" |
sed -r "6s/aaaaggt.*gcctttt.{4}/&<\/terminator>/g" |
sed "5s/.../& /g" |
sed -r "5s/taa|tag|tga/<stop_codon>&<\/stop_codon>/1" |
sed "5s/ //g" |
sed ':a;N;$!ba;s/\n//g' |
sed "s/<tss>/&\n/g;s/<\/terminator>/\n&/g" |
sed "1D;3D" |
sed "s/<start_codon>/&\n/g;s/<\/stop_codon>/\n&/g" |
sed "1D;3D" |
sed "s/<\/start_codon>//g;s/<stop_codon>//g" |
sed "s/t/u/g;s/.../& /g" |
sed -f genetic-code.sed |
sed "s/[atcg]//g;s/ //g"

MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR-
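
Since genetic-code.sed is not reproduced here, the translation step can only be sketched with a stand-in. The function below (assuming GNU sed) hard-codes just four codon rules from the standard genetic code in place of the real file, whose exact contents I am not assuming; the translate function name and the toy input are hypothetical.

```shell
# Stand-in for genetic-code.sed: only the four codons this toy input needs
# (aug = M, gcc = A, aaa = K, uga = stop, written as "-").
translate() {
  sed "s/t/u/g;s/.../& /g" |
    sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/aaa/K/g" -e "s/uga/-/g" |
    sed "s/[acgu]//g;s/ //g"
}

# Toy input: the start codon, two codons, and a stop codon on the mRNA-like strand.
protein=$(printf 'atggccaaatga\n' | translate)
echo "$protein"
```

The sequence is converted to RNA and split into codons first, so each rule can only ever match a whole codon; the last command removes any leftover nucleotides and the spaces.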

To verify that the above result was correct, I did some research. I learned that the infA gene in E. coli codes for the protein Translation initiation factor IF-1. Using [this database], I found a graphic showing the amino acid sequence of Translation initiation factor IF-1.

Links

User Page: Brandon Klein
Team Page: The Class Whoopers
