Nanguiano Week 4

Transcription and Translation “Taken to the Next Level”

First, I needed to log in to my LMU CS account to access the data used in this week's assignment.

ssh nanguia1@lion.lmu.edu

Next, I needed to enter the folder that I'd created for the class, and create a new folder for this week's assignment.

cd biodb
mkdir week4

Next, I moved into Dondi's directory so I could obtain the file required for the assignment - infA-E.coli-K12.txt.

cd ~dondi/xmlpipedb/data
cp infA-E.coli-K12.txt ~nanguia1/biodb/week4

Then, I moved into my directory to prepare to do the assignment.

cd ~nanguia1/biodb/week4

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):

-35 box of the promoter
```
... <minus35box>...</minus35box> ...
```
- First, I needed to identify the sequence to look for within the file. The week 4 assignment indicated that the consensus sequence for the -35 box is tt[gt]ac[at], so I knew I needed to plug this pattern into sed. Because I wanted to replace a single match, sed's substitute command (s///) seemed like the best option. My first idea was to try sed "s/tt[gt]ac[at]/ & /g", which puts a space on either side of each match. This would test whether the pattern was being found correctly before I put in the tags.
- I tested using the command cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /g". However, this did not do what I wanted, since it changed every match in the sequence, not just the first. Since I only wanted the first match to be changed, I did some research on how to do that with sed. From an answer on Stack Overflow, I learned that the g flag at the end of the command tells sed to change every match. Changing it to 1 makes sed change only the first match. Running cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /1" produced the output I expected. All that was left was to add the start tag after the leading space and the end tag before the trailing space.
- However, this ended up being harder than expected. Because </minus35box> contains a / character, sed interpreted it as the end of the replacement text. The forward slash needed to be escaped so that sed would treat it as a literal character rather than as a delimiter of the command. I knew that in other command-line contexts, a backslash placed before the offending character escapes it. This held true for sed as well. The final command and output were as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg 
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccga
acctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggtt
caaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttc
cgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgact
gttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagta
aaaggtcggtttaaccggcctttttattttat
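
The difference between the occurrence flags, and a way to avoid escaping the slash, can be seen on a much shorter string. This is only a sketch: the 18-base sample below is made up for illustration (it is not part of the gene), and using | as the delimiter is an alternative to the backslash that I did not use in the command above.

```shell
# Hypothetical 18-base sample with two matches of the -35 consensus (not the real gene).
seq="ccttgacaggtttactcc"

# The g flag changes every match.
all=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/ & /g')

# A numeric flag changes only that occurrence; 1 is the first match.
first=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/ & /1')

# With | as the delimiter, the / in the closing tag needs no backslash.
tagged=$(printf '%s\n' "$seq" | sed 's|tt[gt]ac[at]| <minus35box>&</minus35box> |1')

printf '%s\n' "$all" "$first" "$tagged"
```

The three printed lines show both matches spaced out, only the first match spaced out, and only the first match tagged.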

-10 box of the promoter
```
... <minus10box>...</minus10box> ...
```
- Using what I had learned from the previous problem, as well as the hint from the week 4 assignment that the consensus sequence for the -10 box is [ct]at[at]at, I began to formulate the command. Upon running a test command to make sure that the sequence was being found correctly (cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g"), I was surprised to find a match both before and after the location of the -35 box. Knowing that the -10 box comes after the -35 box, with around 17 nucleotides between them, I knew that this time I could not simply change the first match; the second match was the correct one. However, in an arbitrary sequence there could be more than one occurrence of the -10 consensus before the -35 box. As a result, I wanted to restrict my search to the text after the -35 box. The way to do this, as stated by the text processing article, was to insert a newline after the target, then search only the second line, with the command sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1". The newline could then be removed with the command sed ':a;N;$!ba;s/\n//g'. The final command and output to display the -10 box alongside the -35 box were as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed ':a;N;$!ba;s/\n//g'

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccga
acctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaactt
cgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggta
ccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactaca
tccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcc
tgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
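
To see why the newline step matters, the same pipeline can be run on a shortened string. This is a sketch under assumptions: the sample is spliced together from pieces of the sequence purely for illustration, and the commands rely on GNU sed (the \n in the replacement and the semicolons after the :a label may not work in other sed versions).

```shell
# Shortened sample spliced from pieces of the sequence; an assumption for illustration.
seq="gttataattgcgcaaatctttacttatttacagaacttcggcattatcttgcc"

# Unrestricted: the first -10 consensus match (tataat) lies before the -35 box.
naive=$(printf '%s\n' "$seq" | sed 's/[ct]at[at]at/ & /1')

# Restricted: split after the -35 closing tag, edit only line 2, then rejoin.
restricted=$(printf '%s\n' "$seq" \
  | sed 's/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1' \
  | sed 's/<\/minus35box>/&\n/' \
  | sed '2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1' \
  | sed ':a;N;$!ba;s/\n//g')

printf '%s\n' "$naive" "$restricted"
```

The first printed line marks the wrong match near the start of the string, while the second tags cattat, the first match after the -35 box.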

transcription start site
```
... <tss>...</tss> ...
```
- The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box. I first needed to find the -10 box, so I reused the pattern from the previous problem. Since the counting starts from the first nucleotide of the box, my first thought was to put a space directly before the -10 box with the command sed "s/[ct]at[at]at/ &/2" (the second match, as found above). Then, I would detect this space and count the characters after it, treating the first nucleotide of the box as character 1. To test, I put a space directly before the 12th character using the command sed -r "s/ (.){11}/& /g", which matches the space plus the 11 characters after it. This correctly displayed the space directly before the 12th character. All that was left was to remove the space before the -10 box, isolate the 12th character, and place <tss> before it and </tss> after it in the same way as before. The command and output were as follows:

cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ &/2" | sed -r "s/ (.){11}/& <tss>/g" | sed "s/ //1" | sed "s/<tss>./&<\/tss> /g"

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatt
tttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgttt
gttgcgatttagcgcgcaaatctttacttatttacagaacttcggcattatcttgc <tss>c</tss> ggttcaaattacggtagtgataccccagaggatt
agatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgc
attgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
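
The counting can be checked on a shortened string. This is a sketch, not the command I ran: the sample is spliced from pieces of the sequence for illustration, -E is used in place of -r (GNU sed treats them the same), and the one-step variant with capture groups is an alternative I am assuming behaves the same way on the full file.

```shell
# Shortened sample spliced from pieces of the sequence; an assumption for illustration.
seq="gttataattgcgcaaatctttacttatttacagaacttcggcattatcttgccggttc"

# Stepwise, as in the journal: mark the box, count 11 characters, tag the 12th.
stepwise=$(printf '%s\n' "$seq" \
  | sed 's/[ct]at[at]at/ &/2' \
  | sed -E 's/ (.){11}/& <tss>/' \
  | sed 's/ //1' \
  | sed 's/<tss>./&<\/tss> /')

# Variant: group 1 holds the box plus 5 more bases (11 total), group 2 the 12th.
onestep=$(printf '%s\n' "$seq" \
  | sed -E 's/([ct]at[at]at.{5})(.)/\1 <tss>\2<\/tss> /2')

printf '%s\n' "$stepwise" "$onestep"
```

Both forms tag the same base here. The one-step form would pick a different match if another -10 consensus started within the 6 bases after the first match, so the stepwise form is the safer one in general.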

- When combining this code with the previous code, I knew that I'd need to edit the commands I was using, because the -10 box would now be surrounded by tags. Because the -10 box is 6 characters long, the transcription start site is the 6th character following the closing </minus10box> tag and its trailing space. Therefore, I could look for the second "> " (the end of the closing -10 box tag), skip 5 characters past it, and mark the next character. Using this adjusted approach, the full command and output are currently as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/[ct]at[at]at/ <minus10box>&<\/minus10box> /2" | sed -r "s/> (.){5}/& /2" | sed "s/ / <tss>/5" | sed "s/<tss>./&<\/tss> /g"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatt
tttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgttt
gttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box> ctt
gc <tss>c</tss> ggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtacc
gttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgca
tcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcga
agagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
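
The offset of 6 can also be anchored on the closing tag itself instead of counting spaces. This is a hedged variant, not the command used above: the input is a short fragment of the tagged output, and the capture-group form is my assumption of an equivalent.

```shell
# Fragment of the tagged output, shortened for illustration.
tagged="cttcgg <minus10box>cattat</minus10box> cttgccggttc"

# Group 1: the closing tag, its trailing space, and 5 bases; group 2: the 6th base.
with_tss=$(printf '%s\n' "$tagged" \
  | sed -E 's/(<\/minus10box> .{5})(.)/\1 <tss>\2<\/tss> /')

printf '%s\n' "$with_tss"
```

Because the pattern names the tag, it does not depend on how many spaces or > characters appear earlier in the line.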

ribosome binding site
```
... <rbs>...</rbs> ...
```
- The ribosome binding site has the consensus sequence gagg, as indicated on the week 4 assignment. Additionally, it has to come after the transcription start site. Therefore, I knew I needed to restrict my search for the ribosome binding site to the text after the transcription start site.
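
One possible way to restrict the search, reusing the newline approach from the -10 box, is sketched below. This is an assumption about how the step could be done, not a command I have run on the full file: the sample string is hypothetical (the gagg before the tag is made up to show the restriction), and the commands rely on GNU sed.

```shell
# Hypothetical sample: a made-up gagg upstream, then a fragment of the tagged output.
tagged="tgaggaa <tss>c</tss> ggttcaaattacggtagtgataccccagaggatt"

# Split after the closing tss tag, tag the first gagg on line 2, then rejoin.
with_rbs=$(printf '%s\n' "$tagged" \
  | sed 's/<\/tss>/&\n/' \
  | sed '2s/gagg/ <rbs>&<\/rbs> /1' \
  | sed ':a;N;$!ba;s/\n//g')

printf '%s\n' "$with_rbs"
```

The gagg before the transcription start site is left alone, and only the first gagg after it is tagged.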

start codon
```
... <start_codon>...</start_codon> ...
```
stop codon
```
... <stop_codon>...</stop_codon> ...
```
terminator
```
... <terminator>...</terminator> ...
```

What is the exact mRNA sequence that is transcribed from this gene? What is the amino acid sequence that is translated from this mRNA?

Links

Nicole Anguiano
BIOL 367, Fall 2015
