Msaeedi23 Week 4

From LMU BioDB 2015
Revision as of 04:32, 29 September 2015 by Msaeedi23 (Talk | contribs) (-10 box marked)

Transcription and Translation “Taken to the Next Level”

This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge of both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.

The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc
tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt
gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc
aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa
cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga
ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg
ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

  1. Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
    • -35 box of the promoter
      ... <minus35box>...</minus35box> ...
    • -10 box of the promoter
      ... <minus10box>...</minus10box> ...
    • transcription start site
      ... <tss>...</tss> ...
    • ribosome binding site
      ... <rbs>...</rbs> ...
    • start codon
      ... <start_codon>...</start_codon> ...
    • stop codon
      ... <stop_codon>...</stop_codon> ...
    • terminator
      ... <terminator>...</terminator> ...
  2. What is the exact mRNA sequence that is transcribed from this gene?
  3. What is the amino acid sequence that is translated from this mRNA?

-35 Box

  • Using the sed command and the consensus sequence given for the -35 box, I was able to target it. I passed tt[gt]ac[at] to sed as the search pattern and used the flag 1 in place of g so that only the first occurrence of the pattern is replaced. Running cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ & /1" put spaces around the desired sequence, but I still needed to add the start and end tags. As we discussed in class, preceding a "/" with a "\" makes sed treat the "/" as a literal character rather than as a delimiter. Using this technique, I came up with the command cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1", which produced the desired output. The sequence came out as:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg 
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaa
attacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgt
agagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactg
accccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggttta
accggcctttttattttat
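
The effect of the 1 flag can be checked on a small input. The following is a minimal sketch, not part of the original answer: it uses a short contiguous excerpt of infA around the -35 box, the function names tag_first and tag_all are hypothetical, and it uses "|" as the sed delimiter (an alternative to escaping "/" with "\").

```shell
# Contiguous excerpt of the infA sequence around the -35 box (shortened for illustration).
fragment="caaatctttacttatttacagaac"

# Flag 1: tag only the first match.
# With "|" as the delimiter, the "/" in the end tag needs no backslash.
tag_first() {
  printf '%s\n' "$1" | sed 's|tt[gt]ac[at]| <minus35box>&</minus35box> |1'
}

# Flag g: tag every match, which also catches the later "tttaca".
tag_all() {
  printf '%s\n' "$1" | sed 's|tt[gt]ac[at]| <minus35box>&</minus35box> |g'
}

tag_first "$fragment"
tag_all "$fragment"
```

With g, the second match "tttaca" a few bases downstream is tagged as well, which is why the first-occurrence flag matters for this pattern.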

-10 Box

Using a similar sequence of commands and the consensus sequence given for the -10 box, I was able to mark it. First I ran cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g" to locate the desired sequence, which showed that the pattern occurs more than once. Because the -10 box lies downstream of the -35 box, I needed the first occurrence that appears after the -35 box. To get it, I used the newline technique: I inserted a line break after the tagged -35 box so that the rest of the sequence fell on line 2, then restricted the substitution to that line with sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1". Once this was done, the added newline was removed with sed ':a;N;$!ba;s/\n//g'. The resulting sequence came out to be:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagg
aatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat</minus10box>
cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttga
aacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctg
acgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaag
agaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
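
The newline technique can be sketched end to end as one pipeline. This is an illustration under stated assumptions rather than the exact commands used above: the input is a toy string stitched from two pieces of infA (not a contiguous excerpt) so that an early "tataat" decoy precedes the -35 box, the function name tag_promoter is hypothetical, and the line break is added by inserting an "@" marker and converting it with tr, which is one of several ways to do it.

```shell
# Toy input stitched from two pieces of infA (NOT a contiguous excerpt):
# an early "tataat" decoy followed by the region around the -35 box.
toy="gagtgttataattgcggcaaatctttacttatttacagaacttcggcattatcttgccggttc"

# Steps, in order:
#   1. tag the first -35 box match
#   2. put an "@" marker after the end tag and turn it into a line break
#   3. tag the first -10 box match on line 2 only (downstream of the -35 box)
#   4. remove the line break again
tag_promoter() {
  printf '%s\n' "$1" |
    sed 's/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1' |
    sed 's/<\/minus35box> /&@/' |
    tr '@' '\n' |
    sed '2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1' |
    tr -d '\n'
}

tag_promoter "$toy"; echo
```

Because the substitution is restricted to line 2, the "tataat" upstream of the -35 box is left untagged. Here tr -d '\n' stands in for the sed ':a;N;$!ba;s/\n//g' cleanup.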


Supplementary Information

As a sample answer for the first question, Week 2’s paper handout sequence would have been marked as follows (line breaks are included only for clarity):

agtgta <minus35box>ttgaca</minus35box> tgatagaagcactctac <minus10box>tatatt</minus10box> tcaat
<tss>a</tss> ttcctag <rbs>gagg</rbs> tttgacct <start_codon>atg</start_codon> attgaacttgaa...aataccatggta
<stop_codon>taa</stop_codon> ccca <terminator>gccgccagttccgctggcggcatttt</terminator> aac

Note: The commands needed to generate the output above will be similar, but not exactly the same as the ones needed for infA.

Base your commands on the following hints/guidelines about the gene, plus your own knowledge learned from the past few weeks:

  • The consensus sequence for the -10 site is [ct]at[at]at.
  • The consensus sequence for the -35 site is tt[gt]ac[at].
  • The ideal number of base pairs between the -35 and -10 box is 17, counting from the first nucleotide after the end of the -35 sequence up to the last nucleotide before the -10 sequence.
  • The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box.
  • A consensus sequence for the ribosome binding site is gagg.
  • The first half of the terminator “hairpin” is aaaaggt, where the u in the mRNA binds with a g instead of the usual a.
  • The terminator includes 4 more nucleotides after the hairpin completes.
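
Some of these hints can be written directly into a pattern. The sketch below is one possible alternative to counting lines, not the method the handout prescribes. It assumes the spacing is exactly 17 bases, and it reads the transcription start site hint the way the Week 2 sample answer is marked, with five untagged bases between the -10 box and the tagged base. The function names tag_boxes_by_spacing and tag_tss are hypothetical, the first input is a toy string stitched from two pieces of infA (not a contiguous excerpt), and the second is a fragment of the Week 2 sample answer.

```shell
# Toy input stitched from two pieces of infA (NOT a contiguous excerpt).
toy="gagtgttataattgcggcaaatctttacttatttacagaacttcggcattatcttgccggttc"

# Fragment of the Week 2 sample answer, already tagged up to the -10 box.
week2_tagged="tctac <minus10box>tatatt</minus10box> tcaatattcctag"

# Match the -35 box, exactly 17 bases, then the -10 box, and tag both at once.
tag_boxes_by_spacing() {
  printf '%s\n' "$1" |
    sed 's/\(tt[gt]ac[at]\)\(.\{17\}\)\([ct]at[at]at\)/ <minus35box>\1<\/minus35box> \2 <minus10box>\3<\/minus10box> /'
}

# Skip five bases after the -10 box end tag and tag the next base as the TSS.
tag_tss() {
  printf '%s\n' "$1" |
    sed 's/\(<\/minus10box> \)\(.\{5\}\)\(.\)/\1\2 <tss>\3<\/tss> /'
}

tag_boxes_by_spacing "$toy"
tag_tss "$week2_tagged"
```

A fixed spacing of 17 is stricter than the "ideal" wording of the hint, so a gene whose boxes sit 16 or 18 bases apart would need a range such as \{16,18\} instead.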

Computer Tips

  • Remember that sed is line-based, and that you can add and count lines to get certain things done, say strictly before or after a certain point.
  • Don’t forget how you enforced reading frames in Week 3.
  • If you do add lines or spaces to get the job done, make sure to clean up after yourself by removing them from the final answer.
  • This exercise is difficult enough that you might be thinking to yourself, “I’d rather do this by hand!” This sentiment is understandable, but when you find yourself feeling this way, consider the following:
    • Part of the difficulty is learning these things for the first time. Once you’ve gotten the hang of it, there’s no way that doing things by hand will be faster.
    • Consider trying to do this over and over, for multiple genes, with lots of potential variations. Doing this by hand not only takes longer at this point, but risks errors that a computer won’t make (once the correct commands have been determined).
  • Form your commands so that they can be strung together into a single pipeline of processing directives in the end. In other words, once you’ve figured out how to do each step, no human intervention should be needed to perform everything from beginning to end.
  • You will need the More Text Processing Features wiki page to complete this assignment. The How to Read XML Files wiki page gives you an idea of why the requested output was formatted the way it was.
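
As a small illustration of the pipeline and clean-up tips, the sketch below strings three sed steps together with no manual intervention. It assumes the given sequence is the coding strand, so the mRNA is obtained by replacing t with u, and it assumes reading frames are enforced by grouping every three bases (the Week 3 method may differ). The input is the start of the Week 2 sample's coding region, cut off before the handout's ellipsis, and the function name to_codons is hypothetical.

```shell
# Start of the Week 2 sample's coding region (start codon onward, cut before the ellipsis).
coding="atgattgaacttgaa"

# Steps, in order:
#   1. transcribe: replace every t with u
#   2. enforce the reading frame: add a space after every three bases
#   3. clean up: remove the trailing space added by step 2
to_codons() {
  printf '%s\n' "$1" |
    sed 's/t/u/g' |
    sed 's/.../& /g' |
    sed 's/ $//'
}

to_codons "$coding"
```

The last step is the "clean up after yourself" part: the spacing step leaves one extra space at the end of the line, and removing it inside the pipeline keeps the final answer free of leftovers.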