Difference between revisions of "Jwoodlee Week 4"

From LMU BioDB 2015

Latest revision as of 02:48, 29 September 2015

Transcription and Translation “Taken to the Next Level”

This computer exercise examines gene expression at a much more detailed level than before, requiring knowledge in both the biological aspects of the process and the translation of these steps into computer text-processing equivalents.

The following sequence represents a real gene, called infA and found in E. coli K12. As you might have guessed, it’s stored as infA-E.coli-K12.txt in ~dondi/xmlpipedb/data.

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgc
tcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgtt
gcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc
tttacttatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggcc
aaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaa
cggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtga
ctgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatg
ggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

For each of the following questions pertaining to this gene, provide (a) the actual answer, and (b) the sequence of text-processing commands that calculates this answer. Specific information about how these sequences can be identified is included after the list of questions.

I opened a terminal and used the ssh command to reach dondi's directory, ~dondi/xmlpipedb/data. There I found infA-E.coli-K12.txt, which holds the nucleotide sequence I will be using for this assignment.

  • Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene, as follows (ellipses indicate bases in the sequence; note the spaces before the start tag and after the end tag):
  • -35 box of the promoter
    • As shown in class, I used the sed command to tag the first occurrence of the -35 box pattern in the sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> \n/1"

  • -10 box of the promoter
    • To tag the -10 box correctly, I made a new line after the -35 box, counted 17 characters into the new line, and then made another new line. I then tagged the first match on the 3rd line for the -10 box pattern. This finds the correct sequence, but it breaks the entire sequence up into 3 lines, so I will have to remedy this at the end.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>/1"
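As a small sanity check of the line-addressing trick above, here is a minimal sketch on a made-up sequence (not the infA gene). It assumes GNU sed, since it relies on \n in the replacement text and on the -r flag. The toy sequence places a decoy -10 match before the -35 box; because the -10 substitution is restricted to line 3, the decoy is left untouched.

```shell
# Hypothetical toy sequence (an assumption for illustration only):
# decoy "tataat" + "gg", a -35 box "ttgaca", a 17-base spacer,
# the real -10 box "tataat", and two trailing bases.
spacer="acgtacgtacgtacgta"
seq="tataatggttgaca${spacer}tataatgg"

tagged=$(printf '%s\n' "$seq" \
  | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&\n/g" \
  | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>/1" \
  | sed ':a;N;$!ba;s/\n//g')   # join the three lines back into one

echo "$tagged"
```

The first sed pushes everything after the -35 box onto line 2, the second cuts line 2 after 17 characters, and the third can then only see what sits 17 characters downstream of the -35 box.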

  • transcription start site
    • Knowing that the TSS is the 12th character counting from the first character of the -10 box, I skipped the 5 characters after the last -10 character and made a new line; the first character on that new line is the TSS. How I did that can be seen in the commands below.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>/g"

  • ribosome binding site
    • To find the ribosome binding site, I made a new line after the TSS and found the next occurrence of "gagg" as per the assignment specifications:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>/1"

  • start codon
    • I found the start codon in much the same way as I found the rest of the sequences. I made a new line after the previous sequence, and then found the first occurrence of "atg", and then marked it up.

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>/1"

  • stop codon
    ... <stop_codon>...</stop_codon> ...
    • The stop codon requires finding one of three possible three-character sequences. At first I tried using brackets, "t[ag][ag]", but I soon found that this matched too much: there are only three stop codons, and the brackets match four distinct codons (taa, tag, tga, and tgg). So into the wiki I went, and realized I could use a vertical bar to separate the three codons and search for exactly those. However, this did not work on its own. After being stumped for a while, I realized that before piping to that command I needed to break the line up into groups of three, just like I did in the Week 3 assignment, so that only whole codons in the reading frame can match. As a result I got this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1"
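To illustrate why grouping into triplets matters, here is a minimal sketch on a made-up coding sequence (an assumption, not taken from infA), again assuming GNU sed. The sequence reads gct gaa tag ccc in frame, but it also contains an out-of-frame "tga" that an ungrouped search hits first.

```shell
# Hypothetical coding sequence: in-frame codons are gct gaa tag ccc.
cds="gctgaatagccc"

# Without grouping, the leftmost match is the out-of-frame "tga".
naive=$(printf '%s\n' "$cds" \
  | sed -r "s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1")

# With a space after every three bases, only whole codons can match;
# the spaces are stripped again afterwards.
framed=$(printf '%s\n' "$cds" \
  | sed "s/.../& /g" \
  | sed -r "s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" \
  | sed "s/ //g")

echo "$naive"
echo "$framed"
```

Whether this is exactly what went wrong in the first attempt is a guess; the sketch only shows that the grouped version cannot match across a codon boundary.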
  • terminator
    • The first part of the terminator hairpin is aaaaggt, which means, by the terminator rules provided to us, that it pairs with gcctttt. The trick is then to grab the correct terminator sequence. I split the terminator tagging into two commands: one inserts the opening tag and the other inserts the closing tag. I did this because I wasn't sure how long the sequence between the two hairpin halves would be. The closing-tag pattern also takes in the four characters after gcctttt. This is what I got to capture the terminator sequence:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g"
  • And so, finally, it is all marked up. However, I'm not quite done yet: I need to get rid of all the new lines I created. To do this I used this command: sed ':a;N;$!ba;s/\n//g' (from the wiki), so the final output is as follows.
  • (a)
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcg
gagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttcaaattacggtagtgatacccca
<rbs>gagg</rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttac
tgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgc<stop_codon>tga</stop_codon>
ttgttttaccgcctgatgggcgaagagaaagaacgagt<terminator>aaaaggtcggtttaaccggcctttttatt</terminator>ttat
  • (b) And the final command is as follows:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g'
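The line-joining step at the end of that pipeline can be tried in isolation. This is a minimal sketch on three made-up lines, assuming GNU sed; the tr variant is shown only as a common alternative, not as something the assignment used.

```shell
# :a        define a label named "a"
# N         append the next input line to the pattern space
# $!ba      if this is not the last line, branch back to "a"
# s/\n//g   once everything is collected, remove every newline
joined=$(printf 'abc\ndef\nghi\n' | sed ':a;N;$!ba;s/\n//g')

# Alternative that gives the same result here: delete newlines with tr.
joined_tr=$(printf 'abc\ndef\nghi\n' | tr -d '\n')

echo "$joined"
echo "$joined_tr"
```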


  • What is the exact mRNA sequence that is transcribed from this gene?
    • To get the mRNA sequence I need the sequence from the transcription start site through the terminator. I found it easiest to make new lines based on the markup tags already there; from that point I can pick and choose which lines to transcribe. Using sed, I can delete lines, for example: sed "2,4D". Using this trick, I deleted all unnecessary lines, and every nucleotide that remains is transcribed into mRNA. I was going to make the new lines by typing out a separate sed command for each tag, but it can be done with just two commands, which put each tag on its own line: sed "s/>/&\n/g" | sed "s/</\n&/g". I then delete the tag lines and the unneeded sequence lines, join the remaining lines, and transcribe. Here is the sequence followed by the command.
    • (a)
cgguucaaauuacgguagugauaccccagaggauuagauggccaaagaagacaauauugaaaugcaagguaccguucuug
aaacguugccuaauaccauguuccgcguagaguuagaaaacggucacgugguuacugcacacaucuccgguaaaaugcgca
aaaacuacauccgcauccugacgggcgacaaagugacuguugaacugaccccguacgaccugagcaaaggccgcauugu
cuuccguagucgcugauuguuuuaccgccugaugggcgaagagaaagaacgaguaaaaggucgguuuaaccggccuuuuuauu
    • (b)

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' | sed "s/>/&\n/g" | sed "s/</\n&/g" | sed "1,10D;12D;14D;16D;18D;20D;22D;24D;26D;28D;29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/t/u/g"
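The split, delete, join, and transcribe steps can be checked on something much shorter. This is a minimal sketch on a made-up tagged string (hypothetical, only two tags), assuming GNU sed. It uses lowercase d for the deletions; the uppercase D in the command above behaves the same way when each line holds no embedded newline.

```shell
# Hypothetical tagged string: "aa" and "tt" lie outside the transcript.
tagged="aa<tss>c</tss>ggtt<terminator>acgt</terminator>tt"

# Put every tag on its own line. The result has 9 lines:
# 1 aa  2 <tss>  3 c  4 </tss>  5 ggtt
# 6 <terminator>  7 acgt  8 </terminator>  9 tt
# Keep lines 3, 5 and 7, join them, then replace t with u.
mrna=$(printf '%s\n' "$tagged" \
  | sed "s/>/&\n/g" \
  | sed "s/</\n&/g" \
  | sed "1,2d;4d;6d;8,9d" \
  | sed ':a;N;$!ba;s/\n//g' \
  | sed "s/t/u/g")

echo "$mrna"
```

Deleting the tag lines before the t-to-u substitution matters: otherwise the t characters inside tag names such as tss and terminator would be rewritten as well.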

  • What is the amino acid sequence that is translated from this mRNA?
    • Now I do the same thing as in the mRNA commands, except that the lines I delete are different. My goal was to keep the nucleotides from the start codon up to the stop codon and get rid of everything else. I used the same commands as for the mRNA, but this time I deleted all the lines except 19 and 21 (the start codon and the coding sequence that follows it). This left me with a single run of nucleotides. I broke the nucleotides into groups of three and ran the genetic-code.sed file to translate the codons. The result was this amino acid sequence.
    • (a) M A K E D N I E M Q G T V L E T L P N T M F R V E L E N G H V V T A H I S G K M R K N Y I R I L T G D K V T V E L T P Y D L S K G R I V F R S R
    • (b)

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&\n/g" | sed -r "3s/[ct]at[at]at/<minus10box>&<\/minus10box>\n/1" | sed -r "4s/^.{5}/&\n/g" | sed "5s/^./<tss>&<\/tss>\n/g" | sed "6s/gagg/<rbs>&<\/rbs>\n/1" | sed "7s/atg/<start_codon>&<\/start_codon>\n/1" | sed "8s/.../& /g"| sed -r "8s/tag|tga|taa/<stop_codon>&<\/stop_codon>/1" | sed "8s/ //g" | sed "8s/aaaaggt/<terminator>&/g" | sed -r "8s/gcctttt..../&<\/terminator>/g" | sed ':a;N;$!ba;s/\n//g' | sed "s/>/&\n/g" | sed "s/</\n&/g" | sed "1,18D;20D;22,29D" | sed ':a;N;$!ba;s/\n//g' | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed
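The contents of genetic-code.sed are not shown on this page, so the following is only a minimal sketch of the idea, assuming GNU sed. It uses a hypothetical four-codon table written inline as a stand-in for the real file and translates the first four codons of the coding sequence above.

```shell
# First four codons of the infA coding sequence: atg gcc aaa gaa.
# The four -e rules below are a tiny stand-in for genetic-code.sed,
# whose actual contents are not shown here.
protein=$(printf 'atggccaaagaa\n' \
  | sed "s/.../& /g" \
  | sed "s/t/u/g" \
  | sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/aaa/K/g" -e "s/gaa/E/g" \
  | sed "s/ *$//")   # trim the trailing space left by the grouping step

echo "$protein"
```

The output matches the start of the amino acid sequence in (a). Grouping into triplets first keeps each substitution aligned to a codon, for the same reason as in the stop codon step.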


BIOL 367, Fall 2015, User Page, Team Page