Difference between revisions of "Vpachec3 Week 4"

From LMU BioDB 2015

Revision as of 04:35, 29 September 2015

Modify the Gene Sequence with Tags

-35 Box

-10 Box

My lab partner, Nicole, was a big help and walked me through the Week 4 homework. Here is how far we got:

vpachec3@ab201:/nfs/home/dondi/xmlpipedb/data$ cat infA-E.coli-K12.txt |sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1"|sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"

This is what the command gave us:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box>     cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
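The two pieces doing the work in that command are the `&` in the replacement and the number at the end. A minimal sketch on a made-up string (the `toy` value and the `<hit>` tag are hypothetical, not part of the assignment) shows what each one does:

```shell
# Toy sequence (made up for illustration; not from infA-E.coli-K12.txt).
toy="aacatgggcatggg"

# "&" stands for whatever the pattern matched, so the match is kept
# and the tags are wrapped around it. The trailing 2 means
# "only the second match on the line".
tagged=$(echo "$toy" | sed "s/catg/ <hit>& <\/hit> /2")
echo "$tagged"
```

Trying a pattern on a short string like this first makes it easier to see which match the number picks before running it on the full sequence.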

Right before we stopped to rejoin the larger group discussion, Nicole taught me that \n would break the sequence into two lines. We just didn't get to apply it on the command line yet.

Transcription Start Site

Now I am trying this on my own. I used \n to break the line as a first step toward adding the transcription start site. I wanted to break the sequence into two lines, so I used this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <minus10box>/& \n/g"

However, that put the break right after the opening <minus10box> tag, and I wanted the break after the end of the -10 box, so I modified the command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"

This command gave me:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> 
cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
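The line break can be checked on a small example. This is a sketch that assumes GNU sed, which turns `\n` in the replacement into a real line break; the `toy` string and the `<box>` tag are made up for illustration:

```shell
# Toy input (hypothetical): one line containing a closing tag.
toy="aaa <box>ccc </box> ggg"

# Keep the matched closing tag (&) and add a line break after it,
# so everything that follows the tag moves to line 2.
split=$(echo "$toy" | sed "s/ <\/box> /&\n/")
echo "$split"

# Count the lines to confirm the split happened.
line_count=$(echo "$split" | wc -l | tr -d ' ')
echo "$line_count"
```

Once the rest of the sequence sits on its own line, a later sed in the pipeline can target it with a line number such as `2s/.../.../`.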

Breaking it into two lines makes it easier to insert the transcription start site, because we were told: "The transcription start site is located at the 12th nucleotide after the first nucleotide of the -10 box."

This means that I could count nucleotides to find the transcription start site and reuse commands from before to insert the tag. I used:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/cc/ <tts> /1"

However, this was problematic because it replaced the nucleotides with the tag rather than putting the tag in front of them. Therefore, the command needed revision.

New command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"


This command did exactly what I needed:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> 
cttgccggttcaaatta <tss>c </tss>ggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
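The `2s/c/.../5` form picks the fifth `c` on line 2, so it depends on which letters happen to appear there. An alternative sketch counts positions instead. It assumes a sed that supports `-E` (extended regular expressions); the `toy` input and the `offset` value are placeholders, and the right offset for the assignment depends on how the "12th nucleotide" rule is counted:

```shell
# Toy two-line input (hypothetical), standing in for the split sequence.
toy=$(printf 'aaaa\ncttgccgg\n')

# offset is an assumed value for illustration: how many nucleotides
# to skip on line 2 before the one to tag.
offset=4

# On line 2 only: group 1 captures the first $offset characters,
# group 2 captures the next single character, which gets the tags.
tagged=$(echo "$toy" | sed -E "2s/^(.{$offset})(.)/\1 <tss>\2 <\/tss> /")
echo "$tagged"
```

Counting by position gives the same answer no matter which nucleotide sits at that spot, whereas counting occurrences of one letter has to be worked out again for each sequence.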

Finally, to tie it all back together, I used this command at the end: sed ':a;N;$!ba;s/\n//g'

This leaves me with this sequence:

 ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
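The joining command is dense, so here is a sketch that spells out each part on a made-up two-line input (`toy` is hypothetical). It assumes GNU sed for the one-line `:a;N;$!ba` form, and shows `tr -d '\n'` as a simpler way to get the same joined text:

```shell
# Toy two-line input (hypothetical).
toy=$(printf 'aaa\nccc\n')

# :a       define a label named a
# N        append the next input line to the pattern space
# $!ba     if this is not the last line, jump back to label a
# s/\n//g  once every line is loaded, delete all the line breaks
joined_sed=$(echo "$toy" | sed ':a;N;$!ba;s/\n//g')

# tr -d deletes every newline character, including the final one.
joined_tr=$(echo "$toy" | tr -d '\n')

echo "$joined_sed"
echo "$joined_tr"
```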


Ribosome Binding Site

I assume that adding the ribosome binding site will be similar to adding the other tags. The tricky part should be deciding where on the sequence to put the tag. This is the information given: "A consensus sequence for the ribosome binding site is gagg". So I'm going to break the sequence into two lines and look for the gagg pattern after the transcription start site.

So I broke the line into two:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"

Then I added the ribosome binding site tag into the second line:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"| sed "2s/gagg/ <rbs>& <\/rbs> /1"

Then put the two lines back together to get:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgatacccca <rbs>gagg </rbs> attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

Start Codon

From biology, we know that the start codon is atg and that it falls after the ribosome binding site. So I will divide the sequence into two lines once again and add the start codon tag. After this is done, I will put the two lines back together. Thinking about it now, I could have left the sequence split into 2 or even 3 lines until all the tags were added, and then combined the lines afterward. However, this is good practice. Here is the final product:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact </minus35box> tatttacagaacttcgg <minus10box>cattat </minus10box> cttgccggttcaaatta <tss>c </tss> ggtagtgatacccca <rbs>gagg </rbs> attag <start_codon>atg </start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
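The idea of leaving the lines split until every tag is added can be sketched as follows. This assumes GNU sed (for `\n` in the replacement) and uses a short made-up sequence; it also skips the -35 box and the transcription start site to stay brief. Each step tags a feature and breaks the line right after it, and the `$` address makes the next step look only at the last line, which is always the untagged remainder:

```shell
# Toy sequence (hypothetical; much shorter than the real infA file).
toy="aacattatggcttgaggccatggtt"

result=$(echo "$toy" \
  | sed 's/cattat/ <minus10box>& <\/minus10box> \n/' \
  | sed '$s/gagg/ <rbs>& <\/rbs> \n/' \
  | sed '$s/atg/ <start_codon>& <\/start_codon> /' \
  | tr -d '\n')   # join all the lines once, at the very end
echo "$result"
```

The single quotes matter here: inside double quotes the shell would try to expand `$s` as a variable before sed ever saw it.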

Stop Codon

Unlike the start codon, the stop codon has more than one possible combination. So my plan is to break the sequence into two lines and put in the command to search for multiple combinations: sed "s/.../& /g"

So I tried this command:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>& <\/minus35box> /1" |sed "s/[ct]at[at]at/ <minus10box>& <\/minus10box> /2"| sed "s/ <\/minus10box> /&\n/g"| sed "2s/c/ <tss>& <\/tss> /5"|  sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/tss> /&\n/g"| sed "2s/gagg/ <rbs>& <\/rbs> /1" |sed ':a;N;$!ba;s/\n//g'| sed "s/ <\/rbs> /&\n/g"| sed "2s/atg/ <start_codon>& <\/start_codon> /1"| sed ':a;N;$!ba;s/\n//g' | sed "s/ <\/start_codon> /&\n/g"| sed "2s/tga|tag|taa/ <stop_codon>& <\/stop_codon> /g"


It appeared not to work, and initially I had no clue why. I tried multiple variations of the command, and none of them really worked. So my fallback method, not particularly efficient but still workable, was to search for the 3 stop codons (tga, tag, taa) individually and see which of the three came first.

But then I realized the mistake. The nucleotides need to be read in threes, which I didn't specify in the first command. So now I need to add the sed "s/.../& /g" command to let the computer know to read the sequence in threes, starting at the second line.
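Two things may be going on in the failed command. Besides the reading frame, sed's default (basic) regular expressions treat an unescaped `|` as a literal character, so `tga|tag|taa` looks for that exact text; the `-E` option switches to extended regular expressions, where `|` means "or". The sketch below combines both ideas on a made-up coding sequence (`toy` is hypothetical and is assumed to begin right after the start codon, the way line 2 does after the split):

```shell
# Toy coding sequence (hypothetical). It contains an out-of-frame "tga"
# (gcTGAc...) before the first in-frame stop codon, "tag".
toy="gctgacaaataggcc"

# Step 1: put a space after every three nucleotides to mark the codons.
codons=$(echo "$toy" | sed "s/.../& /g")

# Step 2: tag the first whole codon that is tag, tga, or taa.
# Requiring the space after it means only in-frame codons can match.
tagged=$(echo "$codons" | sed -E "s/(tag|tga|taa) /<stop_codon>\1<\/stop_codon> /")

# Step 3: remove the codon spacing again.
result=$(echo "$tagged" | tr -d ' ')
echo "$result"
```

The tags in this sketch have no spaces inside them so that step 3 can strip the codon spacing without touching them; that differs from the spacing used in the other tags on this page.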

Terminator

Links

Vpachec3 User Page