Difference between revisions of "Kzebrows Week 4"
Revision as of 05:09, 29 September 2015
Transcription and Translation "Taken to the Next Level"
To start this assignment I began by opening Terminal on my laptop. I entered
ssh kzebrows@my.cs.lmu.edu
followed by my password to log into the LMU CMSI database. As I usually do, I entered the following commands in order to enter Dr. Dionisio's directory, list the files in the directory, and choose the appropriate file for this assignment:
cd ~dondi/xmlpipedb/data; ls; cat infA-E.coli-K12.txt
This took me to the E.coli file and showed me the nucleotide sequence. To complete this assignment I frequently used this page as a resource.
I began by using grep to find the potential -35 box and -10 box because grep highlights the searched pattern in red. I simply entered
cat infA-E.coli-K12.txt | grep "tt[gt]ac[at]"
which gave me two possible answers for the -35 box, tttact and tttaca, both of which fit the pattern. Now it was a matter of finding out which one was the correct one. I also searched for the -10 box using
cat infA-E.coli-K12.txt | grep "[ct]at[at]at"
which also revealed two potential sites, tataat and cattat. To work out which candidates were the correct ones I needed to see both sets together in the sequence, which grep doesn't do, so I used sed instead. I entered the sed commands as a pipe and added three spaces on either side of each occurrence of the consensus sequences (both -35 and -10) to make them more visible. This is done with sed "s/<pattern>/ & /g", where <pattern> is what I wish to find, "&" stands for the matched text, and the spaces around the "&" are what I wished to add to each side of the match (instructions found here). The pipe looked like this:
cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ & /g" | sed "s/tt[gt]ac[at]/ & /g"
This made it clear that it was the first -35 box option, tttact, and the second -10 box option, cattat, that I was looking for in this gene. Using this information, it was then much simpler for me to highlight the specific sequences for the assignment.
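The two steps above can be tried on a short string before running them on the real file. This is a minimal sketch: the sequence in seq is a made-up toy, not the infA file, and a single space is used for padding.

```shell
# Hypothetical toy sequence containing two -35 candidates and one -10 candidate.
seq="ggtttactccgcattatggtttacagg"

# grep -o prints only the matching text, one match per line.
candidates=$(echo "$seq" | grep -o "tt[gt]ac[at]")
echo "$candidates"

# sed pads every match of either pattern with spaces so both kinds
# of candidate stand out in the context of the whole sequence.
spaced=$(echo "$seq" | sed "s/[ct]at[at]at/ & /g" | sed "s/tt[gt]ac[at]/ & /g")
echo "$spaced"
```

Listing the matches with grep -o shows how many candidates there are, while the sed version shows where they sit relative to each other.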
To highlight the -35 box, I needed to use sed to put <minus35box> tags on each side of the first option, along with three spaces. The Text Processing page of the wiki explains that I can replace g with the number of the occurrence I wish to change. Because I only needed the first option (tttact) to be highlighted, the command looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
Next, to highlight the -10 box, I did the same thing except my goal was to add <minus10box> to each side of the second -10 box option. The command looked like this:
cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/ <minus10box>&<\/minus10box> /2"
This highlighted the -10 box, cattat.
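The occurrence number at the end of the substitution can be checked on a small example. The sketch below uses a made-up toy sequence with two -10 candidates and tags only the second one.

```shell
# Hypothetical toy sequence: tataat is the first candidate, cattat the second.
seq="ggtataatccgcattatgg"

# The trailing 2 tells sed to replace only the second match on the line.
tagged=$(echo "$seq" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/2")
echo "$tagged"
```

Changing the 2 to a 1 would tag tataat instead, and g would tag both.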
In order to find the transcription start site, I learned from the assignment page that the site is located at the 12th nucleotide after the first nucleotide of the -10 box. Since the -10 box is six nucleotides long, this means that the start of transcription is the sixth nucleotide after cattat. To find it, I broke up the gene by inserting a new line right after the -35 box. In the "picking lines" section of More Text Processing Features, I found that I could then restrict a substitution to the second line by replacing sed s///g with sed 2s///g. This command looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
I noted that it should be /1, not /2, after the -10 box: because line 2 only contains what comes after the -35 box, cattat is now the first occurrence of [ct]at[at]at on that line.
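The effect of splitting the line and then addressing line 2 can be seen on a toy example. This sketch assumes GNU sed (which turns \n in the replacement into a line break) and uses a made-up sequence with one -10 candidate before the -35 box and one after it.

```shell
# Hypothetical toy sequence: tataat comes before the -35 box, cattat after it.
seq="ggtataatcctttactgccattatcttgccgg"

result=$(echo "$seq" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1" \
  | sed "s/<\/minus35box>/&\n/" \
  | sed "2s/[ct]at[at]at/<minus10box>&<\/minus10box>/1")
echo "$result"
```

Because the substitution is limited to line 2, the tataat on line 1 is never considered, so the first match on line 2 is the one that gets tagged.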
My next goal was to find a command that would allow me to skip over 5 more nucleotides to the transcription start site <tss>...</tss> on the 6th nucleotide after the -10 box. I did this by adding the command
sed -r "s/<\/minus10box> (.){5}/&\n/g"
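which indicated that I meant to skip over 5 characters (the number in the curly braces). The -r option turns on extended regular expressions, which is what allows the parentheses and curly braces to be written without backslashes.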
This had me starting at the 10th nucleotide, not the 12th. I realized that this was because I had added extra spaces around the <minus10box>...</minus10box>, and the spaces counted as (.). To fix this, I put {7} in curly braces instead of {5}, which gave me a newline at the right nucleotide (the 12th one). Then, to highlight the transcription start site I added
sed "3s/^./<tss>&<\/tss> /g"
to tell the computer that I wished to add <tss> labels around the first character in the third line. The command looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g"
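The count of 7 can be checked in isolation. The sketch below assumes GNU sed and assumes the closing tag is followed by three spaces, as described earlier. The pattern's literal space uses up one of them and the other two are matched by the dot. The toy line is made up and starts at the -10 box, so the split produces line 2 rather than line 3.

```shell
# Hypothetical toy line: tagged -10 box, three spaces, then a few nucleotides.
line="<minus10box>cattat</minus10box>   cttgccggttc"

# {5}: 1 literal space + 2 spaces + 3 nucleotides, so line 2 starts too early.
with5=$(echo "$line" | sed -E "s/<\/minus10box> (.){5}/&\n/" | sed -n "2p")

# {7}: 1 literal space + 2 spaces + 5 nucleotides, so line 2 starts at the
# sixth nucleotide after the box, which is then tagged as the start site.
with7=$(echo "$line" | sed -E "s/<\/minus10box> (.){7}/&\n/" \
  | sed "2s/^./<tss>&<\/tss>/" | sed -n "2p")

echo "$with5"
echo "$with7"
```

An alternative that avoids counting spaces would be to match only nucleotides, for example with [acgt] in place of the dot, but that is a different pattern from the one used here.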
Next, to find the ribosome binding site (which has to be after the transcription start site), I searched the same line (line 3) for gagg, as hinted by the assignment page. I did this by invoking the command
sed "3s/gagg/ <rbs>&<\/rbs> /1"
just like I did for the -35 box much earlier. The pipe then looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1"
For the next part I needed to find the start codon, which codes for f-Met. This is AUG in the mRNA, but since this is the mRNA-like strand of the DNA, the sequence is ATG. To find this ATG, I added a new line after the ribosome binding site and used sed to search for the next occurrence of ATG after that. I did this by adding two commands to the pipe, as seen below, following the same pattern as for the other sites.
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1"
Next I was presented with the challenge of finding the stop codon, which is TAA, TAG, or TGA on this strand of the DNA. From our Week 3 assignment I remembered that it would be necessary to space out the nucleotides into 3-nucleotide codons in order to find the stop codon, and from Introduction to the Command Line I recalled the command for this, which is sed "s/.../& /g". I began a new line after the start codon using the same newline command as earlier (sed "s/<\/start_codon>/&\n/g") and then invoked the codon-spacing command. Once everything was separated into codons it became very easy to find the stop codon; all I had to do was tag it. The only difference from the earlier tags was that the pattern was t[ag][ga], with each pair of brackets representing an either/or choice, restricted to line 5 and ending in /1 in order to find the first occurrence of t[ag][ga]. The pipe looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1"
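One thing to keep in mind with t[ag][ga] is that it also matches tgg, which is not a stop codon, and that sed takes the first match wherever it falls, whether or not it lines up with a codon boundary. As a possible cross-check (a sketch only, not the method used in the pipe above), the text after the start codon can be cut into codons with fold and searched for whole-line matches. The toy sequence below is made up.

```shell
# Hypothetical toy sequence standing in for the text after the start codon.
after_start="gccaaagaagacaatattgaaatgtaagg"

# First match anywhere in the string, ignoring the reading frame.
naive=$(echo "$after_start" | grep -o "t[ag][ga]" | head -n 1)

# One codon per line; -x keeps only lines that are exactly a stop codon,
# and -n reports which codon number it is.
in_frame=$(echo "$after_start" | fold -w 3 | grep -n -x -E "taa|tag|tga" | head -n 1)

echo "naive: $naive"
echo "in frame: $in_frame"
```

In this toy case the naive search reports a tga that straddles two codons, while the frame-aware search reports taa as the ninth codon.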
The final part of this portion of the assignment, locating the terminator, was the hardest. In class, Dr. Dionisio discussed with us how the first half of the sequence is AAAAGGT. Because it is a hairpin, however, I needed to find the reverse of this sequence, which is TGGAAAA, and find the complement, making the sequence ACCTTTT. Then Dr. Dionisio also said that the T binds with a G instead, so the second part of the sequence is actually GCCTTTT. We were also given the hint that there were 4 nucleotides after the terminator sequence.
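The reverse-and-complement reasoning above can be reproduced on the command line. This is a small sketch that assumes the rev and tr utilities are available; the final sed step stands in for the T-pairs-with-G substitution described in class.

```shell
first_half="aaaaggt"

# Reverse the string, then swap each base for its complement.
revcomp=$(echo "$first_half" | rev | tr "acgt" "tgca")

# Replace the leading a with g, since the T is said to pair with a G instead.
second_half=$(echo "$revcomp" | sed "s/^a/g/")

echo "$revcomp"
echo "$second_half"
```

The first line printed is acctttt and the second is gcctttt, matching the sequences worked out by hand.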
To start, I wanted to add a new line directly after the first half of the sequence, which I already knew, using the newline command. When I first tried this it wouldn't work, and I realized it was because I hadn't removed the spaces left over from finding the stop codon. I invoked sed "s/ //g" to get rid of the spaces and proceeded from there, tagging the AAAAGGT sequence. This command set looked like this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ <terminator>& /1"
For the last part, I needed to find where GCCTTTT was. To do this, I first added yet another new line after the first half of the sequence. I then searched the new line for the last half of the sequence plus four dots (....) to stand for the four unknown characters after it. I also needed to remove all of the line breaks that I had added while highlighting these sites, which is done using the command sed ':a;N;$!ba;s/\n//g' from More Text Processing Features.
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1"
The last four characters ended up being TTAT. The final set of commands is this:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g" | sed "3s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs>/&\n/g" | sed "4s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon>/&\n/g" | sed "s/.../& /g" | sed "5s/t[ag][ga]/ <stop_codon>&<\/stop_codon> /1" | sed "s/ //g" | sed "5s/aaaaggt/ <terminator>& /1" | sed "s/aaaaggt/&\n/g" | sed "6s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'
Which gives a final sequence that looks like this:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttcaaattacggtagtgatacccca<rbs>gagg</rbs>attag<start_codon>atg</start_codon>gccaaagaagacaatat<stop_codon>tga</stop_codon>aatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggt cggtttaaccggcctttttatt</terminator> ttat
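The line-joining step at the end of the pipe can be tried on its own. The sketch below assumes GNU sed and uses a made-up three-line input. The tr command is shown only as a shorter alternative that gives the same result here.

```shell
# sed version: :a sets a label, N appends the next line to the pattern space,
# $!ba loops back until the last line, then s/\n//g deletes the line breaks.
joined=$(printf 'aaa\nccc\nggg\n' | sed ':a;N;$!ba;s/\n//g')

# tr version: delete every newline character.
same=$(printf 'aaa\nccc\nggg\n' | tr -d '\n')

echo "$joined"
echo "$same"
```

Both commands print aaacccggg. The sed form fits naturally at the end of a pipe that is already made of sed commands.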