Anuvarsh Week 4

From LMU BioDB 2015
Revision as of 21:51, 27 September 2015 by Anuvarsh (Talk | contribs) (finished part 1 of homework)

Transcription and Translation "Taken to the Next Level"

Before anything else, I logged into my account using:

   ssh avarshne@my.cs.lmu.edu

After entering my password, I moved into the directory where I had copied infA-E.coli-K12.txt from Dondi's library.

   cd biodb2015

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene

To complete this task, I reviewed the Introduction to the Command Line page and looked over the More Text Processing Features page. My partner Ron Legaspi and I were then led through the first couple of steps of the homework in class, where we learned how to add the -35 box and -10 box tags. We first searched infA-E.coli-K12.txt for all instances of the -35 box sequence, which was provided as a hint on the homework assignment, using the following command:

   grep "tt[gt]ac[at]" infA-E.coli-K12.txt

When we ran this search, we noticed that there were 2 instances of this pattern with only two nucleotides between them. Because we understood that the -10 box must occur after the -35 box, we needed to see where the -10 box candidates fell relative to the -35 box candidates. A plain grep was not enough for this, because it prints whole matching lines, which makes it hard to compare the positions of two different patterns. To locate both sequences relative to each other, we marked every match of both patterns with sed:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/    ***&***    /g" | sed "s/[ct]at[at]at/    ***&***    /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt    ***tataat***    tgcggtcgcagagttggttacgctca
   ttaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcgg
   cttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc    ***tttact***    ta    ***tttaca***
   gaacttcgg    ***cattat***   cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctc
   cggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaagg
   ccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggccttttt
   attttat
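As a cross-check on the marked-up output above, grep can also list the matches of both patterns in the order in which they occur. This is only a sketch: it assumes GNU grep (for -o), and it works on a short excerpt of the sequence held in a hypothetical variable named seq rather than on the full file.

```shell
# Short excerpt of the sequence around the promoter (not the whole file).
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgcc"

# -E enables alternation with |; -o prints each match on its own line,
# in left-to-right order.
matches=$(printf '%s\n' "$seq" | grep -o -E 'tt[gt]ac[at]|[ct]at[at]at')
printf '%s\n' "$matches"
```

On this excerpt the two -35 box candidates (tttact, tttaca) are listed before the -10 box candidate (cattat), which agrees with the order seen in the sed output.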

At this point it became clear that the "real" -35 box was the first instance of its pattern, and that the "real" -10 box was the second instance of its pattern, that is, the first instance after the -35 box. We began by tagging the -35 box. To replace only the first instance of a pattern with sed, we found that we just needed to replace the "g" in sed "s///g" with "1", as described on the More Text Processing Features page. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccg
   ctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagcc
   gtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
   tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatct
   ccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcc
   tttttattttat
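The difference between the "1" and "g" flags is easy to see on a toy string. The sketch below uses a made-up sequence in a hypothetical variable seq and square brackets instead of the real tags, purely to keep the output short.

```shell
# Made-up sequence containing two matches of the -35 pattern.
seq="aatttactcctttacagg"

# Flag 1: only the first (leftmost) match on the line is replaced.
first_only=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/[&]/1')

# Flag g: every non-overlapping match on the line is replaced.
every=$(printf '%s\n' "$seq" | sed 's/tt[gt]ac[at]/[&]/g')

printf '%s\n%s\n' "$first_only" "$every"
```

The first result is aa[tttact]cctttacagg and the second is aa[tttact]cc[tttaca]gg.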

To tag the -10 box, Dondi gave us the hint that we should insert a new line after the -35 box tag. This helps because the "correct" -10 box should lie only a few nucleotides downstream of the -35 box: once the line is broken there, the search for the -10 box can begin at line 2. To insert the new line, we referred to the More Text Processing Features page, which indicated that we should use &\n as the replacement text. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" 
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccg
   ataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcgga
   gtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
   aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgc
   gcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgta
   gtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

At this point, all we needed to do was search for and replace the first instance of the -10 box after the line break. In class, we were given the hint that, to restrict a search and replace to the second line of the text, we should change sed "s///g" to sed "2s///g". This led us to the following command:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | 
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgat
   aaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaa
   tgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattag
   atggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgt
   ggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacc
   tgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggc
   ctttttattttat
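The "break the line, then address line 2" trick can be checked on a small input where a decoy -10 pattern sits before the -35 box. This is a minimal sketch assuming GNU sed (for \n in the replacement); the sequence in seq is made up, and the tags are written without surrounding spaces to keep the result compact.

```shell
# Made-up sequence: a decoy "cattat" comes BEFORE the -35 box,
# and the wanted "cattat" comes after it.
seq="cattatggtttactaacattatcc"

tagged=$(printf '%s\n' "$seq" \
  | sed 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1' \
  | sed 's/<\/minus35box>/&\n/' \
  | sed '2s/[ct]at[at]at/<minus10box>&<\/minus10box>/1')

printf '%s\n' "$tagged"
```

Because the last substitution is restricted to line 2, the decoy on line 1 is left alone and only the match downstream of the -35 box is tagged.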

At this point, class had ended and Ron and I split up, so I continued working on the assignment by myself. The next step was to tag the transcription start site (TSS), with the hint that it is the 12th nucleotide after the first nucleotide in the -10 box. Because the -10 box is 6 nucleotides long, the 12th nucleotide after its first nucleotide is the 7th nucleotide after the end of the -10 box. To find it, I first inserted another new line 6 nucleotides after the -10 box, using the repetition shortcut outlined in More Text Processing Features. I could then tag the first nucleotide in line 3 as the transcription start site.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
   taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagta
   atgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaa
   acgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgc
   atcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcc
   tgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
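The position arithmetic for the TSS can be double-checked on a short excerpt. In the sketch below, the hypothetical variable downstream holds a few nucleotides starting at the first nucleotide of the -10 box; the second command is an alternative way of tagging with capture groups (sed -E), not the line-splitting approach used in this journal.

```shell
# Excerpt starting at the first nucleotide of the -10 box (cattat).
downstream="cattatcttgccggttcaa"

# Position 1 is the first nucleotide of the box, so the 12th nucleotide
# after it sits at position 13.
tss=$(printf '%s\n' "$downstream" | cut -c13)

# Same idea with sed: keep the first 12 characters, wrap the 13th.
tagged=$(printf '%s\n' "$downstream" | sed -E 's/^(.{12})(.)/\1 <tss>\2<\/tss> /')

printf '%s\n%s\n' "$tss" "$tagged"
```

Both approaches pick out the same g that follows cttgcc in the output above.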

To tag the ribosome binding site (RBS), I followed the same pattern as before: I inserted a new line after the transcription start site tag, then searched line 4 for the ribosome binding sequence and replaced its first instance. The sequence for the RBS is gagg, as outlined in the hints portion of the homework assignment.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccg
   aacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgt
   tgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgac
   gggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcga
   agagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

To find the start codon, I followed the same pattern as before: I first added a new line after the ribosome binding site and then searched for the start codon sequence because the start codon would only exist after the RBS. On the mRNA the start codon is AUG, so the mRNA-like strand of DNA would be ATG. There can only be one start codon, so only the first instance of ATG after the RBS will be tagged.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgcc
   gaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgt
   tccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtg
   actgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacga
   gtaaaaggtcggtttaaccggcctttttattttat
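The RBS and start codon steps repeat the same two moves: break the line after the previous closing tag, then tag the first match on the new line. As a sketch only, those two moves could be wrapped in a small shell function. The name tag_next and its argument order are hypothetical (they do not come from the course pages), GNU sed is assumed, and the input is a shortened made-up line.

```shell
# tag_next PREV_TAG LINE PATTERN NEW_TAG  (hypothetical helper)
#   1. break the line after "</PREV_TAG> "
#   2. on line number LINE, wrap the first match of PATTERN in NEW_TAG
tag_next() {
  sed "s/<\/$1> /&\n/" | sed "${2}s/$3/ <$4>&<\/$4> /"
}

# Shortened example line that already carries a TSS tag.
result=$(printf '%s\n' ' <tss>g</tss> gttcagaggattagatggcc' \
  | tag_next tss 2 gagg rbs \
  | tag_next rbs 3 atg start_codon)

printf '%s\n' "$result"
```

The result has three lines: the TSS tag, the line ending in the RBS tag, and a line in which the first atg after the RBS is tagged as the start codon.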

The stop codon can be UAA, UAG, or UGA on the mRNA, so on the mRNA-like strand it reads TAA, TAG, or TGA. Furthermore, the stop codon must lie a multiple of 3 nucleotides away from the start codon, because it must be a whole number of codons downstream of the start codon in order to be read in frame. To find the stop codon, therefore, I followed the same procedure as earlier (inserting a new line after the previous tag), but before searching for the stop codon and replacing it with the tagged sequence, I first split all of the nucleotides in line 6 into 3-nucleotide codons. Then I could search for the stop codon sequence and tag it. Finally, the spaces between the codons had to be removed. When I first ran this command, I realized that it also removed the spaces surrounding the stop codon tag, so I went back and replaced the stop codon tags with correctly spaced versions.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataa
   ggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgc
   cgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> 
   gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
   gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagtaaa
   aggtcggtttaaccggcctttttattttat
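One thing worth noting about the stop codon pattern: t[ag][ag] matches taa, tag and tga, but it also matches tgg, which is not a stop codon. That did not matter for this gene, since the first in-frame match was tga, but the toy example below shows the difference. It is a sketch with a made-up reading frame in a hypothetical variable orf, a shortened tag name, and sed -E for the alternation.

```shell
# Made-up reading frame: gcc tgg ata gca tga ttt
# (an out-of-frame "tag" also spans the ata|gca boundary).
orf="gcctggatagcatgattt"

# Split into codons so that only in-frame triplets can match.
codons=$(printf '%s\n' "$orf" | sed 's/.../& /g')

# Character-class pattern from the journal: stops at tgg.
loose=$(printf '%s\n' "$codons" | sed 's/t[ag][ag]/<stop>&<\/stop>/' | sed 's/ //g')

# Explicit list of the three stop codons: skips tgg and finds tga.
strict=$(printf '%s\n' "$codons" | sed -E 's/(taa|tag|tga)/<stop>&<\/stop>/' | sed 's/ //g')

printf '%s\n%s\n' "$loose" "$strict"
```

Here the tag is written without spaces inside it, so removing the codon spaces afterwards does not disturb it; the pipeline above instead re-adds the spaces around the tag after the fact.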

The terminator was the most challenging sequence to tag. The hint in the homework assignment informed me that the first half of the terminator hairpin sequence is "AAAAGGT". The other half would need to be complementary to this strand and in reverse order, with the exception that the T would pair with a G. This meant that the other half of the terminator hairpin sequence would need to be "GCCTTTT". The hint also informed me that the terminator does not end until 4 nucleotides after the end of the second half of the hairpin sequence. Given this information, I could roughly work out which portion of the sequence I needed to tag. However, the pattern that I used for the previous search-and-tag steps would not be as useful, since I did not know the number of nucleotides between the first and second halves of the hairpin sequence. To tag the terminator, I decided to tag the first half first, insert a new line, and then tag the second half.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataag
   gaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgcc
   gaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> 
   gttcaaattacggtagtgatacccca <rbs>gagg</rbs> 
   attag <start_codon>atg</start_codon> 
   gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttact
   gcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> 
   ttgttttaccgcctgatgggcgaagagaaagaacgagt <terminator>aaaaggt 
   cggtttaaccggcctttttatt</terminator> ttat
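The hairpin reasoning can be checked from the command line. The sketch below first computes the strict reverse complement of the first half, which differs from GCCTTTT only in its first base (the G that pairs with the final T). It then shows an alternative way of tagging the whole terminator with a single pattern, not the two-step approach used above. It assumes the rev utility is available, uses sed -E, and works on the last stretch of the sequence held in a hypothetical variable tail_seq.

```shell
first_half="aaaaggt"

# Strict Watson-Crick reverse complement: reverse, then swap a<->t and c<->g.
strict_rc=$(printf '%s\n' "$first_half" | rev | tr 'acgt' 'tgca')

# End of the sequence, from just upstream of the hairpin onward.
tail_seq="gagtaaaaggtcggtttaaccggcctttttattttat"

# One pattern: first half, anything in between, second half, 4 more nucleotides.
tagged=$(printf '%s\n' "$tail_seq" \
  | sed -E 's/aaaaggt.*gcctttt.{4}/<terminator>&<\/terminator>/')

printf '%s\n%s\n' "$strict_rc" "$tagged"
```

The single-pattern version relies on .* being greedy, so it would pick the last gcctttt on the line if there were more than one; in this excerpt there is only one.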

The only thing left to do is to remove all of the new lines that I created. In order to do this, I referred to the More Text Processing Features page and found the command for combining lines.
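The line-joining command is easier to follow on a toy input. In the sketch below (GNU sed assumed, three made-up lines), the sed script reads every line into the pattern space with N, looping back to the label a until the last line is reached, and then deletes all of the embedded newlines. For comparison, tr -d removes the newlines directly.

```shell
# sed: :a sets a label, N appends the next line, $!ba loops until the
# last line, then s/\n//g removes the newlines that were collected.
joined_sed=$(printf 'aa\nbb\ncc\n' | sed ':a;N;$!ba;s/\n//g')

# tr: delete every newline character in the stream.
joined_tr=$(printf 'aa\nbb\ncc\n' | tr -d '\n')

printf '%s\n%s\n' "$joined_sed" "$joined_tr"
```

Both give aabbcc. The full pipeline with the line-joining step added at the end is as follows.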

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" |
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g" | sed "s/<\/tss> /&\n/g" | sed "4s/gagg/ <rbs>&<\/rbs> /1" | sed "s/<\/rbs> /&\n/g" |
   sed "5s/atg/ <start_codon>&<\/start_codon> /1" | sed "s/<\/start_codon> /&\n/g" | sed "6s/.../& /g" |
   sed "6s/t[ag][ag]/ <stop_codon>&<\/stop_codon> /1" | sed "6s/ //g" | sed "6s/<stop_codon>/ <stop_codon>/g" |
   sed "6s/<\/stop_codon>/<\/stop_codon> /g" | sed "s/<\/stop_codon> /&\n/g" | sed "7s/aaaaggt/ <terminator>& \n/1" |
   sed "8s/gcctttt..../&<\/terminator> /1" | sed ':a;N;$!ba;s/\n//g'
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgataagga
   atttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaatgtgccgaacc
   tgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcgg <minus10box>cattat
   </minus10box> cttgcc <tss>g</tss> gttcaaattacggtagtgatacccca <rbs>gagg</rbs> attag <start_codon>atg
   </start_codon> gccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcac
   gtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctga
   gcaaaggccgcattgtcttccgtagtcgc <stop_codon>tga</stop_codon> ttgttttaccgcctgatgggcgaagagaaagaacgagt 
   <terminator>aaaaggt cggtttaaccggcctttttatt</terminator> ttat

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS

Assignment Pages

Week 1 Assignment
Week 2 Assignment
Week 3 Assignment
Week 4 Assignment
Week 5 Assignment
Week 6 Assignment
Week 7 Assignment
Week 8 Assignment
Week 9 Assignment
Week 10 Assignment
Week 11 Assignment
Week 12 Assignment
No Week 13 Assignment
Week 14 Assignment
Week 15 Assignment

Individual Journals

Individual Journal Week 2
Individual Journal Week 3
Individual Journal Week 4
Individual Journal Week 5
Individual Journal Week 6
Individual Journal Week 7
Individual Journal Week 8
Individual Journal Week 9
Individual Journal Week 10
Individual Journal Week 11
Individual Journal Week 12
Individual Journal Week 14
Individual Journal Week 15

Shared Journals

Class Journal Week 1
Class Journal Week 2
Class Journal Week 3
Class Journal Week 4
Class Journal Week 5
Class Journal Week 6
Class Journal Week 7
Class Journal Week 8
Class Journal Week 9
GÉNialOMICS Journal Week 10
GÉNialOMICS Journal Week 11
GÉNialOMICS Journal Week 12
GÉNialOMICS Journal Week 14
GÉNialOMICS Journal Week 15