Difference between revisions of "Anuvarsh Week 4"

From LMU BioDB 2015

Revision as of 21:02, 27 September 2015

Transcription and Translation "Taken to the Next Level"

Before anything else, I logged into my account using:

   ssh avarshne@my.cs.lmu.edu

I then entered my password and moved into the directory into which I had copied infA-E.coli-K12.txt from Dondi's library:

   cd biodb2015

Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene

To complete this task, I reviewed the Introduction to the Command Line page and the More Text Processing Features page. My partner Ron Legaspi and I were then led through the first couple of steps of the homework in class, where we learned how to add the -35 box and -10 box tags. We first searched infA-E.coli-K12.txt for all instances of the -35 box sequence, which was given to us as a hint on the homework assignment, using the following command:

   grep "tt[gt]ac[at]" infA-E.coli-K12.txt
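As a side note, grep prints whole matching lines, so it does not pick out the individual matches by itself. A minimal sketch of listing each match separately with the -o option (available in GNU and BSD grep) follows. The short seq fragment is a hypothetical stand-in cut from the promoter region of the sequence, used so the example does not need the full file.

```shell
# Hypothetical stand-in: a short fragment of the infA promoter region,
# used here so the example does not depend on infA-E.coli-K12.txt.
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

# -o prints only the matched text, one match per line.
matches=$(printf '%s\n' "$seq" | grep -o "tt[gt]ac[at]")
printf '%s\n' "$matches"
```

On this fragment the two -35 candidates, tttact and tttaca, are printed on separate lines.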

When we ran this test, we noticed that there were two instances of this pattern with only two nucleotides between them. Because we understood that the -10 box must occur after the -35 box, we searched for the -10 box sequence while also searching for the -35 box. Here grep on its own was not enough: as we were using it, it searches for one pattern at a time and does not show where the matches fall relative to each other. To locate both sequences relative to each other, we ran the following command:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/    ***&***    /g" | sed "s/[ct]at[at]at/    ***&***    /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt    ***tataat***    tgcggtcgcagagttggttacgctca
   ttaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcgg
   cttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc    ***tttact***    ta    ***tttaca***
   gaacttcgg    ***cattat***   cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctc
   cggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaagg
   ccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggccttttt
   attttat
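The same highlighting can also be written as a single sed call with two -e expressions instead of two piped sed commands. This is only a sketch of an equivalent form, not the command we ran in class; the seq fragment is a hypothetical stand-in for the file, and the markers use single spaces to keep the output short.

```shell
# Hypothetical stand-in for the promoter region of infA-E.coli-K12.txt.
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

# Each -e adds one substitution; they run in order on every line.
marked=$(printf '%s\n' "$seq" |
  sed -e "s/tt[gt]ac[at]/ ***&*** /g" -e "s/[ct]at[at]at/ ***&*** /g")
printf '%s\n' "$marked"
```

On the real file, the fragment would be replaced by the file name as the last argument to sed.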

At this point it became very clear that the "real" -35 box was the first match of the -35 pattern, and the "real" -10 box was the second match of the -10 pattern, that is, the first one after the -35 box. We began by tagging the -35 box. To replace just the first instance of a sequence using sed, we found that we only needed to replace the "g" in sed "s///g" with "1". This tells sed to replace only the first instance on each line. We found this information in the More Text Processing Features page. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccg
   ctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagcc
   gtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
   tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
   aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatct
   ccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
   ggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcc
   tttttattttat
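To make the effect of the number flag concrete, here is a small sketch comparing "1" with "2" on a short piece of the sequence. The seq fragment and the variable names are hypothetical and only serve the illustration.

```shell
# Hypothetical stand-in for the promoter region of infA-E.coli-K12.txt.
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

# A number after the final slash selects which match on the line to replace.
first_only=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1")
second_only=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /2")

printf '%s\n' "$first_only"
printf '%s\n' "$second_only"
```

With "1" the tag lands on tttact, and with "2" it lands on tttaca, two nucleotides further along.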

To tag the -10 box, Dondi gave us a hint: insert a new line after the -35 box tag. This helps because the "correct" -10 box should be found only a few nucleotides after the -35 box, so once the text is split there we can begin our search for it at line 2. To insert the new line, we referred to the More Text Processing Features page, which indicated that we should put &\n in the replacement, where & stands for the matched text and \n for a new line. The resulting command was as follows:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" 
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccg
   ataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcgga
   gtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
   aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgc
   gcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgta
   gtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
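One caveat worth noting: \n in the replacement is understood by GNU sed, which is what we used here, but some other sed versions (for example the one shipped with macOS) treat it differently. A sketch of a form that should work in both is a backslash followed by an actual line break inside single quotes. The tagged string below is a hypothetical stand-in for the output of the previous step.

```shell
# Hypothetical stand-in: the promoter fragment with the -35 box already tagged.
tagged="gcgcaaatc <minus35box>tttact</minus35box> tatttacagaacttcggcattatcttgccggttcaaattacgg"

# Backslash + real newline in the replacement inserts a line break.
split=$(printf '%s\n' "$tagged" | sed 's/<\/minus35box>/&\
/')
printf '%s\n' "$split"
```

The second line of the result starts with the space that followed the closing tag, just as in the output above.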

At this point, all we needed to do was search for and replace the first instance of the -10 box after the line break. In class, we were given a hint that, to apply a search and replace only to the second line of the text, we should put the line number in front of the command, changing sed "s///g" to sed "2s///g". This led us to the following command:

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | 
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgat
   aaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaa
   tgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattag
   atggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgt
   ggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacc
   tgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggc
   ctttttattttat
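The sketch below illustrates what the leading 2 does: a match on line 1 is left alone and only line 2 is changed. The two input lines are hypothetical stand-ins, short pieces of the sequence from before and after the -35 box.

```shell
# Hypothetical two-line input: line 1 contains the upstream tataat,
# line 2 contains cattat, the -10 box we actually want to tag.
tagged=$(printf '%s\n' "gtgttataattgcgg" " tatttacagaacttcggcattatcttgcc" |
  sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")
printf '%s\n' "$tagged"
```

Line 1 comes out unchanged even though tataat matches the pattern, because the address restricts the substitution to line 2.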

At this point, class had ended and we were on our own. Ron and I split up, and I continued working on this assignment by myself. The assignment next asked us to tag the transcription start site (TSS), with the hint that it is the 12th nucleotide after the first nucleotide in the -10 box. Because the -10 box is 6 nucleotides long, 5 of those 12 nucleotides fall inside the box, so I knew the transcription start site would be the 7th nucleotide after the end of the -10 box. To find it, I first inserted another line break 6 nucleotides after the -10 box using the repetition shortcut outlined in More Text Processing Features. I could then tag the first nucleotide in line 3 as the transcription start site.

   cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | 
   sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" |
   sed "3s/^./ <tss>&<\/tss> /g"
   ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
   taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagta
   atgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
    tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
    <tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaat
   accatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacggg
   cgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcga
   agagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
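As a quick check on the counting, the position of the transcription start site can also be computed directly from the untagged sequence. This is only a sketch using awk, which was not part of the assignment; the seq fragment is a hypothetical stand-in for the region after the -35 box.

```shell
# Hypothetical stand-in for the promoter region of infA-E.coli-K12.txt.
seq="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

# match() sets RSTART to the position of the first nucleotide of the -10 box;
# the 12th nucleotide after that one sits at RSTART + 12.
tss=$(printf '%s\n' "$seq" |
  awk '{ if (match($0, /[ct]at[at]at/)) print substr($0, RSTART + 12, 1) }')
printf '%s\n' "$tss"
```

On this fragment the result is g, the same nucleotide that ends up inside the tss tag above.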
