Difference between revisions of "Anuvarsh Week 4"
Revision as of 21:02, 27 September 2015
Transcription and Translation "Taken to the Next Level"
Before anything else, I logged into my account using:
ssh avarshne@my.cs.lmu.edu
I then entered my password and changed into the directory where I had copied infA-E.coli-K12.txt from Dondi's library:
cd biodb2015
Modify the gene sequence string so that it highlights or “tags” the special sequences within this gene
To complete this task, I reviewed the Introduction to the Command Line page and looked over the More Text Processing Features page. My partner Ron Legaspi and I were then led through the first couple of steps of the homework in class. In particular, we learned how to add the -35 box and -10 box tags. We first searched infA-E.coli-K12.txt for all instances of the -35 sequence, which was provided to us as a hint on the homework assignment, using the following command:
grep "tt[gt]ac[at]" infA-E.coli-K12.txt
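As a side sketch (not one of the in-class steps): grep prints every line that contains a match, so the matches themselves can be hard to pick out. Assuming a grep that supports the `-o` option, and using a short excerpt copied from the sequence shown in this journal rather than the full file, the matches can be listed one per line and counted. The variable names `excerpt`, `matches`, and `count` are only illustrative.

```shell
# Short excerpt of the promoter region, copied from the sequence in this journal.
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaatt"

# -o prints each match on its own line instead of the whole matching line.
matches=$(printf '%s\n' "$excerpt" | grep -o "tt[gt]ac[at]")
printf '%s\n' "$matches"

# Count the candidate -35 boxes (one match per line).
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "candidate -35 boxes: $count"
```

On this excerpt the two candidates are `tttact` and `tttaca`, which agrees with the two instances we saw in class.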
When we ran this search, we noticed that there were 2 instances of this pattern with only two nucleotides between them. Because we understood that the -10 box must occur after the -35 box, we searched for the -10 box sequence at the same time as the -35 box. For this, we could not use grep the way we had been using it, because it searched for only one sequence at a time. In order to locate both sequences relative to each other, we ran the following command:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ ***&*** /g" | sed "s/[ct]at[at]at/ ***&*** /g"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt ***tataat*** tgcggtcgcagagttggttacgctca
ttaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcgg
cttagccgtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc ***tttact*** ta ***tttaca***
gaacttcgg ***cattat*** cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctc
cggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaagg
ccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggccttttt
attttat
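The highlighting step can be tried on a small input. The sketch below assumes a short excerpt copied from the sequence above (not the full file) and shows that the two piped sed processes give the same result as a single sed process running both substitutions in order with `-e`. The variable names are illustrative only.

```shell
# Short excerpt of the promoter region, copied from the sequence in this journal.
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaatt"

# Two sed processes chained with a pipe, as in the journal.
piped=$(printf '%s\n' "$excerpt" \
  | sed "s/tt[gt]ac[at]/ ***&*** /g" \
  | sed "s/[ct]at[at]at/ ***&*** /g")

# One sed process applying both substitutions in the same order via -e.
single=$(printf '%s\n' "$excerpt" \
  | sed -e "s/tt[gt]ac[at]/ ***&*** /g" -e "s/[ct]at[at]at/ ***&*** /g")

# In the replacement, & stands for whatever text the pattern matched.
printf '%s\n' "$single"
```

Either form marks both kinds of boxes in one pass over the text, which is what lets us see where they fall relative to each other.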
At this point it became very clear that the "real" -35 box was the first instance of its sequence, and the "real" -10 box was the second instance of its sequence, that is, the first instance after the -35 box. We began by tagging the -35 box. To replace just the first instance of a sequence using sed, we found that we only needed to replace the "g" in sed "s///g" with "1". The number tells sed to replace only that occurrence of the pattern on each line, so "1" replaces only the first instance. We found this information on the More Text Processing Features page. The resulting command was as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccg
ctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagcc
gtgtgttttcggagtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattg
aaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatct
ccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaa
ggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcc
tttttattttat
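To make the effect of the trailing flag concrete, here is a minimal sketch on a short excerpt copied from the sequence above (not the full file). It compares the `g` flag with the numeric flags `1` and `2`; the variable names are illustrative only.

```shell
# Short excerpt of the promoter region, copied from the sequence in this journal.
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaatt"

# g: tag every match on the line.
tag_all=$(printf '%s\n' "$excerpt" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /g")

# 1: tag only the first match on the line.
tag_first=$(printf '%s\n' "$excerpt" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1")

# 2: tag only the second match on the line.
tag_second=$(printf '%s\n' "$excerpt" | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /2")

printf '%s\n' "$tag_all" "$tag_first" "$tag_second"
```

With `1`, only `tttact` is wrapped in the tag and `tttaca` is left alone, which is the behavior we wanted for the -35 box.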
In order to tag the -10 box, Dondi gave us a hint: insert a line break after the -35 box tag. This helps tag the -10 box accurately because the "correct" -10 box should be found only a few nucleotides downstream of the -35 box. With the line break in place, we are able to begin our search for the "correct" -10 box at line 2. To insert the line break after the -35 box tag, we referred to the More Text Processing Features page, which indicated that we should use &\n as the replacement text. The resulting command was as follows:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccg
ataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcgga
gtaatgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcggcattatcttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgc
gcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgta
gtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
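One way to confirm that the line break really landed after the -35 box tag is to count lines. The sketch below assumes GNU sed, where `\n` in the replacement inserts a newline (other sed implementations, such as the one on macOS, may treat it differently), and uses a short one-line excerpt copied from the sequence above. The variable names are illustrative only.

```shell
# Short one-line excerpt of the promoter region, copied from the sequence in this journal.
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaatt"

# Tag the first -35 box, then break the line right after the closing tag.
# Assumption: GNU sed, which turns \n in the replacement into a newline.
split=$(printf '%s\n' "$excerpt" \
  | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" \
  | sed "s/<\/minus35box>/&\n/g")

# The one-line input should now be two lines.
line_count=$(printf '%s\n' "$split" | wc -l | tr -d ' ')
echo "lines: $line_count"

# Line 2 is everything downstream of the -35 box.
line_two=$(printf '%s\n' "$split" | sed -n '2p')
printf '%s\n' "$line_two"
```

Line 2 begins with a space because the tagging step adds one after the closing tag.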
At this point, all we needed to do was search for and replace the first instance of the -10 box after the line break. In class, we were given a hint that, in order to apply a search and replace only to the second line of a set of text, we should modify sed "s///g" to look like sed "2s///g". This led us to the following command:
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccgat
aaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagtaa
tgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcgg <minus10box>cattat</minus10box> cttgccggttcaaattacggtagtgataccccagaggattag
atggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgt
ggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacc
tgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaagagaaagaacgagtaaaaggtcggtttaaccggc
ctttttattttat
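The role of the leading `2` can be seen on a tiny two-line input. In the sketch below, the two lines are short fragments taken from the sequence above: one containing the upstream `tataat` and one containing `cattat`. They are put on separate lines by hand purely for illustration, and the variable names are hypothetical.

```shell
# Two short fragments from the sequence, placed on separate lines for illustration.
sample=$(printf '%s\n%s\n' "gtgttataattgcg" "cggcattatcttgcc")

# No address: the substitution runs on every line, so both fragments get tagged.
every_line=$(printf '%s\n' "$sample" | sed "s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")

# Address 2: the substitution runs only on line 2, so line 1 is left untouched.
line_two_only=$(printf '%s\n' "$sample" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1")

printf '%s\n' "$every_line"
printf '%s\n' "$line_two_only"
```

This is why the line break after the -35 box matters: it moves the upstream `tataat` onto line 1, where the `2s` command never looks.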
At this point, class had ended and Ron and I split up, so I continued working on this assignment by myself. The next step was to tag the transcription start site (TSS), with the hint that it is the 12th nucleotide after the first nucleotide in the -10 box. Because the -10 box is 6 nucleotides long and the TSS is 12 nucleotides *after* the first one, the remaining 5 nucleotides of the box account for 5 of those 12, so I knew the transcription start site would be the 7th nucleotide after the -10 box. In order to find this, I first inserted another line break 6 nucleotides after the -10 box tag, using the repetition shortcut as outlined in More Text Processing Features. I could then tag the first nucleotide in line 3 as the transcription start site.
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" | sed -r "s/<\/minus10box> (.){6}/&\n/g" | sed "3s/^./ <tss>&<\/tss> /g"
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgttataattgcggtcgcagagttggttacgctcattaccccgctgccga
taaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggagta
atgtgccgaacctgtttgttgcgatttagcgcgcaaatc <minus35box>tttact</minus35box>
tatttacagaacttcgg <minus10box>cattat</minus10box> cttgcc
<tss>g</tss> gttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgcaaggtaccgttcttgaaacgttgcctaat
accatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacggg
cgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcga
agagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
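As a cross-check on the counting, the TSS position can also be computed directly. The sketch below is not part of the assignment steps. It assumes a short excerpt copied from the sequence above that starts just before the -35 box, so the first `[ct]at[at]at` match in it is the real -10 box. It also assumes GNU sed for `\n` in the replacement, and it uses `-E` in place of `-r` for extended regular expressions. The variable names are illustrative only.

```shell
# Excerpt starting just upstream of the -35 box, copied from the sequence in this journal.
excerpt="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaatt"

# awk's match() sets RSTART (1-based start) and RLENGTH (length) of the first match.
# Counting 12 nucleotides on from the first nucleotide of the -10 box:
tss_by_start=$(printf '%s\n' "$excerpt" \
  | awk 'match($0, /[ct]at[at]at/) { print substr($0, RSTART + 12, 1) }')

# Counting 7 nucleotides on from the last nucleotide of the -10 box:
tss_by_end=$(printf '%s\n' "$excerpt" \
  | awk 'match($0, /[ct]at[at]at/) { print substr($0, RSTART + RLENGTH + 6, 1) }')

echo "TSS counted from box start: $tss_by_start"
echo "TSS counted from box end:   $tss_by_end"

# The same tagging pipeline as in the journal, run on the excerpt (GNU sed assumed).
tagged=$(printf '%s\n' "$excerpt" \
  | sed "s/tt[gt]ac[at]/ <minus35box>&<\/minus35box> /1" \
  | sed "s/<\/minus35box>/&\n/g" \
  | sed "2s/[ct]at[at]at/ <minus10box>&<\/minus10box> /1" \
  | sed -E "s/<\/minus10box> (.){6}/&\n/g" \
  | sed "3s/^./ <tss>&<\/tss> /g")
printf '%s\n' "$tagged"
```

Both ways of counting land on the same `g` that the pipeline tags at the start of line 3.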
Other Links
User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS