Blitvak Week 4
From LMU BioDB 2015
Revision as of 22:34, 27 September 2015 by Blitvak (Talk | contribs) (first segment of tss/rbs tagging)
Individual Journal Assignment Week 4
Finding and tagging the minus35box and the minus10box of the promoter
- I first found infA-E.coli-K12.txt by entering the correct directory:
cd ~dondi/xmlpipedb/data
- I copied the sequence kept in that file to an external location for future reference and checking.
- I assumed that the sequence is the mRNA-like strand and that it runs 5' to 3'.
- By reading the Week 4 Assignment Page, I found that the -10 box is generally [ct]at[at]at and that the -35 box is generally tt[gt]ac[at].
- I skimmed over the More Text Processing Features page and found that
sed "s/Title/<h1>&<\/h1>/g"
results in an output of <h1>Title</h1>; this command would be useful for tagging the sequence with its various parts.
- I then tried
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"
which gave me the output:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>ta<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
- I realized that there are two possibilities for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first instance of a "minus10box" is to be ignored).
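The candidate boxes can also be listed together with their positions, which makes the ordering argument above easy to check. This is only a minimal sketch, assuming GNU grep (for the -o and -b options); the variable name region is hypothetical and holds a short window copied from the sequence around the promoter, not the whole file, so the earlier tataat match does not appear in it.

```shell
# Short window around the promoter, copied from the untagged sequence.
# "region" is a hypothetical name; the real input is infA-E.coli-K12.txt.
region="caaatctttacttatttacagaacttcggcattatcttgccgg"

# -o prints only the matched text; -b prefixes each match with its
# 0-based byte offset in the input (GNU grep behaviour).
minus35=$(echo "$region" | grep -ob 'tt[gt]ac[at]')
minus10=$(echo "$region" | grep -ob '[ct]at[at]at')

echo "-35 candidates:"
echo "$minus35"
echo "-10 candidates:"
echo "$minus10"
```

The offsets are relative to the first base of the window. The gap between a -35 candidate and the -10 box is the -10 offset minus the end of the -35 box: 29 - (6 + 6) = 17 for tttact and 29 - (14 + 6) = 9 for tttaca.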
- I looked at the Week 4 Assignment Page and found that there is an ideal spacing of 17 base pairs between the -35 and -10 boxes. Only <minus35box>tttact</minus35box> fits this criterion (it is 17 bp away from <minus10box>cattat</minus10box>).
- In the More Text Processing Features page, I found that
sed "s/paragraph/&\n/g"
results in a line break right after the matched pattern (this would be useful in making the text more manageable for sed, since it is line-oriented). From that same page, I found that replacing the g in the s///g format with a number, n, makes sed replace only the nth match on each line. Some of the patterns that correspond to specific parts of the DNA strand match more than once, so it is useful to limit sed to a single replacement by targeting only the nth match.
- I realized that adding a number before the s in the s///g format limits sed to that line (e.g., 2s///g makes replacements only on the 2nd line). It should be useful to insert several line breaks to make matching easier; later, these line breaks will have to be removed.
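The three sed features described above can be tried on toy input before applying them to the sequence. This is a small sketch, assuming GNU sed (where \n in the replacement inserts a line break); the input strings and variable names are made up for illustration and are not part of the assignment data.

```shell
# A numeric flag replaces only that occurrence on each line.
second_only=$(echo "aXbXcX" | sed 's/X/<&>/2')

# A line number before the s restricts the command to that line.
line_two=$(printf 'X\nX\nX\n' | sed '2s/X/<&>/')

# \n in the replacement (GNU sed) breaks the line right after the match.
split=$(echo "aXb" | sed 's/X/&\n/')

echo "$second_only"
echo "$line_two"
echo "$split"
```

Because each sed in a pipeline reads the previous command's output, a line break added by one command becomes a real line boundary for the next, which is what makes the line-number addresses usable.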
- In an in-class work session, I learned that
sed -r "<line#>s/^.{n}/<replacement>/g"
limits sed to the first n characters of a line.
- I executed
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g"
(It took me some time to realize that the -r was necessary for this command to work! It enables extended regular expressions, so that {17} is read as a repetition count.) In building this command, I added a line break after the minus35box and limited sed to the first match (since I realized that the first match for the minus35box is, in fact, the correct one). Since the minus10box is 17 bp away from the minus35box, I added the opening <minus10box> tag after the first 17 characters of the second line, followed by another line break. On the third line, I exploited the fact that the minus10box is 6 bp long to write a sed command that adds the closing tag, </minus10box>, right after the actual minus10box.
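To see how the line breaks and line addresses interact, the same three sed commands can be run on a short window of the sequence instead of the whole file. This is a sketch under stated assumptions: GNU sed, a hypothetical variable region holding bases copied from around the promoter, and tr as one possible way to remove the helper line breaks afterwards (the journal does not say which tool was used for that).

```shell
# Short window around the promoter ("region" is a hypothetical name).
region="caaatctttacttatttacagaacttcggcattatcttgccgg"

tagged=$(echo "$region" \
  | sed 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1' \
  | sed -r '2s/^.{17}/&<minus10box>\n/g' \
  | sed -r '3s/^.{6}/&<\/minus10box>\n/g')

# Four lines: -35 box, 17-base spacer, -10 box, remainder.
echo "$tagged"

# Assumption: tr -d removes the helper line breaks again.
joined=$(echo "$tagged" | tr -d '\n')
echo "$joined"
```

Without -r, the same interval would have to be written as .\{17\}, since basic regular expressions treat bare braces as literal characters.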
Output so far
Finding and tagging the tss and the ribosome binding site
- From looking over the Week 4 Assignment Page, I learned that the tss is located 12 bp away from the start of the minus10box. I also learned that the ribosome binding site will have a sequence of gagg.
- I decided to first find the tss, since it is located just after the minus10box. I would have to create a sed command that works on the fourth line (since the base that corresponds to the tss will be somewhere on that line). I came up with
sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g"
which is appended to the set of sed commands that I created in the previous section.
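Chained onto the earlier commands, the two tss steps can be checked on a short window of the sequence. As before, this is only a sketch assuming GNU sed; region is a hypothetical variable holding bases copied from around the promoter, not the full file.

```shell
# Window extended a few bases past the -10 box ("region" is hypothetical).
region="caaatctttacttatttacagaacttcggcattatcttgccggttcaaattacgg"

tagged=$(echo "$region" \
  | sed 's/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1' \
  | sed -r '2s/^.{17}/&<minus10box>\n/g' \
  | sed -r '3s/^.{6}/&<\/minus10box>\n/g' \
  | sed -r '4s/^.{5}/&<tss>\n/g' \
  | sed -r '5s/^.{1}/&<\/tss>\n/g')

# Line 4 holds the 5 skipped bases plus <tss>; line 5 holds the tagged base.
echo "$tagged"
```

Line 4 begins right after the -10 box, so skipping 5 bases there places the tag on the 12th base counted from the first base of the -10 box (6 bases of the box, 5 skipped, then the tagged base).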