Blitvak Week 4

Individual Journal Assignment Week 4

Finding and tagging the minus35box and the minus10box of the promoter

I first located infA-E.coli-K12.txt by changing into the directory that holds it: cd ~dondi/xmlpipedb/data.
I copied the sequence in that file to a separate location for future reference and checking.
I assumed that the sequence is the mRNA-like strand and that it runs from 5' to 3'.
By reading the Week 4 Assignment Page, I found that the -10 box generally matches [ct]at[at]at, and that the -35 box generally matches tt[gt]ac[at].
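As a quick illustration of what these two patterns pick up, here is a minimal sketch that lists the matches in a short fragment. The fragment is copied from the tagged output recorded in this journal (with the tags removed); the variable names and the use of grep -o are assumptions added for illustration and were not part of the original session.

```shell
# Fragment taken from the sequence shown in this journal (tags stripped).
fragment="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttc"

# grep -o prints each non-overlapping match on its own line.
minus35_hits=$(echo "$fragment" | grep -o "tt[gt]ac[at]")
minus10_hits=$(echo "$fragment" | grep -o "[ct]at[at]at")

echo "-35 candidates:"
echo "$minus35_hits"
echo "-10 candidates:"
echo "$minus10_hits"
```

On this fragment the -35 pattern matches twice (tttact and tttaca) and the -10 pattern matches once (cattat).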
I skimmed the More Text Processing Features page and found that sed "s/Title/<h1>&<\/h1>/g" produces <h1>Title</h1>: the & in the replacement stands for whatever the pattern matched. This command would be useful for tagging the sequence with its various parts.
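The same & trick carries over directly to sequence tagging. A minimal sketch, using a made-up 11-base input (not taken from the infA file) that contains one -35-like match:

```shell
# Made-up toy sequence containing a single match of tt[gt]ac[at] (ttgaca).
toy="ccattgacagg"

# & is replaced by the matched text, so the tags wrap the match itself.
tagged=$(echo "$toy" | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g")
echo "$tagged"
```

This prints cca<minus35box>ttgaca</minus35box>gg.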
I then tried cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g".
- This gave me the output:

ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag
taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>ta<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc
aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa
gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat

I realized that there are two candidates for the minus35box and only one valid candidate for the minus10box: since the minus10box must come after the minus35box, the first tagged "minus10box" (tataat, which lies upstream of both minus35box candidates) can be ignored.
I looked at the Week 4 Assignment Page and found that the ideal spacing between the -35 and -10 boxes is 17 base pairs. Only <minus35box>tttact</minus35box> fits this criterion (it is 17 bp away from <minus10box>cattat</minus10box>).
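The 17 bp spacing can be double-checked by counting the characters between the two boxes. A minimal sketch, assuming a fragment that starts at the first -35 candidate and ends at the -10 box (copied from the output above); the variable names are made up for illustration.

```shell
# From the start of tttact to the end of cattat, as it appears in the sequence.
fragment="tttacttatttacagaacttcggcattat"

# Strip the two boxes and keep only what lies between them.
spacer=$(echo "$fragment" | sed -r 's/^tttact(.*)cattat$/\1/')

echo "$spacer"
echo "${#spacer}"   # length of the spacer in bases
```

The spacer is tatttacagaacttcgg, which is 17 bases long. The second candidate, tttaca, sits inside this spacer, so it is much closer than 17 bp to the -10 box.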
On the More Text Processing Features page, I found that sed "s/paragraph/&\n/g" inserts a line break right after each match; this is useful for splitting the text into pieces that sed, which is line-oriented, can handle more easily. On the same page, I found that replacing the g in the s///g format with a number, n, makes sed replace only the nth match on each line. Some of the patterns for specific parts of the DNA strand match more than once, so limiting sed to the nth match is a handy way to tag only the intended one.
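A minimal sketch of both features on a made-up input with three matches of "at". It assumes GNU sed, which interprets \n in the replacement as a line break.

```shell
# Made-up input: "at" occurs three times.
toy="gatcatgat"

# A numeric flag replaces only that occurrence on the line (here the 2nd).
only_second=$(echo "$toy" | sed "s/at/[&]/2")
echo "$only_second"

# \n in the replacement breaks the line right after the first match.
split=$(echo "$toy" | sed "s/at/&\n/1")
echo "$split"
```

The first command prints gatc[at]gat; the second prints gat on one line and catgat on the next.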
I realized that adding a number before the s in the s///g format limits sed to that line (e.g., 2s///g makes replacements only on the 2nd line). Inserting several line breaks should therefore make the matching easier; these line breaks will have to be removed later.
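A minimal sketch of a line address, using three identical made-up lines so that the effect is easy to see:

```shell
# Three identical lines; only line 2 should change.
line_two_only=$(printf 'gatc\ngatc\ngatc\n' | sed "2s/a/A/g")
echo "$line_two_only"
```

Only the second line becomes gAtc; the first and third lines stay gatc.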
In an in-class work session, I learned that sed -r "<line#>s/^.{n}/<replacement>/g" makes the substitution match exactly the first n characters of the given line.
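A minimal sketch of the ^.{n} idea on a made-up 12-base line, assuming GNU sed. With -r (extended regular expressions) the braces act as a repeat count; in sed's default basic syntax the same count has to be written with escaped braces, which is shown for comparison.

```shell
toy="acgtacgtacgt"

# Extended syntax (-r): .{4} means "any 4 characters".
with_r=$(echo "$toy" | sed -r "1s/^.{4}/&|/")
echo "$with_r"

# Basic syntax (no -r): the braces must be escaped to mean a repeat count.
without_r=$(echo "$toy" | sed '1s/^.\{4\}/&|/')
echo "$without_r"
```

Both commands print acgt|acgtacgt, with the marker placed after the first four characters.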
I executed cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" | sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g" (it took me some time to realize that -r was necessary for this command to work; without it, sed does not treat {17} as a repeat count). The first sed tags the minus35box, limits itself to the first match (since I had realized that the first match is, in fact, the correct one), and adds a line break after the closing tag. Since the minus10box is 17 bp away from the minus35box, the second sed adds the <minus10box> opening tag after the first 17 characters of the second line, followed by another line break. On the third line, I exploited the fact that the minus10box is 6 bp long: the third sed adds the closing tag, </minus10box>, after the first 6 characters.
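To make the effect of each stage visible, here is a minimal sketch that runs the same three sed commands on a short fragment instead of the whole file. The fragment is copied from the sequence shown earlier (tags removed) and is assumed to arrive as a single line; GNU sed is assumed for \n and -r.

```shell
# Short single-line fragment around the promoter, from the sequence above.
fragment="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttc"

promoter=$(echo "$fragment" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/g" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/g")

# Each stage adds one line break, so the result has four lines.
echo "$promoter"
```

The four output lines are gcgcaaatc<minus35box>tttact</minus35box>, then tatttacagaacttcgg<minus10box>, then cattat</minus10box>, and finally the untagged remainder cttgccggttc. Each sed in the pipe sees the line breaks created by the previous one, which is why the addresses 2 and 3 work.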

Output so far

Finding and tagging the tss and the ribosome binding site

From looking over the Week 4 Assignment Page, I learned that the tss is located 12 bp from the start of the minus10box. I also learned that the ribosome binding site has the sequence gagg.
I decided to find the tss first, since it is located just after the minus10box. The sed command has to work on the fourth line, since the base that corresponds to the tss is on that line. The minus10box itself takes up 6 of the 12 positions, so I skip the first 5 characters of the fourth line and tag the next one. I came up with sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g", which I appended to the pipeline of sed commands from the previous section.
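A minimal sketch of the extended pipeline on a short fragment from the sequence above (tags removed, assumed to arrive as a single line, GNU sed assumed). The final tr -d '\n' step is one possible way to remove the temporary line breaks; it is an assumption added here and is not a command taken from the original session.

```shell
fragment="gcgcaaatctttacttatttacagaacttcggcattatcttgccggttc"

tagged=$(echo "$fragment" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/g" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/g" \
  | sed -r "4s/^.{5}/&<tss>\n/g" \
  | sed -r "5s/^.{1}/&<\/tss>\n/g" \
  | tr -d '\n')   # assumed cleanup step: join everything back into one line

echo "$tagged"
```

The joined result is gcgcaaatc<minus35box>tttact</minus35box>tatttacagaacttcgg<minus10box>cattat</minus10box>cttgc<tss>c</tss>ggttc. Counting from the c that starts cattat, the tagged base is the 12th.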
