Blitvak Week 4
From LMU BioDB 2015
Revision as of 22:42, 27 September 2015 by Blitvak (Talk | contribs) (finished segment regarding tss)
Individual Journal Assignment Week 4
Finding and tagging the minus35box and the minus10box of the promoter
- I first located infA-E.coli-K12.txt by entering the directory that contains it:
cd ~dondi/xmlpipedb/data
- I copied the sequence stored in that file to a separate location for future reference and checking.
- I assumed that the sequence is the mRNA-like strand and that it runs from 5' to 3'.
- By reading the Week 4 Assignment Page, I found that the -10 box is generally [ct]at[at]at and that the -35 box is generally tt[gt]ac[at].
- I skimmed over the More Text Processing Features page, and I found that
sed "s/Title/<h1>&<\/h1>/g"
results in an output of <h1>Title</h1>; this command would be useful for tagging the sequence with its various parts.
- I then tried
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g"
- This gave me the output:
ttttcaccacaagaatgaatgttttcggcacatttctccccagagtgt<minus10box>tataat</minus10box>tgcggtcgcagagttggttacgctcattaccccgctgccgataaggaatttttcgcgtcaggtaacgcccatcgtttatctcaccgctcccttatacgttgcgcttttggtgcggcttagccgtgtgttttcggag taatgtgccgaacctgtttgttgcgatttagcgcgcaaatc<minus35box>tttact</minus35box>ta<minus35box>tttaca</minus35box>gaacttcgg<minus10box>cattat</minus10box>cttgccggttcaaattacggtagtgataccccagaggattagatggccaaagaagacaatattgaaatgc aaggtaccgttcttgaaacgttgcctaataccatgttccgcgtagagttagaaaacggtcacgtggttactgcacacatctccggtaaaatgcgcaaaaactacatccgcatcctgacgggcgacaaagtgactgttgaactgaccccgtacgacctgagcaaaggccgcattgtcttccgtagtcgctgattgttttaccgcctgatgggcgaa gagaaagaacgagtaaaaggtcggtttaaccggcctttttattttat
- I realized that there are two possibilities for the minus35box and one for the minus10box (since the minus10box must come after the minus35box, the first instance of a "minus10box" is to be ignored).
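The tagging step above can be tried on a short piece of the sequence. This is only a sketch: the fragment is copied from the output above and assumed to sit on a single line, the variable names are made up for illustration, and `grep -o` is used just to list the candidates.

```shell
# Fragment of the infA sequence taken from the output above (illustrative subset).
fragment="gcgcgcaaatctttacttatttacagaacttcggcattatcttgccgg"

# Tag every -35 and -10 candidate with the same two sed commands as above.
tagged=$(echo "$fragment" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/g" \
  | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/g")
echo "$tagged"

# Print each -35 candidate on its own line to make the ambiguity visible.
candidates=$(echo "$tagged" | grep -o "<minus35box>[acgt]*</minus35box>")
echo "$candidates"
```

On this fragment, the two -35 candidates (tttact and tttaca) are listed, and the single -10 box that follows them (cattat) is tagged.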
- I looked at the Week 4 Assignment Page and found that the ideal spacing between the -35 box and the -10 box is 17 base pairs. Only
<minus35box>tttact</minus35box>
fits this criterion (it is 17 bp away from <minus10box>cattat</minus10box>).
- In the More Text Processing Features page, I found that
sed "s/paragraph/&\n/g"
results in a line break right after the matched pattern; this is useful for making the text more manageable for sed, which is line-oriented. From the same page, I found that replacing the g in the s///g format with a number, n, makes sed replace only the nth match on each line. Some of the patterns that correspond to specific parts of the DNA strand match more than once, so limiting sed to the nth match is a useful way to make a single replacement.
- I realized that adding a number before the s in the s///g format limits sed to that line (e.g., executing 2s///g makes replacements only on the 2nd line). Inserting several line breaks should therefore make matching easier; these line breaks will have to be removed later.
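The three sed features described above can be compared side by side on a toy input. This is a sketch: the string atgatcat and the variable names are hypothetical, and the last command assumes GNU sed, which turns \n in the replacement into a line break.

```shell
# Hypothetical toy input: the same short string on two lines.
toy=$(printf 'atgatcat\natgatcat')

# A number in place of g: replace only the 2nd match on each line.
nth_match=$(printf '%s\n' "$toy" | sed "s/at/[&]/2")
printf '%s\n' "$nth_match"

# A number before s: replace every match, but only on line 2.
one_line=$(printf '%s\n' "$toy" | sed "2s/at/[&]/g")
printf '%s\n' "$one_line"

# \n in the replacement (GNU sed): break the line right after the match.
split=$(printf '%s\n' "atgatcat" | sed "s/gat/&\n/")
printf '%s\n' "$split"
```

The first command changes both lines to atg[at]cat, the second leaves line 1 alone and brackets all three matches on line 2, and the third splits the string into atgat and cat.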
- In an in-class work session, I learned that
sed -r "<line#>s/^.{n}/<replacement>/g"
limits the replacement to the first n characters of the given line.
- I executed
cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1"| sed -r "2s/^.{17}/&<minus10box>\n/g" | sed -r "3s/^.{6}/&<\/minus10box>\n/g"
(it took me some time to realize that -r was necessary for this command to work: without it, sed does not interpret .{17} as a repetition count). In composing this command, I added a line break after the minus35box and limited sed to the first match (since I had realized that the first match for the minus35box is, in fact, the correct one). Since the minus10box is 17 bp away from the minus35box, I added the opening <minus10box> tag after the first 17 characters of the second line, followed by another line break. On the third line, I exploited the fact that the minus10box is 6 bp long to write a sed command that adds the closing tag, </minus10box>, right after the minus10box itself.
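The line-break approach can be traced on a short fragment. This is a sketch under some assumptions: the fragment is copied from the earlier output and placed on one line, the variable names are made up, and GNU sed is required for both -r and \n in the replacement.

```shell
# Fragment of the infA sequence around the promoter (illustrative subset).
fragment="gcgcgcaaatctttacttatttacagaacttcggcattatcttgccgg"

# Step 1: tag the first -35 match and break the line after it.
# Step 2: on line 2, skip 17 characters, open the -10 tag, break again.
# Step 3: on line 3, skip the 6-character box and close the -10 tag.
promoter=$(echo "$fragment" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/g" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/g")
printf '%s\n' "$promoter"

# Cross-check the spacing: line 2 without its tag should be 17 characters long.
spacer=$(printf '%s\n' "$promoter" | sed -n '2p' | sed 's/<minus10box>//')
echo "${#spacer}"
```

The result is four lines: the tagged -35 box, the 17-character spacer followed by the opening -10 tag, the box cattat with its closing tag, and the remainder of the fragment.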
Finding and tagging the tss and the ribosome binding site
- From looking over the Week 4 Assignment Page, I learned that the tss is located 12 bp away from the start of the minus10box. I also learned that the ribosome binding site has the sequence gagg.
- I decided to first find the tss, since it is located just after the minus10box. This requires a sed command that operates on the fourth line (since the base that corresponds to the tss is somewhere on that line). I came up with
sed -r "4s/^.{5}/&<tss>\n/g" | sed -r "5s/^.{1}/&<\/tss>\n/g"
which I appended to the set of sed commands created in the previous section. Since the minus10box is 6 bp long and the tss is the 12th nucleotide counting from the start of the box, the tss is the 6th character of the fourth line. I therefore placed the opening <tss> tag after the first 5 characters of the fourth line, followed by a line break, and then used sed to add </tss> after the first character of the next line. The output of these commands tags the 12th nucleotide from the beginning of the minus10box as the transcription start site.
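The full chain of five sed commands can be checked on a fragment. This is a sketch: the fragment is copied from the earlier output and assumed to be on one line, the variable names are hypothetical, GNU sed is assumed, and joining the lines with tr is only one possible way to remove the line breaks afterwards.

```shell
# Fragment of the infA sequence from upstream of the -35 box to past the tss.
fragment="gcgcgcaaatctttacttatttacagaacttcggcattatcttgccggttcaaattacggtag"

# Promoter tagging from the previous section, followed by the two tss commands.
tagged=$(echo "$fragment" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>\n/1" \
  | sed -r "2s/^.{17}/&<minus10box>\n/g" \
  | sed -r "3s/^.{6}/&<\/minus10box>\n/g" \
  | sed -r "4s/^.{5}/&<tss>\n/g" \
  | sed -r "5s/^.{1}/&<\/tss>\n/g")
printf '%s\n' "$tagged"

# One possible way to remove the temporary line breaks again.
joined=$(printf '%s\n' "$tagged" | tr -d '\n')
echo "$joined"

# Independent check: the 12th base counting from the start of the -10 box.
tss_base=$(echo "$fragment" | sed "s/.*cattat/cattat/" | cut -c12)
echo "$tss_base"
```

Both routes agree on this fragment: the base between <tss> and </tss> is the same c that cut reports as the 12th nucleotide from the start of cattat (6 for the box, 5 skipped, then the tss itself).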