Kzebrows Week 4

From LMU BioDB 2015
Revision as of 02:08, 29 September 2015 by Kzebrows (Talk | contribs) (Transcription start site.)


Transcription and Translation "Taken to the Next Level"

To start this assignment I began by opening Terminal on my laptop. I entered

ssh kzebrows@my.cs.lmu.edu 

followed by my password to log into the LMU CMSI database. As I usually do, I entered the following commands in order to enter Dr. Dionisio's directory, list the files in the directory, and choose the appropriate file for this assignment:

cd ~dondi/xmlpipedb/data && ls && cat infA-E.coli-K12.txt

This took me to the E. coli file and displayed its nucleotide sequence. To complete this assignment I frequently used this page as a resource.

I began by using grep to find the potential -35 box and -10 box because grep highlights the searched pattern in red. I simply entered

cat infA-E.coli-K12.txt | grep "tt[gt]ac[at]"

which gave me two possible answers for the -35 box, tttact and tttaca, both of which fit the pattern. Now it was a matter of finding out which one was the correct one. I also searched for the -10 box using

cat infA-E.coli-K12.txt | grep "[ct]at[at]at"

which also revealed two potential sites at tataat and cattat. I realized that in order to find out which sequences were the correct ones I needed to see them both together in the sequence, but grep doesn't do this, so instead I used sed. I entered the sed commands as a pipe and added three spaces on either side of each occurrence of the consensus sequences (both -35 and -10) in the file to make the sequences more visible. This is done with sed "s/<pattern>/   &   /g", where <pattern> is what I wish to find, "&" stands for the text that matched, and the spaces on either side of the "&" are what I wished to add to each side of the match (instructions found here). The pipe looked like this:

cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/   &   /g" | sed "s/tt[gt]ac[at]/   &   /g"

This made it clear that it was the first -35 box option, tttact, and the second -10 box option, cattat, that I was looking for in this gene. Using this information, it was then much simpler for me to highlight the specific sequences for the assignment.
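The two searches and the spacing trick can be tried out without access to the course server. The sketch below is only an illustration: the sequence in seq is a made-up toy string (not the real infA sequence), arranged so that it contains the same four candidates, and grep -o is used to print each match on its own line instead of highlighting it in red.

```shell
# Hypothetical toy sequence for illustration only (NOT the real infA sequence).
seq="gctataatcgtttactgcgcgatcgcgcgacgccattatccgcgagtcctttacagg"

# grep -o prints every match on its own line.
minus35_hits=$(printf '%s\n' "$seq" | grep -o "tt[gt]ac[at]")
minus10_hits=$(printf '%s\n' "$seq" | grep -o "[ct]at[at]at")

# sed pads every match with three spaces; & stands for the matched text.
spaced=$(printf '%s\n' "$seq" \
  | sed "s/[ct]at[at]at/   &   /g" \
  | sed "s/tt[gt]ac[at]/   &   /g")

printf '%s\n' "$minus35_hits" "$minus10_hits" "$spaced"
```

In the padded output all four candidates stand apart from the rest of the sequence, which makes it easier to judge which -35 and -10 candidates are a plausible distance from each other.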

To highlight the -35 box, I needed to use sed to put <minus35box> tags on each side of the first option, along with three spaces. I consulted the Text Processing page of the wiki and found that I can replace g with the number of the occurrence I wish to change. Because I only needed the first option to be highlighted (tttact), the command looked like this:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" 

Next, to highlight the -10 box, I did the same thing except my goal was to add <minus10box> to each side of the second -10 box option. The command looked like this:

cat infA-E.coli-K12.txt | sed "s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /2"

This highlighted the -10 box, cattat.
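The effect of the number at the end of the s command is easy to confirm on a small input. This is a minimal sketch on a made-up toy sequence (not the real infA sequence); the three padding spaces are left out here so the result is easier to read.

```shell
# Hypothetical toy sequence for illustration only (NOT the real infA sequence).
seq="gctataatcgtttactgcgcgatcgcgcgacgccattatccgcgagtcctttacagg"

# /1 tags only the first match of the -35 pattern (tttact, not tttaca).
first35=$(printf '%s\n' "$seq" | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1")

# /2 tags only the second match of the -10 pattern (cattat, not tataat).
second10=$(printf '%s\n' "$seq" | sed "s/[ct]at[at]at/<minus10box>&<\/minus10box>/2")

printf '%s\n' "$first35" "$second10"
```

The slash in each closing tag has to be written as \/ because an unescaped / would end the replacement part of the s command.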

In order to find the transcription start site, I learned from the assignment page that the site is located at the 12th nucleotide, counting from the first nucleotide of the -10 box. Because the box is six nucleotides long, this means that the start of transcription is the sixth nucleotide after cattat. To find this, I broke up the sequence by inserting a new line right after the -35 box, so that later commands could work on the second line only. In the "picking lines" section of More Text Processing Features, I found that to do this I had to replace sed s///g with sed 2s///g. The command looked like this:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1"

I noted that the flag after the -10 box pattern should be /1, not /2: since the substitution now only looks at the line after the -35 box, cattat is the first occurrence of [ct]at[at]at there.
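Here is a small sketch of the same split-then-address idea on a made-up toy sequence (not the real infA sequence). It assumes GNU sed, which turns \n in the replacement into a real line break; other sed versions may not. The padding spaces are again left out for readability.

```shell
# Hypothetical toy sequence for illustration only (NOT the real infA sequence).
seq="gctataatcgtttactgcgcgatcgcgcgacgccattatccgcgagtcctttacagg"

tagged=$(printf '%s\n' "$seq" \
  | sed "s/tt[gt]ac[at]/<minus35box>&<\/minus35box>/1" \
  | sed "s/<\/minus35box>/&\n/g" \
  | sed "2s/[ct]at[at]at/<minus10box>&<\/minus10box>/1")

# Line 1 ends with the -35 box; line 2 holds everything after it.
line1=$(printf '%s\n' "$tagged" | sed -n '1p')
line2=$(printf '%s\n' "$tagged" | sed -n '2p')
printf '%s\n' "$line1" "$line2"
```

Because the 2 in front of the s restricts the substitution to the second line, the tataat that sits before the -35 box on line 1 is never considered, and /1 picks cattat.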

My next goal was to find a command that would allow me to skip over 5 more nucleotides to the transcription start site <tss>...</tss> on the 6th nucleotide after the -10 box. I did this by adding the command

sed -r "s/<\/minus10box> (.){5}/&\n/g"

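This indicated that I meant to skip over 5 characters (the number in the curly braces). The -r option turns on extended regular expressions, which is what lets the parentheses and curly braces be written without backslashes.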
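To see exactly which characters a pattern like this consumes, it can be run on a short fragment with a visible marker in place of the newline. The fragment below is made up for illustration (a tagged -10 box, three padding spaces, and ten toy nucleotides), and the | marker is just a stand-in so that everything stays on one line. The second command is the same pattern written without -r, which needs backslashes before the parentheses and braces.

```shell
# Hypothetical fragment for illustration only (NOT the real infA sequence).
frag="<minus10box>cattat</minus10box>   ccgcgagtcc"

# Extended syntax (-r): ( ) and { } are written without backslashes.
ere=$(printf '%s\n' "$frag" | sed -r "s/<\/minus10box> (.){5}/&|/")

# Basic syntax (no -r): the same pattern needs \( \) and \{ \}.
bre=$(printf '%s\n' "$frag" | sed "s/<\/minus10box> \(.\)\{5\}/&|/")

printf '%s\n' "$ere" "$bre"
```

Both commands place the marker after ccg. The pattern contains one literal space, and the dot matches any character, so two of the five repetitions are used up by the remaining padding spaces rather than by nucleotides.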

This had the new line starting at the 10th nucleotide (counting from the first nucleotide of the -10 box), not the 12th. I realized that this was because I had added three spaces after </minus10box>: the pattern only accounts for one of them, so the other two were matched by (.) as if they were nucleotides. To fix this, I put {7} in the curly braces instead of {5}, which gave me a newline at the right nucleotide (the 12th one). Then, to highlight the transcription start site I added

sed "3s/^./<tss>&<\/tss> /g" 

to tell the computer that I wished to add <tss> labels around the first character in the third line. The command looked like this:

cat infA-E.coli-K12.txt | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" | sed "s/<\/minus35box>/&\n/g" | sed "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" | sed -r "s/<\/minus10box> (.){7}/&\n/g" | sed "3s/^./<tss>&<\/tss> /g"
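One way to double-check the final pipe is to run it on a sequence where the answer can be counted by hand. The sketch below does this on a made-up toy sequence (not the real infA sequence) and assumes GNU sed. It then pulls out the tagged nucleotide and compares it with the 12th character counted from the start of cattat, found independently with cut.

```shell
# Hypothetical toy sequence for illustration only (NOT the real infA sequence).
seq="gctataatcgtttactgcgcgatcgcgcgacgccattatccgcgagtcctttacagg"

result=$(printf '%s\n' "$seq" \
  | sed "s/tt[gt]ac[at]/   <minus35box>&<\/minus35box>   /1" \
  | sed "s/<\/minus35box>/&\n/g" \
  | sed "2s/[ct]at[at]at/   <minus10box>&<\/minus10box>   /1" \
  | sed -r "s/<\/minus10box> (.){7}/&\n/g" \
  | sed "3s/^./<tss>&<\/tss> /g")

# The nucleotide that ended up inside the <tss> tags on line 3.
line3=$(printf '%s\n' "$result" | sed -n '3p')
tss=$(printf '%s\n' "$line3" | sed -r "s/^<tss>(.)<\/tss>.*/\1/")

# Independent count: cattat plus the next six characters, then character 12.
expected=$(printf '%s\n' "$seq" | sed -r "s/.*(cattat.{6}).*/\1/" | cut -c12)

printf 'tagged: %s, counted by position: %s\n' "$tss" "$expected"
```

If the two values ever disagree, the most likely cause is the padding: the {7} only works when exactly three spaces follow </minus10box>.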