Jkuroda Week 3

For the first reading frame, I started by replacing the t's with u's and then simply ran the genetic-code.sed file to do the rest. This did not work: sed applies each rule in the file to the whole line, so a rule could match three bases that straddle two codons. I was left with a messy line of lonely nucleotides with amino acid abbreviations between them. After thinking about it for a second, I realized that I could solve this by separating each base triplet with a space. That fixed the translation, but a couple of stray nucleotides were left at the end of the line, so I added a final sed command to remove any leftover nucleotides.

+1
cat sequence file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
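The effect of the spacing step can be seen on a tiny input. This is a minimal sketch, not the real translation: the `translate` function is a hypothetical stand-in for genetic-code.sed with only three rules, and it assumes that file consists of `s/codon/letter/g` rules. The input `aatgct` is made up for illustration.

```shell
# Hypothetical stand-in for genetic-code.sed (rule format is an assumption).
translate() {
  sed -e "s/aug/M/g" -e "s/aau/N/g" -e "s/gcu/A/g"
}

# Made-up input whose codons in frame +1 are aau and gcu.
rna=$(printf '%s\n' "aatgct" | sed "s/t/u/g")

# Without spacing, the aug rule matches across the codon boundary (a|aug|cu).
unspaced=$(printf '%s\n' "$rna" | translate)

# With spacing, the rules can only see whole codons.
spaced=$(printf '%s\n' "$rna" | sed "s/.../& /g" | translate)

echo "unspaced: $unspaced"
echo "spaced:   $spaced"
```

In the unspaced run a single `M` appears between leftover bases, which looks like the "lonely nucleotides" symptom described above; the spaced run translates both codons.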

For the second reading frame, the process is the same with one small addition: a sed command that deletes the first character of the sequence before the triplets are split.

+2
cat sequence file | sed "s/^.//" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"

Similarly, for the third reading frame, I deleted the first two characters of the sequence instead of just one.

+3
cat sequence file | sed "s/^..//" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
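To see how deleting leading characters shifts the frame, the three forward frames can be compared side by side. This is only a sketch on a made-up 10-base sequence; it stops after the triplet split and leaves out the t-to-u and translation steps.

```shell
# Made-up example sequence, not the assignment's sequence file.
seq="atggcttaac"

# Frame +1: split as-is.
frame1=$(printf '%s\n' "$seq" | sed "s/.../& /g")
# Frame +2: drop the first base, then split.
frame2=$(printf '%s\n' "$seq" | sed "s/^.//" | sed "s/.../& /g")
# Frame +3: drop the first two bases, then split.
frame3=$(printf '%s\n' "$seq" | sed "s/^..//" | sed "s/.../& /g")

echo "+1: $frame1"   # atg gct taa c
echo "+2: $frame2"   # tgg ctt aac
echo "+3: $frame3"   # ggc tta ac
```

The trailing `c` in +1 and `ac` in +3 are the stray bases that the final `sed "s/[aucg]//g"` step removes.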

For the next three reading frames, I remembered that there was a handy rev command for reversing the characters in a sequence, so I placed that command before I did the usual sequence of commands.

-1
cat sequence file | rev | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"

Now that the reverse command is in place, the rest of the commands are the same as for the previous reading frames: delete the first character for -2 and the first two characters for -3.

-2
cat sequence file | rev | sed "s/^.//" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
    sed "s/[aucg]//g"

-3
cat sequence file | rev | sed "s/^..//" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
    sed "s/[aucg]//g"
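One caveat about the negative frames: they are normally read from the complementary strand, so reversing the characters alone may not be enough and a complement step is usually needed as well. The sketch below shows the difference on a made-up sequence. The use of sed's `y` command for the complement is my own choice here, not something taken from the commands above, and whether the assignment expected this step is an assumption.

```shell
# Made-up example sequence, not the assignment's sequence file.
seq="atggcttaac"

# Reverse only, as in the commands above.
reversed=$(printf '%s\n' "$seq" | rev)

# Reverse, then swap each base for its complement (a<->t, c<->g).
revcomp=$(printf '%s\n' "$seq" | rev | sed "y/acgt/tgca/")

echo "reversed:           $reversed"   # caattcggta
echo "reverse complement: $revcomp"    # gttaagccat
```

If the complement is wanted, the `sed "y/acgt/tgca/"` step would go right after `rev` and before the t-to-u replacement.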

Before tackling this next set of questions, I read the wiki page on using the XMLPipeDB Match Utility so that I was doing the assignment correctly.

1.

java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" <  493.P_falciparum.xml

There are three unique matches.

go:0005 - 1371 occurrences
go:0006 - 1100 occurrences
go:0007 - 113 occurrences
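A rough cross-check of this kind of tally is possible with standard tools. This is a sketch under assumptions: it relies on the `-o` option of grep (available in GNU and BSD grep), and it runs on a few invented sample lines rather than on 493.P_falciparum.xml, so the numbers it prints are not the ones above.

```shell
# Invented sample lines standing in for the XML file.
tmp=$(mktemp)
printf '%s\n' 'id="GO:0005" id="GO:0005"' 'id="GO:0006"' 'id="GO:0009"' > "$tmp"

# One line per match, then count how often each distinct match appears.
grep -o "GO:000[567]" "$tmp" | sort | uniq -c

# Number of distinct matches and total number of matches.
distinct=$(grep -o "GO:000[567]" "$tmp" | sort -u | wc -l | tr -d ' ')
total=$(grep -o "GO:000[567]" "$tmp" | wc -l | tr -d ' ')
echo "distinct: $distinct, total: $total"

rm -f "$tmp"
```

Pointing the same pipeline at the real XML file should give per-pattern counts that can be compared with what Match reports.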

I spent a while trying to find an occurrence of "in situ," but I was unable to find it on my first attempt, so I skipped this problem. Later on I realized that "in situ" was a fancy way of telling us to find an original occurrence in the file, so I looked up an arbitrary one, namely GO:0007, using grep.

2.

I found original occurrences of GO:0007 using the grep command, and found that "dbReference" was a common prefix to this information. This made me think that these patterns most likely represent locations in some kind of database.

I simply entered the pattern that needed to be found, and I messed around with the pattern to find out what this information represented.

3.

java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\"  <  493.P_falciparum.xml

There are three unique matches.

"yu b." - 1 occurrence
"yu m." - 1 occurrence
"yu k." - 228 occurrences

I think that this pattern represents the names of people. I used this command to come to this conclusion:

java -jar xmlpipedb-match-1.1.1.jar ".............\"Yu.*\""  <  493.P_falciparum.xml

I was able to utilize what I have learned with the Match utility as well as the grep + wc command to get the solution for this problem. I had to recall what the three numbers meant when I used grep + wc, but once I did, I was able to explain why the results were different with each command.

4.

After executing the command to find the pattern ATG, Match told me that there was one unique match with 830101 occurrences.
Using grep + wc, I got back 502410 lines, 502410 words, and 35671048 characters.
The grep + wc count is lower because grep prints each matching line only once, so wc counts the lines that contain ATG rather than the occurrences themselves. The Match utility gives the more accurate number of 830101 because it counts every occurrence, even when several appear on the same line.
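The gap between the two numbers can be reproduced on a small scale. This is a minimal sketch using three invented lines instead of the real file, and it assumes a grep that supports `-o` (print each match on its own line), which is not what the Match utility does internally.

```shell
# Invented sample: the first line contains ATG twice.
tmp=$(mktemp)
printf '%s\n' "ATGcccATG" "cccccc" "ggATGgg" > "$tmp"

# grep + wc: counts lines that contain ATG at least once.
line_count=$(grep "ATG" "$tmp" | wc -l | tr -d ' ')

# grep -o + wc: counts every occurrence, one per output line.
match_count=$(grep -o "ATG" "$tmp" | wc -l | tr -d ' ')

echo "lines with ATG: $line_count"    # 2
echo "occurrences:    $match_count"   # 3

rm -f "$tmp"
```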

Josh Kuroda's page