Jkuroda Week 3

From LMU BioDB 2015

Complement of a Strand

To get the complement, I immediately thought of replacing each nucleotide with its complementary base, so I used sed's transliterate command, "y/atcg/tagc/", to implement my idea.

cat sequence file | sed "y/atcg/tagc/"
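As a small worked example of the command above, the sketch below runs the same sed expression on a made-up ten-base sequence (the sequence and the variable names are assumptions for illustration, not the assignment's data).

```shell
# Hypothetical sample sequence; the real input would be the sequence file.
sequence="agcggtatac"

# y/atcg/tagc/ maps a->t, t->a, c->g, g->c in a single pass over the line.
complement=$(printf '%s\n' "$sequence" | sed "y/atcg/tagc/")
echo "$complement"
```

The y command fits here because it swaps all four bases at once. Chaining s/a/t/g and then s/t/a/g would turn the freshly written t's back into a's.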

Reading Frames

For this initial reading frame, I first thought of replacing the t's with u's and then simply letting the genetic-code.sed file do the rest for me. But this did not work: sed applies each substitution in the file, one rule after another, to the whole line, so codons were matched wherever they happened to appear rather than only in frame. I was left with a messy line of lonely nucleotides with the amino acid abbreviations between them. I thought about it for a second and realized that I could solve this issue by separating each base triplet with a space, so that only complete in-frame codons can match. That solved the problem, but there were still a couple of stray nucleotides at the end of the line, so I added a final sed command to get rid of any extra nucleotides.

+1
cat sequence file | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | sed "s/[aucg]//g"
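The effect of the spacing step can be seen on a tiny example. The sketch below uses a hypothetical four-rule stand-in for genetic-code.sed (the real file's contents and rule order are not shown here, and the one-letter uppercase codes are an assumption), together with a made-up sequence.

```shell
# Hypothetical stand-in for genetic-code.sed with only four codons;
# the real file and its rule order are not reproduced here.
code_file=$(mktemp)
cat > "$code_file" <<'EOF'
s/gcg/A/g
s/agc/S/g
s/ggu/G/g
s/aua/I/g
EOF

# Made-up DNA sequence, converted to RNA by replacing t with u.
rna=$(printf 'agcggtatac\n' | sed "s/t/u/g")

# Without spacing, gcg is matched across the boundary between codons 1 and 2.
unspaced=$(printf '%s\n' "$rna" | sed -f "$code_file")

# With spacing, only in-frame triplets can match; the stray c is then deleted.
spaced=$(printf '%s\n' "$rna" | sed "s/.../& /g" | sed -f "$code_file" | sed "s/[aucg]//g")

echo "unspaced: $unspaced"
echo "spaced:   $spaced"
rm -f "$code_file"
```

In this sketch the unspaced run leaves nucleotides stranded between amino acid letters, while the spaced run yields one letter per in-frame codon.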

For the next reading frame, the overall process is the same, with one small addition: I used a sed command to delete the first character of the sequence, so that the triplets start at the second base.

+2
cat sequence file | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | sed "s/[aucg]//g"

Similarly for this reading frame, I just added an extra character to be deleted from the beginning of the sequence.

+3
cat sequence file | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | sed "s/[aucg]//g"
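The three forward frames differ only in how many leading bases are dropped. A minimal sketch of that idea follows, assuming a hypothetical helper named frame_triplets and a made-up sequence; it stops at the triplet-marking step so that it does not depend on genetic-code.sed.

```shell
# frame_triplets SKIP SEQUENCE
# Hypothetical helper: drop the first SKIP bases, convert t to u,
# then put a space after every complete triplet.
frame_triplets() {
    printf '%s\n' "$2" | sed "s/^.\{$1\}//" | sed "s/t/u/g" | sed "s/.../& /g"
}

sequence="agcggtatac"   # made-up sample, not the assignment's sequence

for skip in 0 1 2; do
    echo "+$((skip + 1)): $(frame_triplets "$skip" "$sequence")"
done
```

Because the pattern is anchored with ^, it can match only once per line, so the g flag on the deletion commands is harmless but not required.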

For the next three reading frames, I remembered that there was a handy rev command for reversing the characters in a sequence, so I placed that command before I did the usual sequence of commands.

-1
rev sequence file | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | sed "s/[aucg]//g"

Now that the reverse command is in place, the rest of the commands are similar to the previous reading frames, with the first character deleted for -2 and the first two characters deleted for -3.

-2
rev sequence file | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | 
    sed "s/[aucg]//g"
-3
rev sequence file | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" |sed -f genetic-code.sed | 
    sed "s/[aucg]//g"
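One point worth checking for the negative frames: they are conventionally read from the complementary strand, so the reversed sequence would normally also be complemented before translation. A minimal sketch of that variant, combining rev with the complement command from the first section on a made-up sequence, might look like this:

```shell
sequence="agcggtatac"   # made-up sample, not the assignment's sequence

# Reverse the bases, then swap each base for its complement.
reverse_complement=$(printf '%s\n' "$sequence" | rev | sed "y/atcg/tagc/")
echo "$reverse_complement"
```

The result could then be fed into the same t-to-u, triplet-spacing, and genetic-code.sed steps used for the forward frames.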

Before tackling this next set of questions, I read the wiki page on using the XMLPipeDB Match Utility so that I was doing the assignment correctly.

1.

java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" <  493.P_falciparum.xml
  • There are three unique matches.
  • go:0005 - 1371 occurrences
  • go:0006 - 1100 occurrences
  • go:0007 - 113 occurrences
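A similar per-pattern tally can be sketched with standard tools, which is a handy cross-check on the Match output. The three-line input below is a hypothetical miniature, not an excerpt from 493.P_falciparum.xml, and this pipeline is case-sensitive, which may differ from how Match compares text.

```shell
# Hypothetical miniature input; the real file is 493.P_falciparum.xml.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<dbReference type="GO" id="GO:0005737"/>
<dbReference type="GO" id="GO:0006412"/>
<dbReference type="GO" id="GO:0005634"/>
EOF

# grep -o prints each match on its own line; sort | uniq -c counts each
# distinct match; awk tidies the output into pattern=count form.
counts=$(grep -o "GO:000[567]" "$sample" | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$counts"
rm -f "$sample"
```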

I spent a while trying to find an occurrence of "in situ," but I was unable to find it on my first attempt, so I skipped this problem. Later on I realized that "in situ" was a fancy way of telling us to find an original occurrence in the file, so I looked up an arbitrary one, namely GO:0007, using grep.

2.

I found original occurrences of GO:0007 using the grep command, and found that "dbReference" was a common prefix to this information. This made me think that these patterns most likely represent locations in some kind of database.

I simply entered the pattern that needed to be found, and I messed around with the pattern to find out what this information represented.

3.

java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\"  <  493.P_falciparum.xml
  • There are three unique matches.
  • "yu b." - 1 occurrence
  • "yu m." - 1 occurrence
  • "yu k." - 228 occurrences
  • I think that this pattern represents the names of people. I used this command to come to this conclusion:
java -jar xmlpipedb-match-1.1.1.jar ".............\"Yu.*\""  <  493.P_falciparum.xml

I was able to utilize what I have learned with the Match utility as well as the grep + wc command to get the solution for this problem. I had to recall what the three numbers meant when I used grep + wc, but once I did, I was able to explain why the results were different with each command.

4.

  • After executing the command to find the pattern ATG, Match told me that there was one unique match with 830101 occurrences.
  • Using grep + wc, I got back 502410 lines, 502410 words, and 35671048 characters.
  • Because grep prints each matching line only once, grep + wc counts the lines on which ATG appears rather than the occurrences themselves, so it reports a lower number. The Match utility gives a more accurate count of 830101 because it looks at each occurrence, even when ATG appears more than once on a line.
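The difference between counting lines and counting occurrences can be reproduced on a small scale. The sketch below uses a made-up three-line file (an assumption for illustration) and grep's -o option, which prints every match on its own line.

```shell
# Made-up input: ATG appears twice on line 1, once on line 2, never on line 3.
sample=$(mktemp)
printf 'ATGATG\nCCATGC\nTTTTTT\n' > "$sample"

# grep prints whole matching lines, so wc -l counts lines that contain ATG.
matching_lines=$(grep "ATG" "$sample" | wc -l | tr -d ' ')

# grep -o prints every match separately, so wc -l counts occurrences.
occurrences=$(grep -o "ATG" "$sample" | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $occurrences"
rm -f "$sample"
```

Here two lines contain ATG but there are three occurrences in total, the same kind of gap seen between the grep + wc and Match results.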

Josh Kuroda's page
