Jkuroda Week 3
Complement of a Strand
To get the complement, I replaced each nucleotide with its complementary base in a single pass, using sed's transliteration command "y/atcg/tagc/".
cat sequence file | sed "y/atcg/tagc/"
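As a quick sanity check of the transliteration, the same sed expression can be run on a short made-up sequence. The `complement` function name and the sample input are illustrative assumptions, not part of the assignment.

```shell
# Complement a lowercase DNA sequence read from standard input.
# y/atcg/tagc/ maps a->t, t->a, c->g, g->c in a single pass.
complement() {
  sed "y/atcg/tagc/"
}

# Hypothetical sample input; prints: tacggc
printf 'atgccg\n' | complement
```

Because `y` maps characters one-to-one, the a->t and t->a swaps do not interfere with each other, which chained `s/a/t/g` and `s/t/a/g` substitutions would.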
Reading Frames
For this initial reading frame, I first replaced the t's with u's, then used the genetic-code.sed file to do the rest for me. But this did not work, because sed applies the commands in the script one at a time, each across the whole line, so codons were matched wherever they happened to appear rather than in frame. I was left with a messy line of lonely nucleotides with the amino acid abbreviations between them. I realized that I could solve this by separating each base triplet with a space. That fixed the problem, but there were still a couple of stray nucleotides at the end of the line, so I added a final sed command to remove any leftover nucleotides.
+1 cat sequence file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
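The stages of this pipeline can be inspected one at a time on a short made-up sequence. This is only a sketch: genetic-code.sed is not reproduced here, so `translate_demo` is a hypothetical two-rule stand-in that assumes the real file uses rules of the form `s/codon/LETTER/g` with uppercase amino acid letters. The function names and the sample input are likewise assumptions.

```shell
# Stage 1: transcribe t -> u, then put a space after every complete triplet.
split_codons() {
  sed "s/t/u/g" | sed "s/.../& /g"
}

# Stage 2: hypothetical two-rule stand-in for genetic-code.sed
# (assumed format; the real file would cover all 64 codons).
translate_demo() {
  sed -e "s/aug/M/g" -e "s/ccc/P/g"
}

# Stage 3: delete any leftover lowercase nucleotides.
strip_strays() {
  sed "s/[aucg]//g"
}

# After splitting, prints: aug ccc gg
printf 'atgcccgg\n' | split_codons

# Full sketch: the incomplete trailing "gg" is removed by the last stage.
printf 'atgcccgg\n' | split_codons | translate_demo | strip_strays
```

The trailing `gg` is not a complete triplet, so `s/.../& /g` leaves it alone and no codon rule can match it, which is why the final cleanup step is needed.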
For the +2 reading frame, the process is the same with one addition: a sed command that deletes the first character of the sequence before anything else runs.
+2 cat sequence file | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
Similarly for this reading frame, I just added an extra character to be deleted from the beginning of the sequence.
+3 cat sequence file | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
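To see how dropping leading characters shifts the triplet boundaries, the three forward frames can be printed side by side for a made-up sequence. In this sketch `cut -c` is swapped in for the `sed "s/^.//"` deletions purely so that one function can handle every offset; the function name `codons_in_frame` and the sample input are assumptions.

```shell
# Drop the first N characters (N = 0, 1, or 2), then split into triplets.
# cut -c(N+1)- keeps everything from character N+1 onward.
codons_in_frame() {
  cut -c"$(( $1 + 1 ))"- | sed "s/.../& /g"
}

# Print the codon grouping for frames +1, +2, and +3 of a sample sequence.
for offset in 0 1 2; do
  printf 'atgcccgga\n' | codons_in_frame "$offset"
done
```

For this input the groupings are `atg ccc gga`, `tgc ccg ga`, and `gcc cgg a`, so each offset yields a different set of codons plus a different incomplete remainder.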
For the next three reading frames, I remembered that there was a handy rev command for reversing the characters in a sequence, so I placed that command before I did the usual sequence of commands.
-1 cat sequence file | rev | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
Now that the reverse command is in place, the rest of the commands are similar to the previous reading frames, with the deletion of the first and second characters for -2 and -3, respectively.
-2 cat sequence file | rev | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
-3 cat sequence file | rev | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[aucg]//g"
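The commands above reverse the strand with rev alone. Reverse-strand reading frames are conventionally read from the reverse complement, so one possible variant also complements the bases, reusing the "y/atcg/tagc/" expression from the first section. This is a sketch under that assumption about what the exercise intends; the function name and sample input are made up.

```shell
# Reverse the sequence, then complement each base.
# rev reverses the characters on each line of standard input.
reverse_complement() {
  rev | sed "y/atcg/tagc/"
}

# Hypothetical sample input; prints: cggcat
printf 'atgccg\n' | reverse_complement
```

The output of `reverse_complement` could then be fed into the same t-to-u, triplet-splitting, and translation steps used for the forward frames.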
Before tackling this next set of questions, I read the wiki page on using the XMLPipeDB Match Utility so that I was doing the assignment correctly.
1.
java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" < 493.P_falciparum.xml
- There are three unique matches.
- go:0005 - 1371 occurrences
- go:0006 - 1100 occurrences
- go:0007 - 113 occurrences
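A similar per-pattern tally can be approximated with standard tools. This is a rough analog only, not a description of how the Match utility works internally: `grep -o` prints every match on its own line, and `sort | uniq -c` counts the distinct ones. The function name `count_unique` and the sample text are assumptions.

```shell
# Print each distinct match of a pattern with its number of occurrences.
count_unique() {
  grep -o "$1" | sort | uniq -c
}

# Made-up sample text: GO:0005 appears twice, GO:0008 is outside the pattern.
printf 'GO:0005 GO:0006\nGO:0005\nGO:0007 GO:0008\n' | count_unique "GO:000[567]"
```

On the sample text this reports three distinct matches, with GO:0005 counted twice and GO:0008 excluded by the `[567]` character class.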
I spent a while trying to find an occurrence of "in situ," but I was unable to find it on my first attempt, so I skipped this problem. Later on I realized that "in situ" was a fancy way of telling us to find an original occurrence in the file, so I looked up an arbitrary one, namely GO:0007, using grep.
2.
I found original occurrences of GO:0007 using the grep command, and found that "dbReference" was a common prefix to this information. This made me think that these patterns most likely represent locations in some kind of database.
I simply entered the pattern that needed to be found, and I messed around with the pattern to find out what this information represented.
3.
java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
- There are three unique matches.
- "yu b." - 1 occurrence
- "yu m." - 1 occurrence
- "yu k." - 228 occurrences
- I think that this pattern represents the names of people. I used this command to come to this conclusion:
java -jar xmlpipedb-match-1.1.1.jar ".............\"Yu.*\"" < 493.P_falciparum.xml
I was able to utilize what I have learned with the Match utility as well as the grep + wc command to get the solution for this problem. I had to recall what the three numbers meant when I used grep + wc, but once I did, I was able to explain why the results were different with each command.
4.
- After executing the command to find the pattern ATG, Match told me that there was one unique match with 830101 occurrences.
- Using grep + wc, I got back 502410 lines, 502410 words, and 35671048 characters.
- Because grep prints each matching line only once, no matter how many times the pattern appears on it, grep + wc counts matching lines rather than occurrences, which gives a lower number. The Match utility reports the more accurate figure of 830101 because it counts every occurrence, even when several appear on the same line.
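The difference between counting lines and counting occurrences can be reproduced on a tiny made-up input. In this sketch, `sample` is a hypothetical three-line stand-in for the XML file, and `grep -o` is used as a stand-in for per-occurrence counting; it is not the Match utility itself.

```shell
# Made-up three-line sample: the first line contains ATG twice.
sample() {
  printf 'ATGATG\nCCATGC\nCCCCCC\n'
}

# Lines containing at least one match (what grep + wc -l reports): 2
sample | grep "ATG" | wc -l

# Individual matches: grep -o prints each match on its own line: 3
sample | grep -o "ATG" | wc -l
```

Run without options, wc prints the line, word, and character counts together, which is where the three numbers in the grep + wc result come from.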