Bklein7 Week 3

From LMU BioDB 2015
Revision as of 00:13, 21 September 2015 by Bklein7 (Talk | contribs) (Added the section on XMLPipeDB Match practice)


The Genetic Code, by Computer

Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:

   cat sequence_file | sed "y/atcg/tagc/"
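
As a quick sanity check, the same `y` command can be tried on a short literal sequence instead of a file. The sketch below wraps it in a shell function called `complement` (a hypothetical helper name, not part of the assignment) and assumes lowercase input on a single line, as in the command above.

```shell
# complement: read a lowercase nucleotide sequence on stdin and print its
# complementary strand (hypothetical helper name, for illustration only).
complement() {
  sed "y/atcg/tagc/"
}

# Try it on a literal sequence rather than sequence_file.
printf 'atcggatc\n' | complement   # prints: tagcctag
```

The result is written in the same left-to-right order as the input. Piping it through `rev`, as the negative reading frames below do, gives the complementary strand read in the opposite direction.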

Reading Frames

Write 6 sets of text processing commands that, when given a nucleotide sequence, return the resulting amino acid sequence, one for each possible reading frame of the nucleotide sequence.

Outputs generated using ~dondi/xmlpipedb/data/prokaryote.txt:

  • +1 Reading Frame
   cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"
   Output: STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
  • +2 Reading Frame
   cat sequence_file | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"
   Output: LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
  • +3 Reading Frame
   cat sequence_file | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"
   Output: YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
  • -1 Reading Frame
   cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acgu]//g" |
   sed "s/ //g"
   Output: VKMPPAELAAGLYAEHGIRQRFKENIVFFGHRTY-NIV
  • -2 Reading Frame
   cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
   sed "s/[acgu]//g" | sed "s/ //g"
   Output: LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
  • -3 Reading Frame
   cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
   sed "s/[acgu]//g" | sed "s/ //g"
   Output: -NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
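
The six pipelines above differ in only two ways: whether the strand is complemented and reversed first, and how many leading nucleotides are dropped. A minimal sketch of that idea follows, using a hypothetical function `codons STRAND OFFSET` that prints the space-separated RNA codons of one frame. It assumes a single-line, lowercase sequence on stdin and stops before translation, since the contents of genetic-code.sed are not shown here.

```shell
# codons STRAND OFFSET: print the RNA codons of one reading frame.
#   STRAND is "+" or "-"; OFFSET is the number of leading nucleotides
#   to skip (0, 1, or 2). Hypothetical helper, for illustration only.
codons() {
  strand=$1
  offset=$2
  if [ "$strand" = "-" ]; then
    # Reverse complement: complement each base, then reverse the line.
    sed "y/atcg/tagc/" | rev
  else
    cat
  fi |
    sed "s/t/u/g" |               # DNA alphabet -> RNA alphabet
    sed "s/^.\{$offset\}//" |     # drop OFFSET leading nucleotides
    sed "s/.../& /g" |            # put a space after every full codon
    sed -e "s/[acgu]\{0,2\}$//" -e "s/ $//"   # drop leftover bases and the final space
}

printf 'atggccac\n' | codons + 0   # prints: aug gcc
printf 'atggcc\n' | codons - 1     # prints: gcc
```

In this numbering, the +1 frame corresponds to `codons + 0` and the -3 frame to `codons - 2`. Appending `| sed -f genetic-code.sed | sed "s/ //g"` would then translate the codons as in the commands above.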

Check Your Work

The ExPASy Translate Tool was used to confirm that the output sequences generated above were accurate translations. Below is the output generated by ExPASy for the same nucleotide sequence adapted from prokaryote.txt.

ExPASy Prokaryote.txt Translation.png

XMLPipeDB Match Practice

To begin this assignment, I accessed the directory ~dondi/xmlpipedb/data.

  1. What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
    • The necessary command and its output are listed below:
      • java -jar xmlpipedb-match-1.1.1.jar GO:000[567] <493.P_falciparum.xml
      • go:0007: 113
      • go:0006: 1100
      • go:0005: 1371
      • Total unique matches: 3
    • How many unique matches are there?
      • Referencing the output above, there are 3 unique matches.
    • How many times does each unique match appear?
      • The unique match go:0005 appeared 1371 times. The unique match go:0006 appeared 1100 times. The unique match go:0007 appeared 113 times.
  2. Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    • To find occurrences of the patterns matched above in their original positions in the file 493.P_falciparum.xml, we can use the grep command. It searches for a given pattern and prints every line that contains it, which shows each occurrence "in situ".
    • The command I used to do this and its initial outputs are listed below:
      • grep "GO:0005" 493.P_falciparum.xml
      • <dbReference type="GO" id="GO:0005884">
      • <dbReference type="GO" id="GO:0005737">
    • Based on where you find this occurrence, what kind of information does this pattern represent?
      • This pattern appears to represent an ID number under the general category of "GO". A quick skim of the file using the more command reveals many nucleotide sequences and references to genes. Together with a quick Google search, this led me to deduce that "GO" stands for "Gene Ontology" and that these are Gene Ontology IDs: unique identifiers for the Gene Ontology terms associated with each entry. This was confirmed by the wiki page on Using the XMLPipeDB Match Utility.
  3. What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
    • The necessary command and its output are listed below:
      • java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" <493.P_falciparum.xml
      • "yu b.": 1
      • "yu k.": 228
      • "yu m.": 1
      • Total unique matches: 3
    • How many unique matches are there?
      • Referencing the above output, there are 3 unique matches.
    • How many times does each unique match appear?
      • The unique match "yu b." occurred 1 time. The unique match "yu k." occurred 228 times. The unique match "yu m." occurred 1 time.
    • What information do you think this pattern represents?
      • This information appears to represent the names of different article authors as would be referenced in an APA citation (last name, abbreviated first name).
      • This deduction was confirmed using the command grep "Yu.*" 493.P_falciparum.xml, which yielded outputs such as <person name="Yu K."/>.
  4. Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
    • What answer does Match give you?
      • The command java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa yielded the output atg:830101. Thus, Match found the nucleotide sequence atg 830101 times in the above file.
    • What answer does grep + wc give you?
      • The command grep "ATG" hs_ref_GRCh37_chr19.fa | wc yielded the output 502410 502410 35671048. The first and second output values, the line count and word count respectively, both suggest that the sequence atg occurs 502410 times in the file. This is fewer than the count reported by the XMLPipeDB Match Utility. As an aside, it seems strange that the line and word counts are exactly the same, since this sequence could potentially occur more than once per line, particularly when it occurs so many times in the file.
    • Explain why the counts are different.
      • Piggybacking on the aside above, I decided to see if the sequence atg appeared more than once in a single line of the grep output. Therefore, I used the command sequence grep "ATG" hs_ref_GRCh37_chr19.fa | more to page through the grep output. In the very first output line, I identified three occurrences of the sequence "atg": GGGACAGGCCCTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGT. Therefore, it becomes apparent that the pipeline linking grep and wc was merely counting the number of lines in which the sequence atg appeared, not the total number of times the pattern atg was present in the file. This also explains why the line and word counts match: each sequence line is a single run of letters with no whitespace, so wc counts every matching line as exactly one word. Conversely, the XMLPipeDB Match Utility counts the number of times the three-character pattern itself is present in the file.
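
The gap between counting matching lines and counting individual matches can be reproduced on a tiny input. The sketch below uses three made-up sequence lines (not taken from hs_ref_GRCh37_chr19.fa) and assumes a grep that supports the `-o` option, which GNU and BSD grep provide but POSIX does not require. The function names are hypothetical.

```shell
# Three made-up sequence lines: the first contains ATG twice,
# the second once, and the third not at all.
sample() {
  printf '%s\n' 'CCATGGGATGA' 'TTTATGCCC' 'GGGCCC'
}

# Lines containing ATG at least once: this is what grep | wc reports.
count_lines() {
  sample | grep "ATG" | wc -l | tr -d ' '
}

# Individual ATG matches: -o prints each match on its own line,
# so counting those lines counts occurrences.
count_matches() {
  sample | grep -o "ATG" | wc -l | tr -d ' '
}

echo "lines with ATG:  $(count_lines)"     # 2
echo "ATG occurrences: $(count_matches)"   # 3
```

Run against the chromosome file, the same `grep -o "ATG" ... | wc -l` form would be expected to land much closer to the Match tally than the line count does, though whether the two agree exactly depends on how Match handles details such as case and line boundaries, which is not verified here.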