Bklein7 Week 3

The Genetic Code, by Computer

Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand.

I began with a cat command to view a text file containing a DNA sequence.
After the pipe character, I added a sed command that performs a letter-by-letter replacement of a, t, c, and g with their complementary bases (t, a, g, and c, in that order).

cat sequence_file | sed "y/atcg/tagc/"
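A quick way to sanity-check the command above is to feed it a short literal sequence instead of a file. The sketch below wraps the same sed expression in a function called complement (a hypothetical name chosen for illustration) and assumes lowercase input, as in the original command.

```shell
# complement: hypothetical helper wrapping the sed command above.
# y/// replaces each character of the first set with the character
# at the same position in the second set.
complement() {
  sed "y/atcg/tagc/"
}

printf 'atgcc\n' | complement   # prints tacgg
```

Uppercase sequences would pass through unchanged, so a file in uppercase would need the sets extended (for example y/atcgATCG/tagcTAGC/).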

Reading Frames

Building the Command Sequences

Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence.

To tackle this problem, I started by crafting the simplest sequence of commands that carries out the central dogma by translating a DNA sequence into an amino acid sequence. This command sequence served as the "backbone" from which all 6 reading frame sequences were built. Incidentally, it is also exactly the command sequence needed to translate the +1 reading frame.
1. I began with a cat command to view a text file with a DNA sequence
  - cat sequence_file
2. I added a sed command that represented the transcription of DNA to RNA, replacing "T" nucleotides with "U" nucleotides
  - cat sequence_file | sed "s/t/u/g"
3. I added a sed command that represented the (+1) reading frame for the codons and inserted a space after every three characters
  - cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g"
4. I added a sed command to translate each codon to its associated amino acid. For this exercise, the simplest way I found to do this was to read from the rules file ~dondi/xmlpipedb/data/genetic-code.sed
  - cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed
5. I added a sed command to remove lingering nucleotides at the beginning and/or end of the output that were not translated due to not being present in a triplet in the present reading frame. This prevented confusing these untranslated nucleotides with amino acids
  - cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g"
6. Finally, I removed the spaces between the individual amino acid designations to condense the output, which serves both aesthetic and practical purposes. This yielded the final command sequence:
  - cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
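The six steps above can be traced on a short sequence without access to genetic-code.sed. The sketch below is only an illustration under stated assumptions: it builds a three-rule stand-in rules file (hypothetical; the real file's contents are not shown here, and the rule form s/aug/M/g is an assumption), and it strips [acgu] rather than [atcg] in the cleanup step, since any leftover t has already become u by that point.

```shell
# Hypothetical stand-in for genetic-code.sed covering only three codons.
rules=$(mktemp)
printf '%s\n' 's/aug/M/g' 's/gcc/A/g' 's/uaa/-/g' > "$rules"

# translate_plus1: hypothetical helper mirroring the backbone pipeline.
translate_plus1() {
  sed "s/t/u/g" |        # transcribe: t -> u
    sed "s/.../& /g" |   # split into codons
    sed -f "$rules" |    # codon -> amino acid letter
    sed "s/[acgu]//g" |  # drop leftover, untranslated nucleotides
    sed "s/ //g"         # remove the codon separators
}

printf 'atggcctaaga\n' | translate_plus1   # prints MA-
```

Here the trailing "ga" never forms a full codon, so it survives translation and is removed by the cleanup step.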

+1 Reading Frame
- The command sequence above already processed the genetic code within the +1 reading frame:

cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"

+2 Reading Frame
- I simply altered the +1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:

cat sequence_file | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"

+3 Reading Frame
- I altered the +1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:

cat sequence_file | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
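To see how dropping leading nucleotides shifts the codon boundaries, the frame offset can be isolated from the rest of the pipeline. This is a minimal sketch; frame_codons is a hypothetical helper name, and its argument is the number of leading characters to delete.

```shell
# frame_codons: hypothetical helper; $1 = leading nucleotides to drop (0, 1, or 2).
frame_codons() {
  sed "s/^.\{$1\}//" | sed "s/.../& /g"
}

printf 'augcgauac\n' | frame_codons 0   # aug cga uac (plus a trailing space)
printf 'augcgauac\n' | frame_codons 1   # ugc gau ac
printf 'augcgauac\n' | frame_codons 2   # gcg aua c
```

Because the pattern is anchored with ^, it can match only once per line, so the trailing g flag in the commands above has no effect either way.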

-1 Reading Frame
- The "backbone" command sequence from above had to be slightly altered for the negative reading frames to account for the reading of a complementary, antiparallel strand of DNA
  1. I began with the code from the +1 reading frame
    - cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
  2. I added the sed command from the "Complement of a Strand" section above at the beginning of this sequence. This yielded the DNA sequence complementary to the one we were originally working with
    - cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
  3. I added a rev command after the sequence had been transcribed to the complementary mRNA. This accounts for the fact that the genetic code is read in the 5' to 3' direction. With this final change, the command sequence translated DNA sequences within the -1 reading frame:

cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" |
sed "s/ //g"
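The complement-then-reverse step is easiest to verify in isolation. The sketch below (revcomp is a hypothetical helper name) combines only the sed complement and rev, leaving out transcription and translation.

```shell
# revcomp: hypothetical helper producing the reverse complement of a lowercase DNA line.
revcomp() {
  sed "y/atcg/tagc/" | rev
}

printf 'atgc\n' | revcomp    # prints gcat
printf 'aaccg\n' | revcomp   # prints cggtt
```

Since rev works line by line, this assumes the whole sequence sits on a single line of the input.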

-2 Reading Frame
- As in the positive reading frames, I altered the -1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:

cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
sed "s/[atcg]//g" | sed "s/ //g"

-3 Reading Frame
- I altered the -1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:

cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
sed "s/[atcg]//g" | sed "s/ //g"

Testing the Command Sequences

Prior to testing any commands, I entered the directory ~dondi/xmlpipedb/data to access the files present there.

+1 Reading Frame

   cat prokaryote.txt | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
   Output: STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL

+2 Reading Frame

   cat prokaryote.txt | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
   Output: LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-

+3 Reading Frame

   cat prokaryote.txt | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
   Output: YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN

-1 Reading Frame

   cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" |
   sed "s/ //g"
   Output: VKMPPAELAAGLYAEHGIRQRFKENIVFFGHRTY-NIV

-2 Reading Frame

   cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | 
   sed "s/[atcg]//g" | sed "s/ //g"
   Output: LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--

-3 Reading Frame

   cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed |
   sed "s/[atcg]//g" | sed "s/ //g"
   Output: -NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
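The six test commands above differ only in the strand and in the number of leading nucleotides dropped, so they can be folded into one loop. The following is a sketch under assumptions rather than the commands actually run: six_frames is a hypothetical function, it takes the sequence file and the sed rules file as arguments, it reverses before transcribing (equivalent to the order used above), and it strips [acgu] instead of [atcg] during cleanup because t has already been replaced by u at that stage.

```shell
# six_frames: hypothetical helper; $1 = sequence file, $2 = sed rules file
# (for example genetic-code.sed). Prints one labelled line per reading frame.
six_frames() {
  for strand in + -; do
    for skip in 0 1 2; do
      if [ "$strand" = "+" ]; then
        cat "$1"
      else
        sed "y/atcg/tagc/" "$1" | rev   # reverse complement for negative frames
      fi |
        sed "s/t/u/g" | sed "s/^.\{$skip\}//" | sed "s/.../& /g" |
        sed -f "$2" | sed "s/[acgu]//g" | sed "s/ //g" |
        sed "s/^/$strand$((skip + 1)): /"   # label the frame, e.g. "+1: "
    done
  done
}

# Usage, assuming the current directory is ~dondi/xmlpipedb/data:
#   six_frames prokaryote.txt genetic-code.sed
```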

Check Your Work

The ExPASy Translate Tool was used to confirm that the output sequences generated above were accurate translations. Below is the output generated by ExPASy for the same nucleotide sequence adapted from prokaryote.txt.

XMLPipeDB Match Practice

To begin this assignment, I accessed the directory ~dondi/xmlpipedb/data.

What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
  - java -jar xmlpipedb-match-1.1.1.jar GO:000[567] <493.P_falciparum.xml
  - go:0007: 113
  - go:0006: 1100
  - go:0005: 1371
  - Total unique matches: 3
- How many unique matches are there?
  - Referencing the output above, there are 3 unique matches.
- How many times does each unique match appear?
  - The unique match go:0005 appeared 1371 times. The unique match go:0006 appeared 1100 times. The unique match go:0007 appeared 113 times.
Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
- In order to find occurrences of the patterns matched above within their original positions in the file 493.P_falciparum.xml, I used the grep command. This command searches for a given pattern and prints every line that contains it, which shows each occurrence "in situ".
- The command I used to do this and its initial outputs are listed below:
  - grep "GO:0005" 493.P_falciparum.xml
  - <dbReference type="GO" id="GO:0005884">
  - <dbReference type="GO" id="GO:0005737">
- Based on where you find this occurrence, what kind of information does this pattern represent?
  - This pattern appears to represent an ID number under the general category of "GO". A quick skim of the file using the more command reveals many nucleotide sequences and references to genes. Combined with a quick Google search, I deduced that GO stands for "Gene Ontology" and that these IDs are unique identifiers for the Gene Ontology terms associated with the genes. This was confirmed by the wiki page on Using the XMLPipeDB Match Utility.
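To look at the neighboring content around a single occurrence without paging through the whole file, grep can limit itself to the first match and print surrounding lines. This is a minimal sketch; show_in_situ is a hypothetical helper name, and the choice of two context lines is arbitrary.

```shell
# show_in_situ: hypothetical helper; $1 = pattern, $2 = file.
# -m 1 stops after the first matching line; -C 2 prints up to two lines
# of context around it.
show_in_situ() {
  grep -m 1 -C 2 "$1" "$2"
}

# Usage, assuming the current directory is ~dondi/xmlpipedb/data:
#   show_in_situ "GO:0005" 493.P_falciparum.xml
```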
What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
  - java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" <493.P_falciparum.xml
  - "yu b.": 1
  - "yu k.": 228
  - "yu m.": 1
  - Total unique matches: 3
- How many unique matches are there?
  - Referencing the above output, there are 3 unique matches.
- How many times does each unique match appear?
  - The unique match "yu b." occurred 1 time. The unique match "yu k." occurred 228 times. The unique match "yu m." occurred 1 time.
- What information do you think this pattern represents?
  - This information appears to represent the names of different article authors as would be referenced in an APA citation (last name, abbreviated first name).
  - This deduction was confirmed using the command grep "Yu.*" 493.P_falciparum.xml, which yielded outputs such as <person name="Yu K."/>.
Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
- What answer does Match give you?
  - The command java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa yielded the output atg:830101. Thus, Match found the nucleotide sequence ATG 830101 times in the file.
- What answer does grep + wc give you?
  - The command grep "ATG" hs_ref_GRCh37_chr19.fa | wc yielded the output 502410 502410 35671048. The first and second output values, the line count and word count respectively, both suggest that the sequence ATG exists 502410 times in the file. This is fewer than the count given by the XMLPipeDB Match Utility. As an aside, the line and word counts are identical because each matching line is an unbroken run of nucleotides with no spaces, so wc treats every line as a single word. This hints that the figure is a count of lines rather than of occurrences, since the sequence could easily appear more than once per line, particularly when it occurs so many times in the file.
- Explain why the counts are different.
  - Following up on the aside above, I checked whether the sequence ATG appeared more than once in a single line of the grep output. I used the command sequence grep "ATG" hs_ref_GRCh37_chr19.fa | more to page through the grep output. The very first output line already contains more than one occurrence of ATG (three, in fact): GGGACAGGCCCTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGT. Therefore, the pipeline linking grep and wc was merely counting the number of lines in which ATG appeared, not the total number of times the pattern was present in the file. Conversely, the XMLPipeDB Match Utility counts every occurrence of the three-character pattern itself.
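The difference between counting lines and counting occurrences can be reproduced on a few lines of made-up input. In this sketch, count_lines and count_hits are hypothetical helper names; the second relies on grep -o, which prints each non-overlapping match on its own line so that wc -l counts matches instead of lines.

```shell
# count_lines mirrors grep + wc: the number of lines containing the pattern.
count_lines() { grep "$1" | wc -l; }

# count_hits counts every non-overlapping match, one output line per match.
count_hits() { grep -o "$1" | wc -l; }

printf 'CCATGATGA\nTTTT\nATGC\n' | count_lines ATG   # 2
printf 'CCATGATGA\nTTTT\nATGC\n' | count_hits ATG    # 3
```

Whether grep -o on hs_ref_GRCh37_chr19.fa would reproduce Match's exact figure is not established here; it depends on how Match treats letter case and matches that span line breaks.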

Links

User Page: Brandon Klein
Team Page: The Class Whoopers

