Bklein7 Week 3
From LMU BioDB 2015
The Genetic Code, by Computer
Complement of a Strand
Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand.
- I began with a cat command to view a text file containing a DNA sequence.
- After the pipe character, I added a sed command that performs a letter-by-letter replacement of a, t, c, and g with their complementary bases (t, a, g, and c, in that order). As written, the command handles lowercase bases only.
cat sequence_file | sed "y/atcg/tagc/"
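As a quick check of the command above, here is a minimal sketch that runs the same sed expression on a short made-up sequence instead of a file. The function name complement and the sample sequence are assumptions for illustration only.

```shell
# complement: read a lowercase DNA sequence on stdin and print the
# complementary strand (a<->t, c<->g) using the same sed "y" command as above.
complement() {
  sed "y/atcg/tagc/"
}

# Hypothetical 10-base sample sequence, not taken from any course file.
printf 'agcggtatac\n' | complement
# prints: tcgccatatg
```

Because y maps characters one-for-one, the output always has the same length as the input, and any character outside a, t, c, g passes through unchanged.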
Reading Frames
Building the Command Sequences
Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence.
- To tackle this problem, I started by crafting the simplest sequence of commands that carries out the central dogma by translating a DNA sequence into an amino acid sequence. This command sequence served as the "backbone" from which all 6 reading frame sequences were built. Incidentally, it is also the command sequence needed to translate the +1 reading frame.
- I began with a cat command to view a text file with a DNA sequence
- cat sequence_file
- I added a sed command that represented the transcription of DNA to RNA, replacing "T" nucleotides with "U" nucleotides
- cat sequence_file | sed "s/t/u/g"
- I added a sed command that represented the (+1) reading frame for the codons and inserted a space after every three characters
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g"
- I added a sed command to translate each codon to its associated amino acid. For this exercise, the simplest way I found to do this was to read from the rules file ~dondi/xmlpipedb/data/genetic-code.sed
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed
- I added a sed command to remove lingering nucleotides at the beginning and/or end of the output that were not translated due to not being present in a triplet in the present reading frame. This prevented confusing these untranslated nucleotides with amino acids
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g"
- Finally, I removed spaces from in between the separate amino acid designations to condense the output (has both aesthetic and practical purposes). This yielded the final command sequence:
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- +1 Reading Frame
- The command sequence above already processed the genetic code within the +1 reading frame:
cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
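The stages of this pipeline can be traced on a tiny example. The sketch below is not the course setup: it replaces genetic-code.sed with a four-rule stand-in whose format (one s/codon/letter/g rule per line) is an assumption, and the function name translate_plus1 and the sample sequence are hypothetical.

```shell
# Stand-in for genetic-code.sed: a few real codon assignments written in an
# ASSUMED rule format. The course file presumably covers all 64 codons.
rules=$(mktemp)
trap 'rm -f "$rules"' EXIT
cat > "$rules" <<'EOF'
s/aug/M/g
s/ccg/P/g
s/gaa/E/g
s/uaa/-/g
EOF

# translate_plus1: the +1 pipeline from above, reading DNA on stdin.
translate_plus1() {
  sed "s/t/u/g" | sed "s/.../& /g" | sed -f "$rules" | sed "s/[atcg]//g" | sed "s/ //g"
}

# Hypothetical 14-base sample: atg ccg gaa taa plus two leftover bases "cg".
printf 'atgccggaataacg\n' | translate_plus1
# prints: MPE-
```

One detail worth noting: after the t-to-u step, a leftover base can be u, which the class [atcg] does not match. A class such as [acgu] would cover that case; the sample here ends in "cg", so it is unaffected.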
- +2 Reading Frame
- I simply altered the +1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:
cat sequence_file | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- +3 Reading Frame
- I altered the +1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:
cat sequence_file | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
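To see how deleting leading bases shifts the frame, the following sketch isolates the offset and triplet-splitting steps from the three commands above. The helper name codons_in_frame and the sample sequence are assumptions for illustration.

```shell
# codons_in_frame N: read an RNA sequence on stdin, drop the first N-1 bases
# (N = 1, 2, or 3), then split the rest into space-separated triplets.
codons_in_frame() {
  case "$1" in
    1) cat ;;
    2) sed "s/^.//" ;;
    3) sed "s/^..//" ;;
  esac | sed "s/.../& /g"
}

# Hypothetical 9-base sample sequence; prints one line per frame.
for n in 1 2 3; do
  printf 'augccguaa\n' | codons_in_frame "$n"
done
```

For the sample, frame 1 yields the triplets aug ccg uaa, frame 2 yields ugc cgu with aa left over, and frame 3 yields gcc gua with a left over. Since ^ anchors the match to the start of the line, the g flag in the original s/^.//g is harmless but not needed. sed works line by line, so this assumes the whole sequence sits on a single line.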
- -1 Reading Frame
- The "backbone" command sequence from above had to be slightly altered for the negative reading frames to account for the reading of a complementary, antiparallel strand of DNA
- I began with the code from the +1 reading frame
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- I added the sed command from the "Complement of a Strand" section above at the beginning of this sequence. This yielded the DNA sequence complementary to the one we were originally working with
- cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- I added a rev command after the sequence had been transcribed to the complementary mRNA. This accounts for the fact that the genetic code is read in the 5' to 3' direction. With this final change, the command sequence translated DNA sequences within the -1 reading frame:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
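The first three stages of the -1 command (complement, transcribe, reverse) can be checked on their own. This is a minimal sketch; the function name reverse_complement_rna and the sample sequence are assumptions for illustration.

```shell
# reverse_complement_rna: complement the DNA, transcribe t->u, then reverse,
# mirroring the first three stages of the -1 pipeline above.
reverse_complement_rna() {
  sed "y/atcg/tagc/" | sed "s/t/u/g" | rev
}

# Hypothetical sample: atgccg -> complement tacggc -> RNA uacggc -> reversed cggcau
printf 'atgccg\n' | reverse_complement_rna
# prints: cggcau
```

The result is the other strand written 5' to 3', so the triplet-splitting and translation steps that follow can treat it exactly like a positive-frame sequence.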
- -2 Reading Frame
- As in the positive reading frames, I altered the -1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- -3 Reading Frame
- I altered the -1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Testing the Command Sequences
Prior to testing any commands, I entered the directory ~dondi/xmlpipedb/data to access the files present there.
- +1 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
- +2 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
- +3 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
- -1 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: VKMPPAELAAGLYAEHGIRQRFKENIVFFGHRTY-NIV
- -2 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
- -3 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: -NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
Check Your Work
The ExPASy Translate Tool was used to confirm that the output sequences generated above were accurate translations. Below is the output generated by ExPASy for the same nucleotide sequence adapted from prokaryote.txt.
XMLPipeDB Match Practice
To begin this assignment, I accessed the directory ~dondi/xmlpipedb/data.
- What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
java -jar xmlpipedb-match-1.1.1.jar GO:000[567] <493.P_falciparum.xml
go:0007: 113
go:0006: 1100
go:0005: 1371
Total unique matches: 3
- How many unique matches are there?
- Referencing the output above, there are 3 unique matches.
- How many times does each unique match appear?
- The unique match go:0005 appeared 1371 times. The unique match go:0006 appeared 1100 times. The unique match go:0007 appeared 113 times.
- Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
- In order to find occurrences of the patterns matched above within their original positions in the file 493.P_falciparum.xml, I used the grep command. This command searches for a given pattern and prints every line that contains it, which shows each occurrence "in situ".
- The command I used to do this and its initial outputs are listed below:
grep "GO:0005" 493.P_falciparum.xml
<dbReference type="GO" id="GO:0005884">
<dbReference type="GO" id="GO:0005737">
- Based on where you find this occurrence, what kind of information does this pattern represent?
- This pattern appears to represent an ID number under the general category of "GO". A quick skim of the file using the more command reveals many nucleotide sequences and references to genes. Combined with a quick Google search, I deduced that this stands for "Gene Ontology" IDs: identifiers for the Gene Ontology terms attached to the genes. This was confirmed by the wiki page on Using the XMLPipeDB Match Utility.
- What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" <493.P_falciparum.xml
"yu b.": 1
"yu k.": 228
"yu m.": 1
Total unique matches: 3
- How many unique matches are there?
- Referencing the above output, there are 3 unique matches.
- How many times does each unique match appear?
- The unique match "yu b." occurred 1 time. The unique match "yu k." occurred 228 times. The unique match "yu m." occurred 1 time.
- What information do you think this pattern represents?
- This information appears to represent the names of different article authors as would be referenced in an APA citation (last name, abbreviated first name).
- This deduction was confirmed using the command grep "Yu.*" 493.P_falciparum.xml, which yielded outputs such as <person name="Yu K."/>.
- Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
- What answer does Match give you?
- The command java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa yielded the output atg:830101. Thus, Match found the nucleotide sequence atg 830101 times in the above file.
- What answer does grep + wc give you?
- The command grep "ATG" hs_ref_GRCh37_chr19.fa | wc yielded the output 502410 502410 35671048. The first and second values, the line count and word count respectively, both suggest that the sequence atg exists 502410 times in the file. This is less than was yielded by the XMLPipeDB Match Utility. As an aside, the line and word counts are identical because each matching line contains no whitespace, so wc counts it as a single word. Neither value accounts for the sequence occurring more than once on a line, which seems likely given how often it occurs in the file.
- Explain why the counts are different.
- Piggybacking on the aside above, I decided to see if the sequence atg existed more than once in a single line in the grep output. Therefore, I used the command sequence grep "ATG" hs_ref_GRCh37_chr19.fa | more to view the grep output one screen at a time. In the very first output line, I identified more than one occurrence of the sequence "atg": GGGACAGGCCCTATG CTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGT. Therefore, it becomes apparent that the pipeline linking grep and wc was merely counting the number of lines in which the sequence atg appeared, not the total number of times the pattern atg was present in the file. Conversely, the XMLPipeDB Match Utility counts the number of times the three-character pattern itself is present in the file.
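The difference between counting matching lines and counting individual matches can be reproduced on a small file. The sketch below uses made-up data rather than the chromosome file, and the helper names count_lines and count_occurrences are assumptions for illustration. It uses grep -o, which prints each match on its own line, as one way to get a per-occurrence count from grep.

```shell
# Build a small made-up sample file (hypothetical data, not the chromosome file).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'CCATGAAATGTT\nGGGGCCCC\nTTATGCC\n' > "$sample"

# count_lines: number of LINES containing the pattern (what grep | wc reports).
count_lines() {
  grep "$1" "$2" | wc -l | tr -d ' '
}

# count_occurrences: number of individual matches; grep -o prints each
# non-overlapping match on its own line, so wc -l counts occurrences.
count_occurrences() {
  grep -o "$1" "$2" | wc -l | tr -d ' '
}

count_lines ATG "$sample"        # prints: 2
count_occurrences ATG "$sample"  # prints: 3
```

The first line of the sample contains ATG twice, so the line count (2) falls below the occurrence count (3), the same kind of gap seen between the grep + wc result and the Match result above.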
Links
- User Page: Brandon Klein
- Team Page: The Class Whoopers