Difference between revisions of "Bklein7 Week 3"
From LMU BioDB 2015
(Added a links section and the journal entry category)
(Merged the electronic notebook entry from my user page with this page)
Revision as of 00:33, 21 September 2015
The Genetic Code, by Computer
Complement of a Strand
Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand.
- Begin with a cat command to view a text file with a DNA sequence
- After the pipe character, link together a sed command that invokes a letter by letter replacement of A, T, C, and G with their complementary bases (T, A, G, and C in that order).
cat sequence_file | sed "y/atcg/tagc/"
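As a quick check, the same command can be run on an inline sequence instead of a file. This is only a minimal sketch: the function name `complement` and the sample sequence are illustrative assumptions, not part of the assignment.

```shell
#!/bin/sh
# complement: hypothetical helper wrapping the sed command above.
# Reads a lowercase DNA sequence from stdin and swaps each base for its complement.
complement() {
    sed "y/atcg/tagc/"
}

# Sample sequence supplied inline instead of via "cat sequence_file".
printf 'aattccgg\n' | complement   # prints ttaaggcc
```

The `y` command is case-sensitive, so an uppercase sequence would pass through unchanged with this pattern.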
Reading Frames
Building the Command Sequences
Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence.
- To tackle this problem, start by crafting the simplest sequence of commands that carries out the central dogma by translating a DNA sequence into an amino acid sequence. This command sequence will serve as the "backbone" from which all 6 reading frame sequences are built. Incidentally, this code is also exactly what is needed to translate the +1 reading frame.
- Begin with a cat command to view a text file with a DNA sequence
- cat sequence_file
- Add a sed command representing the transcription of DNA to RNA, replacing "T" nucleotides with "U" nucleotides
- cat sequence_file | sed "s/t/u/g"
- Add a sed command representing the (+1) reading frame for the codons, inserting a space after every three characters
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g"
- Add a sed command to translate each codon to its associated amino acid. For this exercise, the simplest way to do this is to read from the rules file ~dondi/xmlpipedb/data/genetic-code.sed
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed
- Add a sed command to remove lingering nucleotides at the beginning and/or end of the output that were not translated due to not being present in a triplet in the present reading frame. This will prevent confusing these untranslated nucleotides with amino acids
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g"
- Finally, remove spaces from in between the separate amino acid designations to condense the output (has both aesthetic and practical purposes). This will yield the final command sequence:
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- +1 Reading Frame
- The command sequence above already processes the genetic code within the +1 reading frame:
cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
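The pipeline can be tried without access to the course server by substituting a tiny codon table for the rules file. The sketch below is an assumption-laden illustration: `mini_code` is a hypothetical stand-in for `sed -f genetic-code.sed` that knows only three codons, and the format of the real rules file may differ.

```shell
#!/bin/sh
# mini_code: hypothetical stand-in for "sed -f genetic-code.sed".
# It covers only the three codons used in this example.
mini_code() {
    sed -e "s/aug/M/g" -e "s/gcu/A/g" -e "s/uaa/-/g"
}

# translate_plus1: the +1 pipeline from above, reading the sequence from stdin.
translate_plus1() {
    sed "s/t/u/g" |        # transcribe: t -> u
    sed "s/.../& /g" |     # split into codons
    mini_code |            # codon -> amino acid letter
    sed "s/[atcg]//g" |    # drop leftover nucleotides
    sed "s/ //g"           # close up the spaces
}

# 11 nucleotides: three full codons plus a leftover "ga".
printf 'atggcttaaga\n' | translate_plus1   # prints MA-
```

The leftover `ga` is removed by the `[atcg]` step, so only the three translated codons remain in the output.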
- +2 Reading Frame
- Simply alter the +1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:
cat sequence_file | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- +3 Reading Frame
- Alter the +1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:
cat sequence_file | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
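To see how the deletions shift the reading frame, the codon grouping can be printed for each positive frame on a short sequence. This is a minimal sketch; the helper name `codons` and the sample sequence are assumptions made for illustration.

```shell
#!/bin/sh
# codons: hypothetical helper that drops the first N characters (0, 1, or 2)
# and then groups the remainder into triplets, as in the +1, +2, and +3 commands.
codons() {
    case "$1" in
        0) sed "s/.../& /g" ;;
        1) sed "s/^.//" | sed "s/.../& /g" ;;
        2) sed "s/^..//" | sed "s/.../& /g" ;;
    esac | sed "s/ $//"   # trim the trailing space left after the last full triplet
}

dna='atggcttaag'
printf '%s\n' "$dna" | codons 0   # +1 frame: atg gct taa g
printf '%s\n' "$dna" | codons 1   # +2 frame: tgg ctt aag
printf '%s\n' "$dna" | codons 2   # +3 frame: ggc tta ag
```

Because `^` anchors the match to the start of the line, the `g` flag in `s/^.//g` has no additional effect; the sketch omits it.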
- -1 Reading Frame
- The "backbone" command sequence from above will have to be slightly altered for the negative reading frames to account for the reading of a complementary, antiparallel strand of DNA
- Begin with the code from the +1 reading frame
- cat sequence_file | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- Add the sed command from the "Complement of a Strand" section above at the beginning of this sequence. This will yield the DNA sequence complementary to the one we were originally working with
- cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- Add a rev command after the sequence has been transcribed to the complementary mRNA. This accounts for the fact that the genetic code is read in the 5' to 3' direction. With this final change, the command sequence will now translate the DNA sequence within the -1 reading frame:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- -2 Reading Frame
- As in the positive reading frames, alter the -1 reading frame code to include a sed command for the deletion of the first character of the text file (i.e. nucleotide) prior to division into triplets:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
- -3 Reading Frame
- Alter the -1 reading frame code to include a sed command for the deletion of the first two characters of the text file (i.e. nucleotides) prior to division into triplets:
cat sequence_file | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
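The two steps that distinguish the negative frames, complementing and reversing, can be checked on their own with a short sequence. This is a sketch under stated assumptions: `revcomp` is a hypothetical helper name, the sample sequence is made up, and `rev` is assumed to be installed (it ships with util-linux and with macOS).

```shell
#!/bin/sh
# revcomp: hypothetical helper combining the complement and rev steps
# used by the negative reading frames.
revcomp() {
    sed "y/atcg/tagc/" | rev
}

# Strand 5'-atggct-3'; its reverse complement read 5' to 3' is agccat.
printf 'atggct\n' | revcomp

# Codon grouping for the -2 frame: drop one base after reversing, then split.
printf 'atggct\n' | revcomp | sed "s/^.//" | sed "s/.../& /g"   # prints gcc at
```

The complement and reversal steps do not depend on each other, so they can be applied in either order before the sequence is split into codons.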
Testing the Command Sequences
Prior to testing any commands, I entered the directory ~dondi/xmlpipedb/data to access the files present there.
- +1 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
- +2 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
- +3 Reading Frame
cat prokaryote.txt | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
- -1 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: VKMPPAELAAGLYAEHGIRQRFKENIVFFGHRTY-NIV
- -2 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
- -3 Reading Frame
cat prokaryote.txt | sed "y/atcg/tagc/" | sed "s/t/u/g" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[atcg]//g" | sed "s/ //g"
Output: -NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
Check Your Work
The ExPASy Translate Tool was used to confirm that the output sequences generated above were accurate translations. Below is the output generated by ExPASy for the same nucleotide sequence adapted from prokaryote.txt.
XMLPipeDB Match Practice
To begin this assignment, I accessed the directory ~dondi/xmlpipedb/data.
- What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
java -jar xmlpipedb-match-1.1.1.jar GO:000[567] <493.P_falciparum.xml
go:0007: 113
go:0006: 1100
go:0005: 1371
Total unique matches: 3
- How many unique matches are there?
- Referencing the output above, there are 3 unique matches.
- How many times does each unique match appear?
- The unique match go:0005 appeared 1371 times. The unique match go:0006 appeared 1100 times. The unique match go:0007 appeared 113 times.
- Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
- To find occurrences of the patterns matched above in their original positions within the file 493.P_falciparum.xml, we can use the grep command. This command searches for a given pattern and prints each line that contains it, showing the match "in situ".
- The command I used to do this and the first lines of its output are listed below:
grep "GO:0005" 493.P_falciparum.xml
<dbReference type="GO" id="GO:0005884">
<dbReference type="GO" id="GO:0005737">
- Based on where you find this occurrence, what kind of information does this pattern represent?
- This pattern appears to represent an ID number under the general category of "GO". A quick skim of the file using the more command reveals many nucleotide sequences and references to genes. Together with a quick Google search, this led me to deduce that "GO" stands for "Gene Ontology" and that these IDs are unique identifiers for Gene Ontology terms associated with the genes. This was confirmed by referencing the wiki page on Using the XMLPipeDB Match Utility.
- What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
- The necessary command and its output are listed below:
java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" <493.P_falciparum.xml
"yu b.": 1
"yu k.": 228
"yu m.": 1
Total unique matches: 3
- How many unique matches are there?
- Referencing the above output, there are 3 unique matches.
- How many times does each unique match appear?
- The unique match "yu b." occurred 1 time. The unique match "yu k." occurred 228 times. The unique match "yu m." occurred 1 time.
- What information do you think this pattern represents?
- This information appears to represent the names of different article authors as would be referenced in an APA citation (last name, abbreviated first name).
- This deduction was confirmed using the command grep "Yu.*" 493.P_falciparum.xml, which yielded outputs such as <person name="Yu K."/>.
- Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
- What answer does Match give you?
- The command java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa yielded the output atg:830101. Thus, Match found the nucleotide sequence atg 830101 times in the above file.
- What answer does grep + wc give you?
- The command grep "ATG" hs_ref_GRCh37_chr19.fa | wc yielded the output 502410 502410 35671048. The first and second output values, representing the line count and word count respectively, both suggest that the sequence atg exists 502410 times in the file. This is less than the count reported by the XMLPipeDB Match Utility. As an aside, it seems strange that the line and word counts are exactly the same, as this sequence could exist more than once per line, particularly when it occurs so many times in the file.
- Explain why the counts are different.
- Piggybacking on the aside above, I decided to see if the sequence atg existed more than once in a single line of the grep output. Therefore, I used the command sequence grep "ATG" hs_ref_GRCh37_chr19.fa | more to page through the grep output. In the very first output line, I identified more than one occurrence of the sequence "atg": GGGACAGGCCCTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGT. It therefore becomes apparent that the pipeline linking grep and wc was merely counting the number of lines in which the sequence atg appeared, not the total number of times the pattern atg was present in the file. Conversely, the XMLPipeDB Match Utility counts the number of times the three-character pattern itself is present in the file.
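The difference between counting lines and counting occurrences can be reproduced on a single line. The sketch below assumes the line quoted above is one contiguous 70-character FASTA line, and it uses `grep -o`, which prints each match on its own line; the function names are hypothetical, and this is an illustration rather than the method used in the assignment.

```shell
#!/bin/sh
# The line quoted above, supplied inline so the example is self-contained.
line='GGGACAGGCCCTATGCTGCCACCTGTACATGCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGT'

# grep prints each matching line once, so wc -l counts lines, not occurrences.
count_lines() { printf '%s\n' "$line" | grep "ATG" | wc -l | tr -d ' '; }

# grep -o prints every match on its own line, so wc -l counts occurrences.
count_matches() { printf '%s\n' "$line" | grep -o "ATG" | wc -l | tr -d ' '; }

echo "matching lines: $(count_lines)"
echo "occurrences:    $(count_matches)"
```

On this input the second count is larger than the first, which mirrors the gap between the grep + wc result and the Match result. Note that grep works line by line, so it would still miss any ATG that spans a line break in the file.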
Links
- User Page: Brandon Klein
- Team Page: The Class Whoopers
Assignments Pages
- Week 1 Assignment
- Week 2 Assignment
- Week 3 Assignment
- Week 4 Assignment
- Week 5 Assignment
- Week 6 Assignment
- Week 7 Assignment
- Week 8 Assignment
- Week 9 Assignment
- Week 10 Assignment
- Week 11 Assignment
- Week 12 Assignment
- No Week 13 Assignment
- Week 14 Assignment
- Week 15 Assignment
Individual Journal Entries
- Week 1 Individual Journal
- Week 2 Individual Journal
- Week 3 Individual Journal
- Week 4 Individual Journal
- Week 5 Individual Journal
- Week 6 Individual Journal
- Week 7 Individual Journal
- Week 8 Individual Journal
- Week 9 Individual Journal
- Week 10 Individual Journal
- Week 11 Individual Journal
- Week 12 Individual Journal
- No Week 13 Journal
- Week 14 Individual Journal
- Week 15 Individual Journal
- Week 1 Class Journal
- Week 2 Class Journal
- Week 3 Class Journal
- Week 4 Class Journal
- Week 5 Class Journal
- Week 6 Class Journal
- Week 7 Class Journal
- Week 8 Class Journal
- Week 9 Class Journal
- Week 10 Team Journal
- Week 11 Team Journal
- Week 12 Team Journal
- No Week 13 Journal
- Week 14 Team Journal
- Week 15 Team Journal