Anuvarsh Week 3
Latest revision as of 08:40, 20 September 2015
The Genetic Code, by Computer
Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.
I did so by performing the following command:
ssh avarshne@my.cs.lmu.edu
and then inputting my password.
I then created a folder for this class.
mkdir biodb2015
I then created a sequence_file.txt file:
echo 'agcggtatac' >sequence_file.txt
I then moved to Dondi's repository and copied over some files using the following commands:
cd ~dondi/xmlpipedb/data
cp genetic-code.sed ~avarshne/biodb2015
cp xmlpipedb-match-1.1.1.jar ~avarshne/biodb2015
cp prokaryote.txt ~avarshne/biodb2015
cp infA-E.coli-K12.txt ~avarshne/biodb2015
cp 493.P_falciparum.xml ~avarshne/biodb2015
cp hs_ref_GRCh37_chr19.fa ~avarshne/biodb2015
Complement of a Strand
Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:
cat sequence_file | ?????
For example, if sequence_file contains:
agcggtatac
Then your text processing commands should display:
tcgccatatg
To do this, I first worked out the steps the computer needs to perform, in order:
- The contents of sequence_file.txt must be printed with cat so that the next command in the pipe can work on the text within that file.
- Replace each A, T, C, and G with its complementary base (T, A, G, and C, respectively).
These steps can be achieved with the following command, which produces the following result:
cat "sequence_file.txt" | sed "y/atcg/tagc/"
tcgccatatg
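The same complement step can also be written with tr, which does the same kind of character-for-character mapping as sed's y command. The following is only a sketch of an alternative, not the command I submitted; the function name complement is made up for illustration, and the upper-case handling is an addition that the homework did not ask for.

```shell
# Hypothetical helper: complement a DNA strand read from standard input.
complement() {
    # tr maps each character in the first set to the character at the
    # same position in the second set, so a->t, t->a, c->g, g->c.
    # Upper-case bases are mapped too (an assumption beyond the homework).
    tr 'atcgATCG' 'tagcTAGC'
}

# Usage: feed it a sequence on standard input.
printf 'agcggtatac\n' | complement   # prints tcgccatatg
```

With the file from earlier, this would be run as complement < sequence_file.txt, which also avoids the separate cat.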
Reading Frames
Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:
cat sequence_file | ?????
In this case, the steps that the computer needs to complete are as follows:
- Print the contents of sequence_file.txt so they can be piped to the next command.
- cat "sequence_file.txt"
- Replace every "t" with "u" when finding the +1, +2, and +3 protein sequences. For the -1, -2, and -3 sequences, we must instead create the complementary strand as RNA, replacing each A, T, C, and G with its complementary RNA base (U, A, G, and C, respectively), and then reverse the strand.
- sed "s/t/u/g"
- sed "y/atcg/uagc/" | rev
- Remove any necessary bases from the beginning of the sequence in order to start at the correct reading frame.
- either not applicable, sed "s/^.//g", or sed "s/^..//g"
- Add a space after every codon (every 3 characters).
- sed "s/.../& /g"
- Use the sed commands already written in the genetic-code.sed file to convert each codon into its corresponding amino acid.
- sed -f genetic-code.sed
- Remove all added spaces between codons.
- sed "s/ //g"
- Remove any leftover bases that weren't part of a complete codon and so couldn't be translated.
- sed "s/[aucg]//g"
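The complement step for the -1, -2, and -3 frames depends on sed's y command (transliterate) rather than its s command (substitute). As a small sketch of the difference, with the two function names made up for illustration:

```shell
# Literal-string substitution: only a run spelled exactly "atcg" is replaced.
substitute_string() { sed "s/atcg/uagc/g"; }

# Transliteration: each a, t, c, and g is mapped to u, a, g, and c one by one.
transliterate() { sed "y/atcg/uagc/"; }

printf 'agcggtatac\n' | substitute_string   # prints agcggtatac (unchanged)
printf 'agcggtatac\n' | transliterate       # prints ucgccauaug
```

The s version leaves this sequence untouched because the four letters "atcg" never appear together in it, which is why the reading-frame commands use y.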
Because we are looking at 6 different reading frames on that fragment of DNA, 6 different commands need to be written, one for each protein sequence. Each of the following commands represents one reading frame, and is followed by the resulting protein sequence.
+1
cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
SGI
+2
cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
AVY
+3
cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
RY
-1
cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
VYR
-2
cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
YTA
-3
cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
IP
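Since the six commands differ only in the strand and in how many leading bases are dropped, they could be folded into one parameterized pipeline. This is a rough sketch rather than what I actually ran: the function names translate_frame and all_frames are made up, cut -c is swapped in for the sed "s/^.//g" step, and it assumes genetic-code.sed holds rules that turn lower-case codons into upper-case amino acid letters, as the pipeline above implies.

```shell
# Hypothetical helper: translate one reading frame of the sequence on stdin.
# Usage: translate_frame STRAND OFFSET [SED_SCRIPT]
#   STRAND is + or -, OFFSET is 1, 2, or 3.
translate_frame() {
    tf_strand=$1
    tf_offset=$2
    tf_code=${3:-genetic-code.sed}   # assumed codon-to-amino-acid rules
    if [ "$tf_strand" = "-" ]; then
        sed "y/atcg/uagc/" | rev     # complement as RNA, then reverse
    else
        sed "s/t/u/g"                # coding strand: only t -> u
    fi |
    cut -c "$tf_offset"- |           # drop OFFSET-1 leading bases
    sed "s/.../& /g" |               # space after every complete codon
    sed -f "$tf_code" |              # codons -> amino acid letters
    sed "s/ //g" |                   # remove the added spaces
    sed "s/[aucg]//g"                # remove leftover untranslated bases
}

# Hypothetical helper: print all six frames for one sequence file.
# Usage: all_frames SEQUENCE_FILE [SED_SCRIPT]
all_frames() {
    for strand in + -; do
        for offset in 1 2 3; do
            printf '%s%s: ' "$strand" "$offset"
            translate_frame "$strand" "$offset" "${2:-genetic-code.sed}" < "$1"
        done
    done
}
```

Run from the biodb2015 folder, all_frames sequence_file.txt should list the same six protein sequences as the individual commands above, one per line.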
Check Your Work
I checked my work with the ExPASy Translate Tool. I input the RNA form of my original DNA strand (agcgguauac), and received the following results from the translator:
- 5'3' Frame 1: S G I
- 5'3' Frame 2: A V Y
- 5'3' Frame 3: R Y
- 3'5' Frame 1: V Y R
- 3'5' Frame 2: Y T A
- 3'5' Frame 3: I P
XMLPipeDB Match Practice
For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:
- What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
- java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
- How many unique matches are there?
- 3
- How many times does each unique match appear?
- go:0007: 113
- go:0006: 1100
- go:0005: 1371
- Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
- <dbReference type="GO" id="GO:0007010">
- Describe how you did this.
- grep "GO:000[567]" 493.P_falciparum.xml
- Based on where you find this occurrence, what kind of information does this pattern represent?
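- The pattern "GO:000[567]" matches the id attribute of a dbReference element of type GO, so it appears to represent Gene Ontology (GO) term IDs, specifically those beginning with GO:0005, GO:0006, or GO:0007.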
- What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
- java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
- How many unique matches are there?
- 3
- How many times does each unique match appear?
- "yu b.": 1
- "yu k.": 228
- "yu m.": 1
- What information do you think this pattern represents?
- I think this pattern represents people's names.
- Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
- java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
- grep "ATG" hs_ref_GRCh37_chr19.fa | wc
- What answer does Match give you?
- Total matches: atg: 830101
- Total unique matches: 1
- What answer does grep + wc give you?
- Lines: 502410
- Words: 502410
- Characters: 35671048
- Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
- Match reports the total number of times the pattern was found within the file: within hs_ref_GRCh37_chr19.fa, "ATG" appears 830,101 times. Grep, by contrast, checks each line of the file and prints every line that contains at least one instance of the pattern, no matter how many instances that line holds. Wc then reports 3 statistics for grep's output: line count, word count, and character count, respectively. So grep found 502,410 lines that contain "ATG". Because there are no spaces within a line of sequence, each line was counted as one word, so wc also reported 502,410 words, along with a total of 35,671,048 characters across those lines. The counts differ because a single line can contain "ATG" more than once: Match counts every occurrence, while grep + wc counts each matching line only once.
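The difference between counting lines and counting occurrences can be seen on a tiny example. This is only a sketch: the three-line sample below is made up and stands in for hs_ref_GRCh37_chr19.fa, the function names are hypothetical, and grep's -o option is a GNU/BSD extension rather than part of POSIX. It is not a claim about how Match itself counts; for instance, it does not settle whether Match treats upper- and lower-case letters alike.

```shell
# Made-up three-line sample standing in for the real chromosome file.
sample=$(mktemp)
printf 'ATGCCATGA\nCCCGGG\nTTATGTT\n' > "$sample"

# Lines containing at least one ATG (the first number from grep | wc).
count_lines() { grep -c "ATG" "$1"; }

# Individual occurrences: -o prints each match on its own line,
# so counting those lines counts matches rather than matching lines.
count_matches() { grep -o "ATG" "$1" | wc -l | tr -d ' '; }

count_lines "$sample"     # prints 2
count_matches "$sample"   # prints 3
```

The first sample line holds two ATGs, so it adds 1 to the line count but 2 to the match count, which is the same effect that separates 502,410 from 830,101.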
Other Links
User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS