Difference between revisions of "Nanguiano Week 3"
Revision as of 23:42, 15 September 2015
The Genetic Code, by Computer
Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.
For this exercise, I performed the following series of commands to prepare for the assignment.
ssh my.cs.lmu.edu -l nanguia1
mkdir biodb
cat >"sequence_file.txt"
agcggtatac
cd biodb
mkdir week3
mv ~/sequence_file.txt week3
cd ~dondi/xmlpipedb/data
cp genetic-code.sed ~nanguia1/biodb/week3
cp xmlpipedb-match-1.1.1.jar ~nanguia1/biodb/week3
cp 493.P_falciparum.xml ~nanguia1/biodb/week3
cp hs_ref_GRCh37_chr19.fa ~nanguia1/biodb/week3
cd ~nanguia1/biodb/week3
Complement of a Strand
Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand.
On a sequence_file.txt file containing the sequence "agcggtatac", the command and its output were as follows:
cat sequence_file.txt | sed "y/atgc/tacg/"
tcgccatatg
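The command above returns the complement in the same left-to-right order as the input. As a rough sketch of how that relates to the reverse complement used later for the negative reading frames, the following compares the two on the same sequence. The function names complement and reverse_complement are illustrative only and are not part of the assignment; lowercase input is assumed.

```shell
# Complement only: same mapping as sed "y/atgc/tacg/" (a<->t, g<->c).
complement() {
  tr 'atgc' 'tacg'
}

# Reverse complement: complement each base, then reverse the line
# so that the new strand is read in its own 5' to 3' direction.
reverse_complement() {
  complement | rev
}

echo "agcggtatac" | complement          # prints tcgccatatg
echo "agcggtatac" | reverse_complement  # prints gtataccgct
```

The first result matches the output shown above; the second is the strand that the -1, -2, and -3 reading frames start from.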
Reading Frames
Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. You should have 6 different sets of commands, one for each possible reading frame.
On a sequence_file.txt containing the sequence "agcggtatac", the commands and their outputs were as follows:
+1
cat sequence_file.txt | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
SGI
+2
cat sequence_file.txt | sed "s/^.//g" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
AVY
+3
cat sequence_file.txt | sed "s/^..//g" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
RY
The remaining three commands were split across two lines on the wiki because they did not fit on one line without causing display problems. Each command was actually entered as a single line.
-1
cat sequence_file.txt | sed "y/acgt/tgca/" | rev | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
VYR
-2
cat sequence_file.txt | sed "y/acgt/tgca/" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
YTA
-3
cat sequence_file.txt | sed "y/acgt/tgca/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[acgu]//g"
IP
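The six commands differ only in whether the strand is first reverse-complemented and in how many leading bases are dropped. A minimal sketch of that shared structure is given below. It stops at the codon-splitting step, because genetic-code.sed is not reproduced on this page; the helper name codons is hypothetical, and lowercase DNA input is assumed.

```shell
# Print the complete codons of one reading frame.
# $1 is the frame number (1, 2, or 3); the sequence is read from standard input.
codons() {
  # cut drops the first ($1 - 1) bases, the first sed groups bases in threes,
  # and the second sed removes the trailing space plus any incomplete codon.
  cut -c"$1"- | sed 's/.../& /g' | sed 's/ [acgt]\{0,2\}$//'
}

seq="agcggtatac"

# Forward frames: read the sequence as given.
for frame in 1 2 3; do
  printf '%s%s: %s\n' '+' "$frame" "$(echo "$seq" | codons "$frame")"
done

# Reverse frames: take the reverse complement first, then split the same way.
for frame in 1 2 3; do
  printf '%s%s: %s\n' '-' "$frame" "$(echo "$seq" | tr 'acgt' 'tgca' | rev | codons "$frame")"
done
```

For the sample sequence this prints the codon groups agc ggt ata, gcg gta tac, cgg tat, gta tac cgc, tat acc gct, and ata ccg, which correspond to the translations SGI, AVY, RY, VYR, YTA, and IP listed above.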
Check Your Work
Using the ExPASy Translate Tool, I entered my sample DNA sequence, "agcggtatac". The result was as follows:
XMLPipeDB Match Practice
For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:
Note: I used this wiki page to learn about the match utility.
- What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
  - java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
  - How many unique matches are there?
    - 3
  - How many times does each unique match appear?
    - GO:0007 : 113
    - GO:0006 : 1100
    - GO:0005 : 1371
  - Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    - One example was: <dbReference type="GO" id="GO:0005622">
    - Describe how you did this.
      - grep "GO:000[567]" 493.P_falciparum.xml | more
  - Based on where you find this occurrence, what kind of information does this pattern represent?
    - Based on where I found it (inside a dbReference element of type "GO"), this pattern is the beginning of a gene ontology ID associated with a particular gene in the database.
- What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
  - java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
  - How many unique matches are there?
    - 3
  - How many times does each unique match appear?
    - "Yu b." : 1
    - "Yu k." : 228
    - "Yu m." : 1
  - What information do you think this pattern represents?
    - I believe this pattern represents a name.
    - This was confirmed by running the command grep "Yu.*" 493.P_falciparum.xml
- Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
  - What answer does Match give you?
    - java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
    - Total unique matches: 1
    - Number of matches: 830101
  - What answer does grep + wc give you?
    - grep "ATG" hs_ref_GRCh37_chr19.fa | wc
    - Lines: 502410
    - Words: 502410
    - Characters: 35671048
  - Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
    - Match counts every occurrence of the three letters ATG throughout the entire file, so ATG appears 830,101 times. Grep, on the other hand, prints each line that contains ATG at least once, no matter how many times ATG occurs on that line, and wc then counts those lines. ATG appears at least once on 502,410 lines. Because none of those lines contains spaces, wc counts each line as one word, which gives 502,410 words. Those 502,410 lines contain 35,671,048 characters in total.
    - To illustrate that this is the case, I ran a few experiments. Running grep -v "ATG" hs_ref_GRCh37_chr19.fa | wc, which counts the lines that do not contain ATG, returns this output:
      - Lines: 299182
      - Words: 299244
      - Characters: 21242050
    - Running wc hs_ref_GRCh37_chr19.fa returns the following output:
      - Lines: 801592
      - Words: 801654
      - Characters: 56913098
    - Adding up the results of grep "ATG" hs_ref_GRCh37_chr19.fa | wc and grep -v "ATG" hs_ref_GRCh37_chr19.fa | wc gives the following totals, which equal the wc output for the whole file and confirm my explanation of what grep is counting:
      - Lines: 801592
      - Words: 801654
      - Characters: 56913098
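The difference between counting lines and counting occurrences in the last question can be reproduced on a small scale. The following is a sketch on a made-up three-line sample, not on the chromosome 19 file; the sample data and variable names are assumptions for illustration. It uses grep -o, which prints each match on its own line, as a stand-in for the per-occurrence tally that Match reports.

```shell
# Build a small stand-in file (hypothetical data, three lines).
sample=$(mktemp)
printf 'ATGCATGA\nCCGGTTAA\nTTATGATGATG\n' > "$sample"

# Lines that contain ATG at least once (the "Lines" figure from grep | wc).
lines_with=$(grep -c 'ATG' "$sample")

# Every occurrence of ATG, one per output line, then counted.
occurrences=$(grep -o 'ATG' "$sample" | wc -l | tr -d ' ')

# Lines without ATG; lines_with + lines_without must equal the file's line count.
lines_without=$(grep -vc 'ATG' "$sample")
total_lines=$(wc -l < "$sample" | tr -d ' ')

echo "lines with ATG:    $lines_with"      # 2
echo "occurrences:       $occurrences"     # 5
echo "lines without ATG: $lines_without"   # 1
echo "total lines:       $total_lines"     # 3

rm -f "$sample"
```

Here two lines contain ATG but the pattern occurs five times, which is the same kind of gap as 502,410 lines versus 830,101 matches, and the two line counts add up to the file total just as in the grep and grep -v experiment.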
Links
 Nicole Anguiano
 BIOL 367, Fall 2015


