Difference between revisions of "Msaeedi23 Week 3"

From LMU BioDB 2015

Latest revision as of 04:31, 22 September 2015

The Genetic Code, by Computer

Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.

For these exercises, two files are available in the Keck lab system for practice; of course, you can always make your own sequences up. The practice files are ~dondi/xmlpipedb/data/prokaryote.txt and ~dondi/xmlpipedb/data/infA-E.coli-K12.txt.


Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:

  Implemented the command: cat "sequence_file" | sed "y/atcg/tagc/"
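As a quick check of the command above, the following sketch feeds a made-up sequence through the same sed transliteration. The function name complement and the sample input are assumptions chosen for illustration; they are not part of the assignment files.

```shell
#!/bin/sh
# complement: hypothetical helper that reads a lowercase nucleotide
# sequence on stdin and writes its complementary strand.
complement() {
    # y/atcg/tagc/ maps a->t, t->a, c->g, g->c, one character at a time
    sed "y/atcg/tagc/"
}

# Sample input (made up for this example)
printf 'agcggtatac\n' | complement
```

For the sample input agcggtatac this prints tcgccatatg. Note that the y command only touches the listed lowercase letters, so uppercase input would pass through unchanged.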

Reading Frames

Write 6 sets of text processing commands that, when given a nucleotide sequence, return the resulting amino acid sequence, one set for each possible reading frame of the nucleotide sequence. In other words, fill in the question marks:

   cat sequence_file | ?????

You should have 6 different sets of commands, one for each possible reading frame. For example, if sequence_file contains:

   agcggtatac

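Frame +1: The goal is to separate the sequence into groups of 3 nucleotides, replace each t with u, translate each group with the rules in genetic-code.sed, and then delete any leftover nucleotides and the spaces between amino acids.

cat "sequence_file.txt" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"

Frame +2: Delete the first character of each input line, then proceed as in frame +1.

cat "sequence_file.txt" | sed "s/^.//" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"

Frame +3: Delete the first two characters of each input line, then proceed as in frame +1.

cat "sequence_file.txt" | sed "s/^..//" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"

Frame -1: Take the complement of the sequence and reverse it, then use the same commands as for the + frames.

cat "sequence_file.txt" | sed "y/acgt/tgca/" | rev | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"

Frame -2:

cat "sequence_file.txt" | sed "y/acgt/tgca/" | rev | sed "s/^.//" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"

Frame -3:

cat "sequence_file.txt" | sed "y/acgt/tgca/" | rev | sed "s/^..//" | sed "s/.../& /g" | sed "s/t/u/g" | sed -f genetic-code.sed | sed "s/[acgu]//g" | sed "s/ //g"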
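The six pipelines differ only in whether the strand is reverse-complemented and in how many leading bases are dropped, so they can be folded into one helper. The sketch below is one possible way to do that, not the assignment's required form. It assumes that genetic-code.sed holds one rule per codon in the form s/ugc/C/g, and it builds a small stand-in rule file covering only the codons in the sample sequence, so that it runs without access to the Keck lab files. The names rules, translate, and frame are hypothetical.

```shell
#!/bin/sh
# Stand-in for genetic-code.sed (assumption: the real file has one
# rule per codon, such as s/ugc/C/g). Only the codons that occur in
# the sample sequence are listed here.
rules=$(mktemp)
trap 'rm -f "$rules"' EXIT
cat > "$rules" <<'EOF'
s/agc/S/g
s/ggu/G/g
s/aua/I/g
s/gcg/A/g
s/gua/V/g
s/uac/Y/g
s/cgg/R/g
s/uau/Y/g
s/cgc/R/g
s/acc/T/g
s/gcu/A/g
s/ccg/P/g
EOF

# translate: group into triplets, convert t to u, apply the codon
# rules, then delete leftover bases and the separating spaces.
translate() {
    sed "s/.../& /g" | sed "s/t/u/g" | sed -f "$rules" |
        sed "s/[acgu]//g" | sed "s/ //g"
}

# frame N: N is one of +1, +2, +3, -1, -2, -3
frame() {
    skip=${1#[+-]}        # strip the sign, leaving 1, 2 or 3
    skip=$((skip - 1))    # number of leading bases to drop
    case "$1" in
        -*) sed "y/acgt/tgca/" | rev ;;   # reverse complement
        *)  cat ;;                        # forward strand as-is
    esac | sed "s/^.\{$skip\}//" | translate
}

# Print all six frames for the sample sequence
for f in +1 +2 +3 -1 -2 -3; do
    printf 'Frame %s: ' "$f"
    printf 'agcggtatac\n' | frame "$f"
done
```

For the sample sequence agcggtatac, the six frames come out as SGI, AVY, RY, VYR, YTA, and IP. To run this against the real rule file, point rules at genetic-code.sed instead of the temporary file.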

Check Your Work

Fortunately, online tools are available for checking your work; we recommend the ExPASy Translate Tool, sponsored by the same people who run SwissProt. You’re free to use this tool to see if your text processing commands produce the same results.

XMLPipeDB Match Practice

For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:

  1. What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar "GO:000[567]" < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • GO:0007 - 113
      • GO:0006 - 1100
      • GO:0005 - 1371
  2. Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    • example: <dbReference type="GO" id="GO:0007264">
    • Describe how you did this.
      • I searched the file with grep and paged through the results with more.
    • Based on where you find this occurrence, what kind of information does this pattern represent?
      • The pattern represents Gene Ontology (GO) term IDs that are referenced by gene entries in this database.
  3. What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • "Yu b." - 1
      • "Yu k." - 228
      • "Yu m." - 1
    • What information do you think this pattern represents?
      • This pattern appears to represent a person's name, such as an author: each match is "Yu" followed by an initial.
  4. Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
    • What answer does Match give you?
      • Using java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa, Match reported:
      • unique matches: 1
      • total matches: 830101
    • What answer does grep + wc give you?
      • Piping grep into wc (for example, grep ATG hs_ref_GRCh37_chr19.fa | wc) reported:
      • lines: 502410
      • words: 502410
      • characters: 35671048
    • Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
      • Match searches for the three-letter pattern ATG and counts every time it appears in the entire file, which gave 830,101 occurrences. grep, on the other hand, outputs each line in which the pattern appears at least once, and wc then counts those lines, so ATG was found at least once on 502,410 lines. A line that contains ATG several times is still counted only once, which is why this number is lower. The word count equals the line count because there are no spaces within a line, so each line is a single word. Finally, the matching lines contain 35,671,048 characters in total.
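The difference between counting lines and counting occurrences can be reproduced on a tiny made-up file. The sketch below is an illustration only: the sample data is invented, and grep -o (a GNU and BSD grep option that prints each match on its own line) is used here as a stand-in for Match's per-occurrence count, not as a description of how Match works internally.

```shell
#!/bin/sh
# Tiny made-up stand-in for a FASTA file (not real chromosome data)
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
ATGCCATGAA
CCCGGGTTTA
TTATGATGATG
EOF

# grep prints each matching line once, so wc -l counts lines
line_count=$(grep "ATG" "$sample" | wc -l | tr -d ' ')

# grep -o prints each match on its own line, so wc -l counts occurrences
match_count=$(grep -o "ATG" "$sample" | wc -l | tr -d ' ')

echo "lines containing ATG: $line_count"
echo "occurrences of ATG:   $match_count"
```

Here two lines contain ATG, but the pattern occurs five times in total, which mirrors the gap between the grep + wc result and the Match result above.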

Mahrad Saeedi
