Difference between revisions of "Anuvarsh Week 3"

From LMU BioDB 2015
(Copied Week 3 homework assignment)
 
m (Added template)
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
=== The Genetic Code, by Computer ===
+
= The Genetic Code, by Computer =
  
Connect to the ''my.cs.lmu.edu'' workstation as shown in class and do the following exercises from there.
+
'''Connect to the ''my.cs.lmu.edu'' workstation as shown in class and do the following exercises from there.'''
  
For these exercises, two files are available in the Keck lab system for practice; of course, you can always make your own sequences up. The practice files are ''~dondi/xmlpipedb/data/prokaryote.txt'' and ''~dondi/xmlpipedb/data/infA-E.coli-K12.txt''.
+
I did so by performing the following command:
  
==== Complement of a Strand ====
+
    ssh avarshne@my.cs.lmu.edu
  
Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:
+
and then inputting my password.
 +
 
 +
I then created a folder for this class. <!-- found command with a little googling: http://www.computerhope.com/issues/ch000742.htm -->
 +
 
 +
    mkdir biodb2015
 +
 
 +
And created a sequence_file.txt file <!-- taking the homework very literally, found the command here: http://unix.stackexchange.com/questions/159672/how-to-create-a-simple-txt-text-file-using-terminal -->
 +
 
 +
    echo 'agcggtatac' >sequence_file.txt
 +
 
 +
I then moved to Dondi's repository and copied over some files using the following commands:
 +
 
 +
    cd ~dondi/xmlpipedb/data
 +
    cp genetic-code.sed ~avarshne/biodb2015
 +
    cp xmlpipedb-match-1.1.1.jar ~avarshne/biodb2015
 +
    cp prokaryote.txt ~avarshne/biodb2015
 +
    cp infA-E.coli-K12.txt ~avarshne/biodb2015
 +
    cp 493.P_falciparum.xml ~avarshne/biodb2015
 +
    cp hs_ref_GRCh37_chr19.fa ~avarshne/biodb2015
 +
 
 +
 
 +
== Complement of a Strand ==
 +
 
 +
'''Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:
  
 
     cat ''sequence_file'' | '''?????'''
 
     cat ''sequence_file'' | '''?????'''
Line 17: Line 40:
 
Then your text processing commands should display:
 
Then your text processing commands should display:
  
 +
    tcgccatatg'''
 +
 +
To do this, I first worked out the sequence of steps the computer needs to perform.
 +
# Print the contents of sequence_file.txt with cat so that the next command in the pipe can operate on that text.
 +
# Replace each A, T, C, and G with its complementary base (T, A, G, and C).
 +
 +
These steps can be achieved with the following command, which produces the following result:
 +
 +
    cat "sequence_file.txt" | sed "y/atcg/tagc/"
 
     tcgccatatg
 
     tcgccatatg
  
==== Reading Frames ====
+
== Reading Frames ==
  
Write ''6'' sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:
+
'''Write ''6'' sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:'''
  
 
     cat ''sequence_file'' | '''?????'''
 
     cat ''sequence_file'' | '''?????'''
  
You should have 6 different sets of commands, one for each possible reading frame. For example, if ''sequence_file'' contains:
+
In this case, the steps that the computer needs to complete are as follows:
 +
# Print the contents of the sequence_file.txt file.
 +
#* cat "sequence_file.txt"
 +
# Convert the sequence to RNA. For the +1, +2, and +3 frames, replace every "t" with "u". For the -1, -2, and -3 frames, build the complementary RNA strand by replacing each A, T, C, and G with U, A, G, and C, and then reverse it.
 +
#* sed "s/t/u/g"
 +
#* sed "y/atcg/uagc/" | rev
 +
# Remove any necessary bases from the beginning of the sequence in order to start at the correct reading frame.
 +
#* either not applicable, sed "s/^.//g", or sed "s/^..//g"
 +
# Add a space after every codon (every 3 characters). <!-- This may not be necessary depending on the format of the genetic-code.sed file. Come back to this. -->
 +
#* sed "s/.../& /g"
 +
# Use the sed commands already written in the genetic-code.sed file to convert each codon into its corresponding amino acid.
 +
#* sed -f genetic-code.sed <!-- found here: http://www.grymoire.com/Unix/Sed.html#uh-16 -->
 +
# Remove all added spaces between codons.
 +
#* sed "s/ //g"
 +
# Remove any leftover bases that were not part of a complete codon and so could not be translated into an amino acid.
 +
#* sed "s/[aucg]//g"
  
    agcggtatac
+
Because we are looking at 6 different reading frames on that fragment of DNA, 6 different commands need to be written, one for each protein sequence. Each of the following commands represents one reading frame and is followed by the resulting protein sequence.
  
Then your text processing commands for 5’-3’ frame 1 should display:
+
===+1===
  
 +
    cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
 
     SGI
 
     SGI
  
Your text processing commands for 5’-3’ frame 3 should display:
+
===+2===
  
 +
    cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
 +
    AVY
 +
 +
===+3===
 +
 +
    cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
 
     RY
 
     RY
  
...and so on.
+
===-1===
  
* '''Hint 1:''' The 6 sets of commands are very similar to each other.
+
    cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
* '''Hint 2:''' Under the ''~dondi/xmlpipedb/data'' directory in the Keck lab, you will find a file called ''genetic-code.sed''.  To save you some typing, this file has already been prepared with the correct sequence of '''sed''' commands for converting any base triplets into the corresponding amino acid.  For example, this line in that file: <pre>s/ugc/C/g</pre> ...corresponds to a uracil-guanine-cytosine sequence transcribing to the cysteine amino acid (C).  The trick is to figure out how to use this file to your advantage, in the commands that you'll be forming.
+
    VYR
  
==== Check Your Work ====
+
===-2===
  
Fortunately, online tools are available for checking your work; we recommend the ExPASy Translate Tool, sponsored by the same people who run SwissProt. You’re free to use this tool to see if your text processing commands produce the same results.
+
    cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
 +
    YTA
  
=== XMLPipeDB Match Practice ===
+
===-3===
 +
 
 +
    cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
 +
    IP
 +
 
 +
== Check Your Work ==
 +
 
 +
I checked my work with the ExPASy Translate Tool. I entered my original sequence in its RNA form (agcgguauac) and received the following results from the translator:
 +
 
 +
* 5'3' Frame 1: S G I
 +
* 5'3' Frame 2: A V Y
 +
* 5'3' Frame 3: R Y
 +
* 3'5' Frame 1: V Y R
 +
* 3'5' Frame 2: Y T A
 +
* 3'5' Frame 3: I P
 +
 
 +
== XMLPipeDB Match Practice ==
  
 
For your convenience, the XMLPipeDB Match Utility (''xmlpipedb-match-1.1.1.jar'') has been installed in the ''~dondi/xmlpipedb/data'' directory alongside the other practice files. Use this utility to answer the following questions:
 
For your convenience, the XMLPipeDB Match Utility (''xmlpipedb-match-1.1.1.jar'') has been installed in the ''~dondi/xmlpipedb/data'' directory alongside the other practice files. Use this utility to answer the following questions:
  
 
# What Match command tallies the occurrences of the pattern <code>GO:000[567]</code> in the ''493.P_falciparum.xml'' file?
 
# What Match command tallies the occurrences of the pattern <code>GO:000[567]</code> in the ''493.P_falciparum.xml'' file?
 +
#* java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml <!-- found on https://xmlpipedb.cs.lmu.edu/biodb/fall2013/index.php/Using_the_XMLPipeDB_Match_Utility -->
 
#* How many unique matches are there?
 
#* How many unique matches are there?
 +
#** 3
 
#* How many times does each unique match appear?
 
#* How many times does each unique match appear?
 +
#** go:0007: 113
 +
#** go:0006: 1100
 +
#** go:0005: 1371
 
# Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
 
# Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
 +
#* <dbReference type="GO" id="GO:0007010">
 
#* Describe how you did this.
 
#* Describe how you did this.
 +
#** grep "GO:000[567]" 493.P_falciparum.xml
 
#* Based on where you find this occurrence, what kind of information does this pattern represent?
 
#* Based on where you find this occurrence, what kind of information does this pattern represent?
 +
#** The pattern "GO:000[567]" matches the id attribute of a dbReference element whose type is GO, so it represents Gene Ontology (GO) term identifiers.
 
# What Match command tallies the occurrences of the pattern <code>\"Yu.*\"</code> in the ''493.P_falciparum.xml'' file?
 
# What Match command tallies the occurrences of the pattern <code>\"Yu.*\"</code> in the ''493.P_falciparum.xml'' file?
 +
#* java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
 
#* How many unique matches are there?
 
#* How many unique matches are there?
 +
#** 3
 
#* How many times does each unique match appear?
 
#* How many times does each unique match appear?
 +
#** "yu b.": 1
 +
#** "yu k.": 228
 +
#** "yu m.": 1
 
#* What information do you think this pattern represents?
 
#* What information do you think this pattern represents?
 +
#** I think this pattern represents people's names.
 
# Use Match to count the occurrences of the pattern <code>ATG</code> in the ''hs_ref_GRCh37_chr19.fa'' file (this may take a while).  Then, use '''grep''' and '''wc''' to do the same thing.
 
# Use Match to count the occurrences of the pattern <code>ATG</code> in the ''hs_ref_GRCh37_chr19.fa'' file (this may take a while).  Then, use '''grep''' and '''wc''' to do the same thing.
 +
#* java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
 +
#* grep "ATG" hs_ref_GRCh37_chr19.fa | wc
 
#* What answer does Match give you?
 
#* What answer does Match give you?
 +
#** Total matches: atg: 830101
 +
#** Total unique matches: 1
 
#* What answer does '''grep''' + '''wc''' give you?
 
#* What answer does '''grep''' + '''wc''' give you?
 +
#**  Lines: 502410 
 +
#** Words: 502410
 +
#** Characters: 35671048
 
#* Explain why the counts are different. (''Hint:'' Make sure you understand what exactly is being counted by each approach.)
 
#* Explain why the counts are different. (''Hint:'' Make sure you understand what exactly is being counted by each approach.)
 +
#** Match reports the total number of times the pattern occurs in the file: "ATG" appears 830,101 times in hs_ref_GRCh37_chr19.fa. Grep, by contrast, prints every line that contains at least one match, so a line with several occurrences of "ATG" is still printed only once. Wc then reports three counts for grep's output: lines, words, and characters. Grep found 502,410 lines that contain "ATG". Because the sequence lines contain no spaces, each line counts as one word, so wc also reports 502,410 words, along with 35,671,048 characters across those lines.
 +
 +
{{Template: Anuvarsh}}

Latest revision as of 08:40, 20 September 2015

The Genetic Code, by Computer

Connect to the my.cs.lmu.edu workstation as shown in class and do the following exercises from there.

I did so by performing the following command:

   ssh avarshne@my.cs.lmu.edu

and then inputting my password.

I then created a folder for this class.

   mkdir biodb2015

And created a sequence_file.txt file

   echo 'agcggtatac' >sequence_file.txt

I then moved to Dondi's repository and copied over some files using the following commands:

   cd ~dondi/xmlpipedb/data
   cp genetic-code.sed ~avarshne/biodb2015
   cp xmlpipedb-match-1.1.1.jar ~avarshne/biodb2015
   cp prokaryote.txt ~avarshne/biodb2015
   cp infA-E.coli-K12.txt ~avarshne/biodb2015
   cp 493.P_falciparum.xml ~avarshne/biodb2015
   cp hs_ref_GRCh37_chr19.fa ~avarshne/biodb2015


Complement of a Strand

Write a sequence of piped text processing commands that, when given a nucleotide sequence, returns its complementary strand. In other words, fill in the question marks:

   cat sequence_file | ?????

For example, if sequence_file contains:

   agcggtatac

Then your text processing commands should display:

   tcgccatatg

To do this, I first worked out the sequence of steps the computer needs to perform.

  1. Print the contents of sequence_file.txt with cat so that the next command in the pipe can operate on that text.
  2. Replace each A, T, C, and G with its complementary base (T, A, G, and C).

These steps can be achieved with the following command, which produces the following result:

   cat "sequence_file.txt" | sed "y/atcg/tagc/" 
   tcgccatatg
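
As a side note, the same character-for-character mapping can also be sketched with tr instead of sed. This is only an alternative illustration, not the command I submitted; the function name complement is made up for the example, and it assumes the sequence is lowercase.

```shell
# Complement a lowercase DNA strand read from standard input.
# tr maps each character in the first set to the character at the same
# position in the second set: a->t, t->a, c->g, g->c.
complement() {
  tr 'atcg' 'tagc'
}

printf 'agcggtatac\n' | complement
```

For the practice sequence this prints tcgccatatg, the same result as the sed "y/atcg/tagc/" version.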

Reading Frames

Write 6 sets of text processing commands that, when given a nucleotide sequence, returns the resulting amino acid sequence, one for each possible reading frame for the nucleotide sequence. In other words, fill in the question marks:

   cat sequence_file | ?????

In this case, the steps that the computer needs to complete are as follows:

  1. Print the contents of the sequence_file.txt file.
    • cat "sequence_file.txt"
  2. Convert the sequence to RNA. For the +1, +2, and +3 frames, replace every "t" with "u". For the -1, -2, and -3 frames, build the complementary RNA strand by replacing each A, T, C, and G with U, A, G, and C, and then reverse it.
    • sed "s/t/u/g"
    • sed "y/atcg/uagc/" | rev
  3. Remove any necessary bases from the beginning of the sequence in order to start at the correct reading frame.
    • either not applicable, sed "s/^.//g", or sed "s/^..//g"
  4. Add a space after every codon (every 3 characters).
    • sed "s/.../& /g"
  5. Use the sed commands already written in the genetic-code.sed file to convert each codon into its corresponding amino acid.
    • sed -f genetic-code.sed
  6. Remove all added spaces between codons.
    • sed "s/ //g"
  7. Remove any leftover bases that were not part of a complete codon and so could not be translated into an amino acid.
    • sed "s/[aucg]//g"
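
To see what steps 2 through 4 produce before any translation happens, the pipeline can be stopped early. The following is a rough sketch rather than part of the assignment: the helper name codons and its two arguments (strand and number of bases to skip) are made up for illustration.

```shell
# Print the codons of one reading frame, before translation.
# $1 = strand: "+" (forward) or "-" (reverse complement)
# $2 = number of leading bases to drop: 0, 1, or 2
# The lowercase DNA sequence is read from standard input.
codons() {
  if [ "$1" = "-" ]; then
    sed "y/atcg/uagc/" | rev      # complementary RNA strand, reversed
  else
    sed "s/t/u/g"                 # forward strand as RNA
  fi | sed "s/^.\{$2\}//" | sed "s/.../& /g"
}

printf 'agcggtatac\n' | codons + 0   # agc ggu aua c
printf 'agcggtatac\n' | codons - 0   # gua uac cgc u
```

The trailing c and u in these two outputs are the leftover bases that step 7 removes after translation.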

Because we are looking at 6 different reading frames on that fragment of DNA, 6 different commands need to be written, one for each protein sequence. Each of the following commands represents one reading frame and is followed by the resulting protein sequence.

+1

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   SGI

+2

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   AVY

+3

   cat "sequence_file.txt" | sed "s/t/u/g" | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   RY

-1

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   VYR

-2

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^.//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   YTA

-3

   cat "sequence_file.txt" | sed "y/atcg/uagc/" | rev | sed "s/^..//g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[aucg]//g"
   IP
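
Since the six commands differ only in the strand and in how many leading bases are dropped, they can be folded into one loop. The sketch below is an illustration under assumptions, not what I ran: the function name frame is made up, and the small table written to a temporary file is a hypothetical stand-in for genetic-code.sed that covers only the 12 codons occurring in the frames of agcggtatac. With the real file, sed -f genetic-code.sed would take its place.

```shell
# Hypothetical stand-in for genetic-code.sed: only the codons needed
# for the practice sequence, in the same s/codon/LETTER/g style.
table=$(mktemp)
cat > "$table" <<'EOF'
s/agc/S/g
s/ggu/G/g
s/aua/I/g
s/gcg/A/g
s/gua/V/g
s/uac/Y/g
s/uau/Y/g
s/cgg/R/g
s/cgc/R/g
s/acc/T/g
s/gcu/A/g
s/ccg/P/g
EOF

# frame STRAND OFFSET: translate one reading frame of the DNA on stdin.
# STRAND is "+" or "-"; OFFSET is the number of leading bases to drop.
frame() {
  if [ "$1" = "-" ]; then
    sed "y/atcg/uagc/" | rev
  else
    sed "s/t/u/g"
  fi | sed "s/^.\{$2\}//" | sed "s/.../& /g" | sed -f "$table" \
     | sed "s/ //g" | sed "s/[aucg]//g"
}

# Label each frame (+1 .. -3) and print its translation.
for strand in + -; do
  for offset in 0 1 2; do
    printf '%s%d ' "$strand" "$((offset + 1))"
    printf 'agcggtatac\n' | frame "$strand" "$offset"
  done
done
```

For the practice sequence, the loop prints the same six results listed above, one labeled frame per line.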

Check Your Work

I checked my work with the ExPASy Translate Tool. I entered my original sequence in its RNA form (agcgguauac) and received the following results from the translator:

  • 5'3' Frame 1: S G I
  • 5'3' Frame 2: A V Y
  • 5'3' Frame 3: R Y
  • 3'5' Frame 1: V Y R
  • 3'5' Frame 2: Y T A
  • 3'5' Frame 3: I P

XMLPipeDB Match Practice

For your convenience, the XMLPipeDB Match Utility (xmlpipedb-match-1.1.1.jar) has been installed in the ~dondi/xmlpipedb/data directory alongside the other practice files. Use this utility to answer the following questions:

  1. What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • go:0007: 113
      • go:0006: 1100
      • go:0005: 1371
  2. Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence.
    • <dbReference type="GO" id="GO:0007010">
    • Describe how you did this.
      • grep "GO:000[567]" 493.P_falciparum.xml
    • Based on where you find this occurrence, what kind of information does this pattern represent?
      • The pattern "GO:000[567]" matches the id attribute of a dbReference element whose type is GO, so it represents Gene Ontology (GO) term identifiers.
  3. What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file?
    • java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
    • How many unique matches are there?
      • 3
    • How many times does each unique match appear?
      • "yu b.": 1
      • "yu k.": 228
      • "yu m.": 1
    • What information do you think this pattern represents?
      • I think this pattern represents people's names.
  4. Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
    • java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
    • grep "ATG" hs_ref_GRCh37_chr19.fa | wc
    • What answer does Match give you?
      • Total matches: atg: 830101
      • Total unique matches: 1
    • What answer does grep + wc give you?
      • Lines: 502410
      • Words: 502410
      • Characters: 35671048
    • Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.)
      • Match reports the total number of times the pattern occurs in the file: "ATG" appears 830,101 times in hs_ref_GRCh37_chr19.fa. Grep, by contrast, prints every line that contains at least one match, so a line with several occurrences of "ATG" is still printed only once. Wc then reports three counts for grep's output: lines, words, and characters. Grep found 502,410 lines that contain "ATG". Because the sequence lines contain no spaces, each line counts as one word, so wc also reports 502,410 words, along with 35,671,048 characters across those lines.
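
The line-versus-occurrence difference can be reproduced on a tiny made-up sample. This is only a sketch: the three-line sample is invented for illustration, and counting with grep -o would not necessarily reproduce Match's total on the real chromosome file, since the two tools may treat case and line boundaries differently.

```shell
# A three-line sample: the first line contains ATG twice.
sample=$(mktemp)
printf 'ATGATG\nCCC\nATG\n' > "$sample"

# Lines containing at least one ATG (the first number that grep | wc reports).
matching_lines=$(grep "ATG" "$sample" | wc -l)

# Individual occurrences: -o prints each match on its own line.
occurrences=$(grep -o "ATG" "$sample" | wc -l)

# $(( )) strips any padding that wc adds around the number.
echo "lines: $((matching_lines))  occurrences: $((occurrences))"
```

Here grep alone finds 2 matching lines, while grep -o finds 3 occurrences, which is the same kind of gap seen between 502,410 and 830,101.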

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS
