Blitvak Week 3

From LMU BioDB 2015
Revision as of 00:35, 21 September 2015 by Blitvak (Talk | contribs) (first version of individual journal assignment wk3)

Individual Journal Assignment Week 3

Initial Preparations

  • PuTTY was downloaded, installed, and initialized
  • Connected to workstation via PuTTY
  • Entered ~dondi/xmlpipedb/data using cd ~dondi/xmlpipedb/data
  • All results were checked using the ExPASy Translate Tool and the Nucleic Acid Sequence Massager, provided by Attotron
  • prokaryote.txt in ~dondi/xmlpipedb/data was examined using cat prokaryote.txt
  • prokaryote.txt was chosen for use in the first part of this assignment
  • The sequence in prokaryote.txt was copied and pasted into a separate file for future reference/checking

Finding the Complementary Strand

  • sed "y/atcg/tagc/" was found to replace all lowercase a's, t's, c's, and g's with t's, a's, g's, and c's respectively (in lines of text)
  • Using prokaryote.txt, which holds the given nucleotide sequence, the complementary strand was found with cat prokaryote.txt | sed "y/atcg/tagc/"
  • The given nucleotide sequence was:
  • The complementary strand, using cat prokaryote.txt | sed "y/atcg/tagc/", was found to be:
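
To illustrate how the y command maps each base to its complement, here is a minimal sketch on a made-up six-letter sequence. The sample sequence is an assumption for illustration only and is not the contents of prokaryote.txt.

```shell
# Hypothetical sample sequence; NOT the contents of prokaryote.txt
seq="atgcca"

# y/atcg/tagc/ swaps a<->t and c<->g, one character at a time
complement=$(printf '%s\n' "$seq" | sed "y/atcg/tagc/")

echo "$complement"
```

Because y works character by character, the output has the same length and orientation as the input; only the letters change.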

Finding the 6 Reading Frames of prokaryote.txt

Initial Findings

  • While still in ~dondi/xmlpipedb/data, genetic-code.sed was examined using cat genetic-code.sed
  • genetic-code.sed was found to contain all of the sed replacement commands needed to convert any mRNA triplet to an amino acid
  • The large number of sed replacement commands in genetic-code.sed made it apparent that chaining them all together in one pipeline would be difficult and tedious. Ideally, all of genetic-code.sed would be applied in a single command
  • cat prokaryote.txt | sed "s/^.//g" was found to remove the first letter from the nucleotide sequence
  • cat prokaryote.txt | sed "s/^..//g" was found to remove the first two letters from the nucleotide sequence
  • cat prokaryote.txt | sed "s/.../ & /g" was found to split the nucleotide sequence into a set of triplets, with spaces in between each
  • rev prokaryote.txt was found to reverse the sequence (changes the direction from 5' - 3' to 3' - 5', or vice versa)
  • It was assumed that the sequence in prokaryote.txt ran from 5' to 3'
  • sed "s/[atcg]//g" was found to delete any lowercase nucleotide letters (a, t, c, and g)
  • sed "y/t/u/" was found to replace any lowercase t's with u's; this is useful for converting a DNA sequence into its RNA equivalent
  • It was realized that a file containing a set of sed commands can be applied by using sed -f <filename>; this pairs well with genetic-code.sed!
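
The individual building blocks above can be tried on a short string to see what each one does. This is a sketch under the assumption of a made-up ten-letter sample sequence; it does not use prokaryote.txt.

```shell
# Hypothetical 10-letter sample; NOT the contents of prokaryote.txt
seq="atggccaaag"

# Drop the first letter (where the +2 frame starts)
shift1=$(printf '%s\n' "$seq" | sed "s/^.//")

# Drop the first two letters (where the +3 frame starts)
shift2=$(printf '%s\n' "$seq" | sed "s/^..//")

# Split into space-padded triplets; the trailing "g" is an incomplete codon
triplets=$(printf '%s\n' "$seq" | sed "s/.../ & /g")

# Reverse the sequence, and separately swap t for u
reversed=$(printf '%s\n' "$seq" | rev)
rna=$(printf '%s\n' "$seq" | sed "y/t/u/")

printf '%s\n' "$shift1" "$shift2" "$triplets" "$reversed" "$rna"
```

The space padding in the triplet step is what later keeps a codon pattern from matching across two neighboring codons.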

Finding the Reading Frames of the mRNA-like strand (5'-3')

  • +1 reading frame was found by using: cat prokaryote.txt | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
  • +2 reading frame was found by using: cat prokaryote.txt | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
  • +3 reading frame was found by using: cat prokaryote.txt | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
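
The +1 pipeline can be sketched end to end on a small input. The three-rule codon file below is a hypothetical stand-in written just for this example; the actual format and contents of genetic-code.sed are not reproduced here. The last step uses [aucg] rather than [atcg], which is a deliberate variation: by that point every t has already become u, so a leftover u would otherwise survive.

```shell
# Hypothetical three-codon table; the real genetic-code.sed covers every codon
codes=$(mktemp)
cat > "$codes" <<'EOF'
s/aug/M/g
s/gcc/A/g
s/aaa/K/g
EOF

# Hypothetical sample sequence; NOT the contents of prokaryote.txt
seq="atggccaaag"

# triplets -> RNA -> amino acids -> strip spaces -> strip leftover bases
protein=$(printf '%s\n' "$seq" |
  sed "s/.../ & /g" |
  sed "y/t/u/" |
  sed -f "$codes" |
  sed "s/ //g" |
  sed "s/[aucg]//g")

rm -f "$codes"
echo "$protein"
```

For the +2 and +3 frames, the same sketch applies with sed "s/^.//" or sed "s/^..//" placed directly after the printf.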

Finding the Reading Frames of the template strand (3'-5')

  • -1 reading frame was found by using: cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
  • -2 reading frame was found by using: cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
  • -3 reading frame was found by using: cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
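
The reverse-then-complement step used for the negative frames can be checked on a short string. This sketch again assumes a made-up sample sequence. Note that rev reads standard input when no file name is given, so piping into a bare rev is enough; when rev is given a file name, as in the commands above, it reads that file and the preceding cat is redundant.

```shell
# Hypothetical sample sequence; NOT the contents of prokaryote.txt
seq="atggccaaag"

# Reverse, then complement: yields the opposite strand read 5' to 3'
revcomp=$(printf '%s\n' "$seq" | rev | sed "y/atcg/tagc/")

# Start of the -2 frame: drop one letter, then split into triplets
minus2=$(printf '%s\n' "$revcomp" | sed "s/^.//" | sed "s/.../ & /g")

echo "$revcomp"
echo "$minus2"
```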

Checking Results

  • Using the ExPASy Translate Tool, tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac was entered and converted into the possible sequences of amino acids (output format was selected as compact). The 6 reading frames, as given by this tool, matched those found in the assignment

XMLPipeDB Match Practice


  • The program xmlpipedb-match-1.1.1.jar was found in ~dondi/xmlpipedb/data
  • It was found that Java programs packaged as .jar files can be run by using java -jar <program name>
  • xmlpipedb-match-1.1.1.jar can be run, for the purpose of matching patterns, by using java -jar xmlpipedb-match-1.1.1.jar <pattern> < <filename>
  • 493.P_falciparum.xml was found in ~dondi/xmlpipedb/data and examined using cat 493.P_falciparum.xml; it took quite some time to fully load (viewing it with more seems like a better idea)

Working with XMLPipeDB Match

  1. Match command for the tallying of the occurrences of the pattern GO:000[567] in 493.P_falciparum.xml
    • java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml can be used to match occurrences of GO:0005, GO:0006, and GO:0007
    • 3 total unique matches were found: GO:0005, GO:0006, and GO:0007
    • Occurrences of each unique match: 113 for GO:0007, 1100 for GO:0006, and 1371 for GO:0005
  2. Observing "in situ" occurrences of GO:000[567] in 493.P_falciparum.xml
    • more 493.P_falciparum.xml was used to make the viewing of the file more manageable
    • While in more, typing /GO:0006 and pressing Enter jumped to a line containing the pattern GO:0006, displayed near the top of the window (surrounded by the file's text)
    • Based on the surrounding text, the pattern likely represents the beginning portion of an ID string tied to various genes in a gene database for Plasmodium falciparum
    • In the text, it was found that various processes/metabolic pathways are connected to each database ID string (likely influenced by the genes in question)
  3. Match command for the tallying of the occurrences of the pattern \"Yu.*\" in 493.P_falciparum.xml
    • 3 total unique matches were found: "yu b.", "yu k.", and "yu m."
    • Occurrences of each unique match: 1 for "yu b.", 228 for "yu k.", and 1 for "yu m.".
    • I'm fairly certain that this pattern represents a person's name. By using more 493.P_falciparum.xml, and typing /\"Yu.*\", an example of an in-text line containing this pattern was found. It was observed that this pattern is preceded by <person name=
  4. Using Match and grep + wc to count occurrences of the pattern ATG in hs_ref_GRCh37_chr19.fa
    • hs_ref_GRCh37_chr19.fa was found in ~dondi/xmlpipedb/data
    • java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa was employed to find the instances of ATG via Match
      • Output: 1 unique match, atg, was found. There are 830101 instances of atg in the file
    • grep "ATG" hs_ref_GRCh37_chr19.fa | wc was used to find the instances of ATG using grep + wc
      • Output: 502410 lines, 502410 words, and 35671048 characters (the output of wc is unlabeled; it is always lines, words, and characters from left to right)
    • There is a large difference between the outputs of Match and grep + wc with regard to finding the occurrences of ATG. The main reason is that Match counts each individual instance of the ATG pattern (possibly several in a line), while grep + wc only finds lines that contain at least one instance of ATG and counts those lines. Match also reported its single unique match as lowercase atg, which suggests it may ignore case, whereas grep "ATG" is case-sensitive; that could widen the gap further. wc reports the same number for lines and words because each line output by grep contains no spaces/breaks and is therefore counted as a single word
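
The line-count versus occurrence-count distinction can be reproduced on a tiny file. This is a sketch on a made-up four-line sample, not on hs_ref_GRCh37_chr19.fa. It assumes a grep that supports the -o option (GNU and BSD grep both do, although POSIX does not require it), and it does not attempt to mimic how Match itself counts.

```shell
# Hypothetical four-line sample file; NOT the chromosome 19 FASTA file
sample=$(mktemp)
printf '%s\n' "ATGATG" "ccATGcc" "atgatg" "GGGG" > "$sample"

# Lines containing at least one uppercase ATG (what grep | wc reports)
lines=$(grep "ATG" "$sample" | wc -l | tr -d ' ')

# Individual uppercase ATG occurrences: -o prints each match on its own line
hits=$(grep -o "ATG" "$sample" | wc -l | tr -d ' ')

# Individual occurrences ignoring case
hits_any_case=$(grep -o -i "ATG" "$sample" | wc -l | tr -d ' ')

rm -f "$sample"
echo "lines=$lines hits=$hits hits_any_case=$hits_any_case"
```

On this sample the three counts differ, which shows how both multiple matches per line and letter case can separate a per-line count from a per-occurrence count.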

Brandon Litvak
BIOL 367, Fall 2015
