Blitvak Week 3
From LMU BioDB 2015
Revision as of 00:38, 21 September 2015 by Blitvak (Talk | contribs) (minor edit to initial preparation text)
Individual Journal Assignment Week 3
Initial Preparations
- PuTTY was downloaded, installed, and initialized
- Connected to my.cs.lmu.edu workstation via PuTTY
- Entered ~dondi/xmlpipedb/data using
cd ~dondi/xmlpipedb/data
- All results were checked using the ExPASy Translate Tool and the Nucleic Acid Sequence Massager, provided by Attotron
- prokaryote.txt in ~dondi/xmlpipedb/data was examined using
cat prokaryote.txt
- prokaryote.txt was chosen for use in the first part of this assignment
- The sequence in prokaryote.txt was copied and pasted into a separate file for future reference and checking
Finding the Complementary Strand
- sed "y/atcg/tagc/" was found to replace all lowercase a's, t's, c's, and g's with t's, a's, g's, and c's respectively (in lines of text)
- Using prokaryote.txt, the given nucleotide sequence, the complementary strand was found by using
cat prokaryote.txt | sed "y/atcg/tagc/"
- The given nucleotide sequence was:
tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac
- The complementary strand, using
cat prokaryote.txt | sed "y/atcg/tagc/", was found to be:
agatgatataaagttatccatgctaccggtttcttctgttataacttgaactttgcaacggattatggtacaaggcgcatattgggtcggcggtcaaggcgaccgccgtaaaattg
- This result was confirmed by the Nucleic Acid Sequence Massager
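The substitution above can be tried on a short input before running it on the whole file. This is only a minimal sketch: the complement function name and the seven-letter input (the first seven letters of the given sequence) are chosen for illustration and are not part of the assignment files.

```shell
# complement: hypothetical helper that wraps the same sed command used above.
# The y command maps each lowercase a, t, c, g to t, a, g, c in a single pass.
complement() {
  printf '%s\n' "$1" | sed "y/atcg/tagc/"
}

complement "tctacta"   # prints agatgat
```

Because y translates character by character, the output always has the same length as the input, which makes it easy to line up against the original strand.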
Finding the 6 Reading Frames of prokaryote.txt
Initial Findings
- While still in ~dondi/xmlpipedb/data, genetic-code.sed was examined using
cat genetic-code.sed
- genetic-code.sed was found to contain all of the sed replacement commands needed to convert any mRNA triplet to an amino acid
- The large number of sed replacement commands in genetic-code.sed made it apparent that chaining them all together in one pipeline would be difficult and tedious. Ideally, all of genetic-code.sed would be applied in one command
- cat prokaryote.txt | sed "s/^.//g" was found to remove the first letter from the nucleotide sequence
- cat prokaryote.txt | sed "s/^..//g" was found to remove the first two letters from the nucleotide sequence
- cat prokaryote.txt | sed "s/.../ & /g" was found to split the nucleotide sequence into triplets, with spaces in between each
- rev prokaryote.txt was found to reverse the sequence (changes the direction from 5' - 3' to 3' - 5', or vice versa)
- It was assumed that the sequence in prokaryote.txt ran from 5' to 3'
- sed "s/[atcg]//g" was found to delete any lowercase nucleotide letters
- sed "y/t/u/" was found to replace any lowercase t's with u's; this would be useful in converting a DNA sequence into RNA
- It was realized that a file containing a set of sed commands could be applied by using
sed -f <filename>; this would be a good pairing with genetic-code.sed!
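The individual commands listed above can be checked one at a time on a short input. The sketch below uses a made-up ten-letter sequence (not taken from prokaryote.txt) so that each result is easy to verify by eye.

```shell
# Toy sequence chosen for illustration only.
seq="atggcctaag"

printf '%s\n' "$seq" | sed "s/^.//g"      # drops the first letter:  tggcctaag
printf '%s\n' "$seq" | sed "s/^..//g"     # drops the first two:     ggcctaag
printf '%s\n' "$seq" | sed "s/.../ & /g"  # pads each triplet with spaces; the leftover g stays at the end
printf '%s\n' "$seq" | rev                # reverses the line:       gaatccggta
printf '%s\n' "$seq" | sed "y/t/u/"       # t becomes u:             auggccuaag
```

Since "s/.../ & /g" puts a space on both sides of every triplet, neighbouring triplets end up separated by two spaces, which keeps later codon substitutions from matching across a triplet boundary.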
Finding the Reading Frames of the mRNA-like strand (5'-3')
- +1 reading frame was found by using:
cat prokaryote.txt | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
- +2 reading frame was found by using:
cat prokaryote.txt | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
- +3 reading frame was found by using:
cat prokaryote.txt | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
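The three forward-frame pipelines differ only in how many leading letters are dropped, so they can be sketched as one function. Everything named here is an assumption made for illustration: the translate_frame helper, the toy sequence, and the eight-rule file that stands in for genetic-code.sed (the real file's contents are not reproduced on this page, so its exact rule format is assumed to be s/codon/LETTER/g with - for stop codons).

```shell
# Hypothetical stand-in for genetic-code.sed: only eight codons, enough for the toy input.
rules=$(mktemp)
trap 'rm -f "$rules"' EXIT
cat > "$rules" <<'EOF'
s/aug/M/g
s/gcc/A/g
s/uaa/-/g
s/ugg/W/g
s/ccu/P/g
s/aag/K/g
s/ggc/G/g
s/cua/L/g
EOF

# translate_frame SEQ OFFSET
# OFFSET is 0, 1, or 2 leading letters to drop (frames +1, +2, +3).
# The first sed generalizes "s/^.//g" and "s/^..//g"; the rest mirrors the pipeline above.
translate_frame() {
  printf '%s\n' "$1" \
    | sed "s/^.\{$2\}//" \
    | sed "s/.../ & /g" \
    | sed "y/t/u/" \
    | sed -f "$rules" \
    | sed "s/ //g" \
    | sed "s/[atcg]//g"
}

translate_frame atggcctaag 0   # prints MA-
translate_frame atggcctaag 1   # prints WPK
translate_frame atggcctaag 2   # prints GL
```

The final "s/[atcg]//g" step only removes leftover letters that did not form a full triplet; with an incomplete rule file like this one, untranslated codons would be partly deleted as well, so the sketch is only meaningful for inputs the rules cover.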
Finding the Reading Frames of the template strand (3'-5')
- -1 reading frame was found by using:
cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
VKMPPAELAAGLYAEHGIRQRFKFNIVFFGHRTY-NIV
- -2 reading frame was found by using:
cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
- -3 reading frame was found by using:
cat prokaryote.txt | rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
-NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
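The step that sets the negative frames apart is the reverse complement: rev followed by the complement substitution. A minimal sketch of just that step is given here, with a made-up helper name (revcomp) and a toy sequence that is not from prokaryote.txt.

```shell
# revcomp: hypothetical helper that reverses a sequence, then complements it.
revcomp() {
  printf '%s\n' "$1" | rev | sed "y/atcg/tagc/"
}

revcomp atggcctaag             # prints cttaggccat
revcomp "$(revcomp atggcctaag)"  # applying it twice gives back atggcctaag

# Triplets of the -2 frame: reverse complement, drop one letter, then split.
revcomp atggcctaag | sed "s/^.//g" | sed "s/.../ & /g"
```

Checking that the operation undoes itself when applied twice is a quick sanity test that does not depend on any external tool.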
Checking Results
- Using the ExPASy Translate Tool,
tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac was entered and converted into the possible sequences of amino acids (output format was selected as compact). The 6 reading frames, as given by this tool, matched those found in the assignment
XMLPipeDB Match Practice
Preparations
- The program xmlpipedb-match-1.1.1.jar was found in ~dondi/xmlpipedb/data
- It was found that Java programs can be run by using
java -jar <program name>
- xmlpipedb-match-1.1.1.jar would be run, for the purpose of matching patterns, by using
java -jar xmlpipedb-match-1.1.1.jar <pattern> < <filename>
- 493.P_falciparum.xml was found in ~dondi/xmlpipedb/data and examined using
cat 493.P_falciparum.xml; it took quite some time to fully load (viewing using more seems like a good idea)
Working with XMLPipeDB Match
- Match command for the tallying of the occurrences of the pattern GO:000[567] in 493.P_falciparum.xml
- java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml can be used to match occurrences of GO:0005, GO:0006, and GO:0007
- 3 total unique matches were found: GO:0005, GO:0006, and GO:0007
- Occurrences of each unique match: 113 for GO:0007, 1100 for GO:0006, and 1371 for GO:0005
- Observing "in situ" occurrences of GO:000[567] in 493.P_falciparum.xml
- more 493.P_falciparum.xml was used to make the viewing of the file more manageable
- While in more, typing /GO:0006 and pressing enter brought a line containing the pattern GO:0006 to the top of the window (surrounded by the file's text)
- Based on the surrounding text, the pattern likely represents the beginning portion of an ID string tied to various genes in a gene database for Plasmodium falciparum
- In the text, it was found that various processes/metabolic pathways are connected to each database ID string (likely influenced by the genes in question)
- Match command for the tallying of the occurrences of the pattern \"Yu.*\" in 493.P_falciparum.xml
- 3 total unique matches were found: "yu b.", "yu k.", and "yu m."
- Occurrences of each unique match: 1 for "yu b.", 228 for "yu k.", and 1 for "yu m."
- I'm fairly certain that this pattern represents a person's name. By using more 493.P_falciparum.xml, and typing /\"Yu.*\", an example of an in-text line containing this pattern was found. It was observed that this pattern is preceded by <person name=
- Using Match and grep + wc to count occurrences of the pattern ATG in hs_ref_GRCh37_chr19.fa
- hs_ref_GRCh37_chr19.fa was found in ~dondi/xmlpipedb/data
- java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa was employed to find the instances of ATG via Match
- Output: 1 unique match, atg, was found. There are 830101 instances of atg in the file
- grep "ATG" hs_ref_GRCh37_chr19.fa | wc was used to find the instances of ATG using grep + wc
- Output: 502410 lines, 502410 words, and 35671048 characters (the output of grep + wc is unlabeled; it is always lines, words, and characters from left to right)
- There is a large difference between the outputs of Match and grep + wc with regard to finding the occurrences of ATG. The difference arises because Match counts each individual instance of the ATG pattern (possibly several in a line), while grep + wc only finds lines that contain at least one instance of ATG and counts those lines. grep + wc reports the same number for lines and words because each line output by grep contains no spaces or breaks, so wc sees each line as a single word
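The gap between counting lines and counting occurrences can be reproduced on a tiny file using grep alone. This sketch does not use Match; it swaps in grep -o (print each match on its own line, supported by GNU and BSD grep) as a rough stand-in for a per-occurrence count, and sort plus uniq -c as a stand-in for a per-pattern tally. The helper names and both sample inputs are made up for illustration.

```shell
# Four-line sample: lines 1, 2, and 4 contain ATG, and line 1 contains it twice.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'ATGATG\nCCATGC\nGGGGGG\nATG\n' > "$sample"

# count_lines PATTERN FILE: lines with at least one match (what grep + wc reports).
count_lines() {
  grep "$1" "$2" | wc -l | tr -d ' '
}

# count_matches PATTERN FILE: every non-overlapping match, several per line if present.
count_matches() {
  grep -o "$1" "$2" | wc -l | tr -d ' '
}

# tally PATTERN: reads stdin and counts how often each distinct match occurs.
tally() {
  grep -o "$1" | sort | uniq -c
}

count_lines ATG "$sample"     # prints 3
count_matches ATG "$sample"   # prints 4
printf 'GO:0005 GO:0006\nGO:0005\nGO:0007\n' | tally 'GO:000[567]'
```

In the sample, the line count is 3 while the occurrence count is 4, which is the same kind of gap seen between the grep + wc and Match results above. Neither grep form can see a match that is split across a line break.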
Brandon Litvak
BIOL 367, Fall 2015