Difference between revisions of "Blitvak Week 3"
From LMU BioDB 2015
Latest revision as of 03:57, 22 September 2015
Individual Journal Assignment Week 3
Initial Preparations
- PuTTY was downloaded, installed, and initialized
- Connected to my.cs.lmu.edu workstation via PuTTY
- Entered ~dondi/xmlpipedb/data using
cd ~dondi/xmlpipedb/data
- All results were checked using the ExPASy Translate Tool and the Nucleic Acid Sequence Massager, provided by Attotron
- prokaryote.txt in ~dondi/xmlpipedb/data was examined using
cat prokaryote.txt
- prokaryote.txt was chosen for use in the first part of this assignment
- The sequence in prokaryote.txt was copied and pasted into a separate file for future reference and checking (additionally, I found that text can be copied in PuTTY by highlighting and right-clicking)
- Key goal in this first segment is to find the complementary strand of the sequence in prokaryote.txt. This should be accomplished by utilizing the base pairing rules of A-T and C-G
- Key command in this assignment would be sed; various kinds of pattern replacements, combined together, can prove to be very powerful (should allow me to convert DNA to mRNA, and mRNA to an amino acid sequence)
Finding the Complementary Strand
sed "y/atcg/tagc/"
was found to replace every lowercase a, t, c, and g with t, a, g, and c, respectively (in lines of text); this command should allow me to find the complementary strand
- Using prokaryote.txt, the given nucleotide sequence, the complementary strand was found by using
cat prokaryote.txt | sed "y/atcg/tagc/"
- The given nucleotide sequence was:
tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac
- The complementary strand, using
cat prokaryote.txt | sed "y/atcg/tagc/"
, was found to be:
agatgatataaagttatccatgctaccggtttcttctgttataacttgaactttgcaacggattatggtacaaggcgcatattgggtcggcggtcaaggcgaccgccgtaaaattg
- This result was confirmed by the Nucleic Acid Sequence Massager
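The transliteration step can be tried without access to prokaryote.txt. The following is a minimal sketch, assuming only a POSIX shell and sed; the function name `complement` is made up for illustration, and the input is just the first ten bases of the sequence above.

```shell
# Hypothetical helper: complement a lowercase DNA sequence read from stdin.
complement() {
  # y/// maps characters position by position: a->t, t->a, c->g, g->c
  sed 'y/atcg/tagc/'
}

# First ten bases of the given sequence; prints agatgatata
printf 'tctactatat\n' | complement
```

Because `y` swaps all four letters in a single pass, there is no risk of an a that has just become a t being turned back into an a, which would happen with four separate `s///g` replacements.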
Finding the 6 Reading Frames of prokaryote.txt
Initial Findings
- While still in ~dondi/xmlpipedb/data, genetic-code.sed was examined using
cat genetic-code.sed
- genetic-code.sed was found to contain all of the sed replacement commands needed to convert any mRNA triplet to an amino acid
- The large amount of sed replacement commands in genetic-code.sed made it apparent that linking them all together in one pipeline would be difficult and tedious. All of genetic-code.sed, ideally, would be exploited in one command
cat prokaryote.txt | sed "s/^.//g"
was found to remove the first letter from the nucleotide sequence (useful for the +2 and -2 reading frames, which omit the first sequence letter)
cat prokaryote.txt | sed "s/^..//g"
was found to remove the first two letters from the nucleotide sequence (useful for the +3 and -3 reading frames, which omit the first two sequence letters)
cat prokaryote.txt | sed "s/.../ & /g"
was found to split the nucleotide sequence into triplets separated by spaces (this keeps the codons distinct and readable for the program; it should be paired with genetic-code.sed)
rev prokaryote.txt
was found to reverse the sequence (changes the direction from 5' - 3' to 3' - 5', or vice versa; useful for making the template strand run from 5' - 3' before working out its reading frames)
- It was assumed that the sequence in prokaryote.txt ran from 5' to 3'
sed "s/[atcg]//g"
was found to delete any lowercase nucleotide letters (should remove any leftover letters that did not form codons and, thus, did not lead to an amino acid)
sed "y/t/u/"
was found to replace any lowercase t's with u's; useful for converting a DNA sequence into RNA
- It was realized that a file containing a set of sed commands could be applied by using
sed -f <filename>
; this command should pair well with genetic-code.sed!
- I figured out that
sed "s/.../ & /g"
would eventually lead to an amino acid sequence with spaces in between each letter.
sed "s/ //g"
was found to delete any spaces (best placed after the codons are converted to a sequence of amino acids)
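The individual sed steps above can be tried on a short sequence before chaining them. This is a rough sketch, assuming a POSIX shell; the 13-base sequence and the helper name `to_rna_codons` are invented for illustration and are not taken from prokaryote.txt.

```shell
seq='atggccaaataga'   # illustrative sequence only; 13 bases, so one base is left over

# Drop the first base (the +2 frame offset), then split the rest into space-padded codons
printf '%s\n' "$seq" | sed 's/^.//' | sed 's/.../ & /g'

# Hypothetical helper: split stdin into codons, then convert t to u
to_rna_codons() {
  sed 's/.../ & /g' | sed 'y/t/u/'
}

# The trailing "a" does not form a full codon and stays unpadded at the end
printf '%s\n' "$seq" | to_rna_codons
```

The leftover base at the end of the second output is the reason a final cleanup step that deletes lowercase letters is needed after translation.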
Finding the Reading Frames of the mRNA-like strand (5'-3')
- +1 reading frame was found by using:
cat prokaryote.txt | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
STIFQ-VRWPKKTILNLKRCLIPCSAYNPAASSAGGIL
- +2 reading frame was found by using:
cat prokaryote.txt | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
LLYFNRYDGQRRQY-T-NVA-YHVPRITQPPVPLAAF-
- +3 reading frame was found by using:
cat prokaryote.txt | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
YYISIGTMAKEDNIELETLPNTMFRV-PSRQFRWRHFN
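The three forward-frame pipelines differ only in how many leading bases are skipped, so they can be folded into one function. The sketch below is an assumption-heavy stand-in: the contents of genetic-code.sed are not reproduced in this journal, so a substitute rule file is generated from the standard genetic code (one `s/codon/X/g` rule per codon, with `-` for stop codons, matching the outputs above). The names `translate_frame`, `aa`, and `code` are hypothetical.

```shell
# Standard genetic code with bases ordered u, c, a, g at each codon position;
# '-' marks a stop codon. This is a stand-in, not the real genetic-code.sed.
aa='FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
code=$(mktemp)
trap 'rm -f "$code"' EXIT

i=1
for b1 in u c a g; do
  for b2 in u c a g; do
    for b3 in u c a g; do
      # Emit one rule per codon, e.g. s/aug/M/g
      printf 's/%s/%s/g\n' "$b1$b2$b3" "$(printf '%s' "$aa" | cut -c "$i")" >> "$code"
      i=$((i + 1))
    done
  done
done

# Hypothetical helper: $1 = number of leading bases to skip (0, 1, or 2);
# reads lowercase DNA on stdin and prints the translated frame.
translate_frame() {
  sed "s/^.\{$1\}//" |   # shift the frame
    sed 's/.../ & /g' |  # split into codons
    sed 'y/t/u/' |       # DNA -> RNA
    sed -f "$code" |     # codons -> amino acid letters
    sed 's/ //g' |       # remove the padding spaces
    sed 's/[acgu]//g'    # drop leftover bases that formed no codon
}

printf 'atggccaaatag\n' | translate_frame 0   # prints MAK-
```

One deliberate difference from the pipelines above: the final cleanup uses `[acgu]` rather than `[atcg]`, since any leftover t has already been converted to u by that point. For prokaryote.txt the leftovers happen to contain no t, so the outputs are unaffected.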
Finding the Reading Frames of the template strand (3'-5')
- -1 reading frame was found by using:
rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
VKMPPAELAAGLYAEHGIRQRFKFNIVFFGHRTY-NIV
- -2 reading frame was found by using:
rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^.//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
LKCRQRNWRLGYTRNMVLGNVSSSILSSLAIVPIEI--
- -3 reading frame was found by using:
rev prokaryote.txt | sed "y/atcg/tagc/" | sed "s/^..//g" | sed "s/.../ & /g" | sed "y/t/u/" | sed -f genetic-code.sed | sed "s/ //g" | sed "s/[atcg]//g"
- Output:
-NAASGTGGWVIRGTWY-ATFQVQYCLLWPSYLLKYSR
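The reverse frames all start from the same reverse complement, with 0, 1, or 2 leading bases removed. A small sketch of that shared first stage follows, assuming `rev` is installed (it is part of util-linux rather than POSIX); the helper name `revcomp` and the 12-base test sequence are made up for illustration.

```shell
# Hypothetical helper: reverse-complement a lowercase DNA sequence from stdin,
# so the other strand is also read 5' to 3'.
revcomp() {
  rev | sed 'y/atcg/tagc/'
}

seq='atggccaaatag'   # illustrative sequence only

# Frames -1, -2, and -3 begin 0, 1, and 2 bases into the reverse complement
for skip in 0 1 2; do
  printf '%s\n' "$seq" | revcomp | sed "s/^.\{$skip\}//" | sed 's/.../ & /g'
done
```

Reversing and complementing are independent character-level operations, so applying them in either order gives the same strand.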
Checking Results
- Using the ExPASy Translate Tool,
tctactatatttcaataggtacgatggccaaagaagacaatattgaacttgaaacgttgcctaataccatgttccgcgtataacccagccgccagttccgctggcggcattttaac
was entered and converted into the possible sequences of amino acids (output format was selected as compact). The 6 reading frames, as given by this tool, matched those found in the assignment
XMLPipeDB Match Practice
Preparations
- The program xmlpipedb-match-1.1.1.jar was found in ~dondi/xmlpipedb/data
- It was found that java programs can be run by using
java -jar <program name>
- xmlpipedb-match-1.1.1.jar would be run, for the purpose of matching patterns, by using
java -jar xmlpipedb-match-1.1.1.jar <pattern> < <filename>
- 493.P_falciparum.xml was found in ~dondi/xmlpipedb/data and examined using
cat 493.P_falciparum.xml
; it took quite some time to fully load (viewing it with more seems like a much better idea)
- I figured out that there is a search function in more, initiated by typing
/<search_text>
and pressing enter
Working with XMLPipeDB Match
- Match command for tallying the occurrences of the pattern GO:000[567] in 493.P_falciparum.xml
java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
can be used to match occurrences of GO:0005, GO:0006, and GO:0007
- 3 total unique matches were found: GO:0005, GO:0006, and GO:0007
- Occurrences of each unique match: 113 for GO:0007, 1100 for GO:0006, and 1371 for GO:0005
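A tally of this kind can be cross-checked with standard tools. The sketch below is not how Match works internally (that is not documented here); it only assumes a grep that supports `-o`, as GNU and BSD grep do. The helper name `tally` and the sample lines are invented for illustration and are not copied from 493.P_falciparum.xml.

```shell
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up lines for illustration only
cat > "$sample" <<'EOF'
<dbReference type="GO" id="GO:0005737"/>
<dbReference type="GO" id="GO:0006412"/> <dbReference type="GO" id="GO:0005829"/>
<dbReference type="GO" id="GO:0007165"/>
<dbReference type="GO" id="GO:0016020"/>
EOF

# Hypothetical helper: count each distinct match of regex $1 in file $2
tally() {
  # -o prints every match on its own line; uniq -c counts identical neighbours
  grep -o "$1" "$2" | sort | uniq -c
}

tally 'GO:000[567]' "$sample"
```

On this sample, GO:0005 is counted twice, GO:0006 and GO:0007 once each, and the GO:0016 line is ignored. The pattern is single-quoted so the shell does not treat the brackets as a filename glob.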
- Observing "in situ" occurrences of GO:000[567] in 493.P_falciparum.xml
more 493.P_falciparum.xml
was used to make viewing the file more manageable
- While in more, typing /GO:0006 and pressing enter brought a line containing the pattern GO:0006 to the top of the window (surrounded by the file's text)
- Based on the surrounding text, the pattern likely represents the beginning portion of an ID string tied to various genes in a gene database for Plasmodium falciparum
- In the text, it was found that various processes/metabolic pathways are connected to each database ID string (likely influenced by the genes in question)
- Match command for tallying the occurrences of the pattern \"Yu.*\" in 493.P_falciparum.xml
- 3 total unique matches were found: "yu b.", "yu k.", and "yu m."
- Occurrences of each unique match: 1 for "yu b.", 228 for "yu k.", and 1 for "yu m."
- I'm fairly certain that this pattern represents a person's name. By using
more 493.P_falciparum.xml
, and typing /\"Yu.*\", an example of an in-text line containing this pattern was found. It was observed that this pattern is preceded by <person name=
- Using Match and grep + wc to count occurrences of the pattern ATG in hs_ref_GRCh37_chr19.fa
- hs_ref_GRCh37_chr19.fa was found in ~dondi/xmlpipedb/data
java -jar xmlpipedb-match-1.1.1.jar ATG < hs_ref_GRCh37_chr19.fa
was employed to find the instances of ATG via Match
- Output: 1 unique match, atg, was found. There are 830101 instances of atg in the file
grep "ATG" hs_ref_GRCh37_chr19.fa | wc
was used to find the instances of ATG using grep + wc
- Output: 502410 lines, 502410 words, and 35671048 characters (the output of wc is unlabeled; it is always lines, words, and characters from left to right)
- There is a large difference between the outputs of Match and grep + wc when counting occurrences of ATG. The difference arises because Match counts every individual instance of the ATG pattern (possibly several in a line), while grep + wc only finds lines that contain at least one instance of ATG and counts those lines. wc reports the same number of lines and words because each line output by grep contains no spaces or breaks, so each line counts as a single word
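The line-versus-occurrence distinction can be reproduced on a tiny file. This is a minimal sketch, assuming a grep that supports `-o` (GNU and BSD grep do); the helper names `count_lines` and `count_matches` and the three sequence lines are made up for illustration.

```shell
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up sequence lines; the second line holds two ATG occurrences
printf 'CCATGGA\nATGTTATGC\nGGGCCC\n' > "$sample"

# Hypothetical helper: number of lines containing at least one match
count_lines() {
  grep "$1" "$2" | wc -l
}

# Hypothetical helper: number of individual matches (-o prints one per line)
count_matches() {
  grep -o "$1" "$2" | wc -l
}

count_lines ATG "$sample"     # 2 lines contain ATG
count_matches ATG "$sample"   # 3 occurrences in total
```

Both counts work line by line, so an ATG split across a line break in the file would be missed by either approach.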
Brandon Litvak
BIOL 367, Fall 2015