Kzebrows Week 3

Complement of a Strand

I decided to use the E. coli file for practice in the first part of the assignment. Initially, my instinct was to use sed substitutions to replace the letters in the sequence. I needed to replace A with T, T with A, C with G, and G with C; however, I realized that sed applies substitutions one after another, so every letter changed by one substitution would be changed right back by the next (for example, each A turned into T would be turned back into A by the T-to-A rule), defeating the purpose of the command. I then remembered that I can use sed "y/<original characters>/<new characters>/" to replace everything in one go.

I opened the prokaryote file using cat infA-E.coli-K12.txt, which gave me the DNA sequence of the mRNA-like strand. If this is read from 5’ to 3’, I needed to create the complementary strand. I then typed in the sed rule indicating what I wanted to replace, which gave me the complementary strand. The complete command was

cat infA-E.coli-K12.txt | sed "y/atcg/tagc/"
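To see why chained substitutions fail where "y" succeeds, here is a minimal sketch on a made-up four-letter sequence (not the real infA sequence). The function names are hypothetical and only for illustration.

```shell
# Chained substitutions: the second rule undoes the first,
# so the a's and t's do not end up swapped.
chained() {
  sed "s/a/t/g; s/t/a/g"
}

# Transliteration with y: every character is mapped exactly once.
complement() {
  sed "y/atcg/tagc/"
}

printf 'atgc\n' | chained      # a and t both end up as a
printf 'atgc\n' | complement   # true complement
```

On the toy input the chained version gives aagc, while the "y" version gives tacg, which is the actual complement.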

Reading Frames

First I opened the file and replaced all of the T's with U's using

sed "s/t/u/g"

which gave me the DNA sequence transcribed into mRNA. This was still one long string of letters, so I used

sed "s/.../& /g"

to indicate that I wanted a space every three letters, separating the sequence into codons.

Then, from looking at the file genetic-code.sed, which contains a separate list of each codon and the letter of the corresponding amino acid, I knew that this file needed to be added to the list of commands in order for its information to be used with the infA-E.coli-K12.txt file. The final string of commands for the +1 sequence then looks like this:

cat infA-E.coli-K12.txt | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed
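The same chain can be tried on a short made-up sequence to watch each step work. This is only a sketch: the input is a toy 9-nucleotide string, and the three -e rules are a hypothetical stand-in for genetic-code.sed (the real file covers every codon and may format its output differently).

```shell
# Toy version of the +1 pipeline.
# The -e rules below stand in for genetic-code.sed (assumption):
# aug = Met (M), gcc = Ala (A), ugg = Trp (W).
translate_toy() {
  sed "s/t/u/g" \
    | sed "s/.../& /g" \
    | sed -e "s/aug/M/g" -e "s/gcc/A/g" -e "s/ugg/W/g"
}

printf 'atggcctgg\n' | translate_toy
```

The input becomes auggccugg, then "aug gcc ugg " (with a space after each codon), and finally "M A W ".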

For the +2 reading frame I needed to figure out how to delete the first nucleotide in the sequence. I did this by adding the command

sed "s/^.//g"

This command has to come right after the file is opened. The caret (^) anchors the match to the beginning of the line, and each (.) after the caret deletes one character from there. I also realized that there would be a nucleotide or two left over at the end, so I needed to remove them so that only the codons that were translated into amino acids would show. Because the leftover nucleotides are the only lowercase a, c, u, or g characters remaining after translation, this is done by using this command at the end:

sed "s/[acug]//g"

I then proceeded with the rest of the commands so the list of commands looked like this:

cat infA-E.coli-K12.txt | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acug]//g"

To get the +3 reading frame I added one more (.) after the caret so the command was

sed "s/^..//g"

indicating that I wanted to delete the first TWO characters of the line. The entire command was:

cat infA-E.coli-K12.txt | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acug]//g"
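A small sketch of the frame shift and the cleanup step, again on a made-up 9-nucleotide input. The two -e rules are a hypothetical stand-in for genetic-code.sed (ugg = Trp, ccu = Pro), so this only illustrates the idea and is not the real translation table.

```shell
# Toy version of the +2 pipeline.
plus2_toy() {
  sed "s/^.//g" \
    | sed "s/t/u/g" \
    | sed "s/.../& /g" \
    | sed -e "s/ugg/W/g" -e "s/ccu/P/g" \
    | sed "s/[acug]//g"
}

printf 'atggcctgg\n' | plus2_toy
```

Dropping the first nucleotide leaves 8 letters, so the codon split gives "ugg ccu gg". The leftover "gg" is never translated and is stripped by the last sed, leaving "W P ".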

Next, I needed to get the -1, -2, and -3 reading frames. To do this I knew I needed to open the file, change the T's to U's, separate the nucleotides into 3-nucleotide codons, and apply the genetic-code.sed file. However, because these frames are on the bottom strand, I also knew I needed to take the complement of the sequence and reverse it. I added these two commands, which differentiate this set of commands from the +1 set.

To reverse the strand so it reads 5' to 3' from left to right:

| rev |

and then to replace the a with t, t with a, c with g, and g with c all in one command (so the sequence didn't just revert to the original)

sed "y/atcg/tagc/"

The final sequence of commands for -1 looked like this:

cat infA-E.coli-K12.txt | sed "y/atcg/tagc/" | rev | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acug]//g"

For the -2 and -3 reading frames I just needed to add the same commands as before: sed "s/^.//g" for -2 and sed "s/^..//g" for -3, with each (.) indicating the deletion of one nucleotide. I inserted this right after reversing the sequence, so that the nucleotides are removed from the 5' end of the reverse complement before the sequence is split into codons and translated later on in the set. For -2 the command set was

cat infA-E.coli-K12.txt | sed "y/atcg/tagc/" | rev | sed "s/^.//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acug]//g"

For the -3 reading frame the command set was

cat infA-E.coli-K12.txt | sed "y/atcg/tagc/" | rev | sed "s/^..//g" | sed "s/t/u/g" | sed "s/.../& /g" | sed -f genetic-code.sed | sed "s/[acug]//g"
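The reverse-complement step shared by all three negative frames can be checked by hand on a short made-up sequence. This is a minimal sketch; the function name is hypothetical, and it assumes the rev utility is installed.

```shell
# Complement each base, then reverse the line so the bottom strand
# reads 5' to 3' from left to right.
reverse_complement() {
  sed "y/atcg/tagc/" | rev
}

printf 'atggcc\n' | reverse_complement
```

For the toy input atggcc, the complement is taccgg and reversing it gives ggccat.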

I checked these sequences using the ExPASy Translate tool from the Bioinformatics Resource Portal and verified that they worked correctly.

XML PipeDB Match Utility Practice

What Match command tallies the occurrences of the pattern GO:000[567] in the 493.P_falciparum.xml file? java -jar xmlpipedb-match-1.1.1.jar GO:000[567] < 493.P_falciparum.xml
- How many unique matches are there? 3
- How many times does each unique match appear? The first appears 113 times, the second appears 1,100 times, and the third appears 1,371 times.
Try to find one such occurrence “in situ” within that file. Look at the neighboring content around that occurrence. An example is <dbReference type="GO" id="GO:0005086">.
- Describe how you did this. I ran grep "GO:000[567]" 493.P_falciparum.xml.
- Based on where you find this occurrence, what kind of information does this pattern represent? This pattern represents an identifier for a Gene Ontology (GO) term.
What Match command tallies the occurrences of the pattern \"Yu.*\" in the 493.P_falciparum.xml file? java -jar xmlpipedb-match-1.1.1.jar \"Yu.*\" < 493.P_falciparum.xml
- How many unique matches are there? 3
- How many times does each unique match appear?
  - "yu b.": 1
  - "yu k.": 228
  - "yu m.": 1
- What information do you think this pattern represents? I think that this pattern represents people's names. Based on the counts, "Yu K." appears far more frequently than "Yu B." or "Yu M."
Use Match to count the occurrences of the pattern ATG in the hs_ref_GRCh37_chr19.fa file (this may take a while). Then, use grep and wc to do the same thing.
- What answer does Match give you? 830,101
- What answer does grep + wc give you? 502,410
- Explain why the counts are different. (Hint: Make sure you understand what exactly is being counted by each approach.) grep prints each line that contains at least one "atg", and wc then counts those lines, so a line counts once no matter how many times "atg" occurs in it. The Match utility counts every individual occurrence of the pattern. This is why Match shows far more occurrences than grep + wc.
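The difference between counting lines and counting occurrences can be reproduced with grep alone on a tiny made-up input. This is only a sketch: the function names are hypothetical, and grep -o (print each match on its own line) is used here to imitate per-occurrence counting; it is not how Match itself works internally.

```shell
# Counts lines that contain the pattern at least once.
count_lines() {
  grep "atg" | wc -l
}

# Counts every occurrence: -o prints each match on its own line.
count_hits() {
  grep -o "atg" | wc -l
}

printf 'atgatg\ncccc\natg\n' | count_lines
printf 'atgatg\ncccc\natg\n' | count_hits
```

The toy input has two lines containing "atg" but three occurrences in total, so the first count is 2 and the second is 3.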