Difference between revisions of "Kzebrows Week 8"

Revision as of 21:46, 25 October 2015

Electronic Lab Notebook

Statistical Analysis of Vibrio cholerae Microarray Data

The instructions below are adapted from the Sample Microarray Analysis page and the Protocols page. This page is hosted by OpenWetWare.org.

Normalize the log ratios for the set of slides in the experiment

First, I scaled and centered the data. I inserted a new worksheet into my Excel file and named it "scaled_centered". I went back to the "compiled_raw_data" worksheet, selected all, copied, and pasted into the new "scaled_centered" worksheet starting at the upper left-hand cell (A1). I then inserted two rows between the header row and the first data row and typed "Average" in cell A2 and "StdDev" in cell A3. By entering the equation =AVERAGE(B4:B5224) in cell B2 and =STDEV(B4:B5224) in cell B3, I calculated the average and standard deviation of the log ratios for each chip. I then copied all of the data columns, pasted them to the right of the last column, and renamed them A1_scaled_centered, A2_scaled_centered, etc. In cell N4 I typed the equation =(B4-B$2)/(B$3), which is the log ratio minus the mean, divided by the standard deviation. The row references for the mean and standard deviation have dollar signs so that they stay fixed when the formula is copied down the column. I copied the formula down the entire column by double-clicking the bottom-right corner of the cell, a trick that Dr. Dahlquist taught us that I ended up using often in this assignment. I then used the same scaling and centering equation in all "scaled_centered" columns.
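The scaling and centering step can also be sketched outside Excel. The following is a minimal Python illustration, assuming one chip's log ratios are held in a list; the function name `scale_and_center` and the sample values are hypothetical and not taken from the data set. `statistics.stdev` computes the sample standard deviation, as Excel's STDEV does.

```python
import statistics

def scale_and_center(log_ratios):
    """Return (x - mean) / stdev for every log ratio on one chip."""
    mean = statistics.mean(log_ratios)
    stdev = statistics.stdev(log_ratios)  # sample standard deviation, like Excel's STDEV
    return [(x - mean) / stdev for x in log_ratios]

# Hypothetical log ratios for one chip (illustration only)
chip = [0.5, -1.2, 0.3, 2.0, -0.6]
scaled = scale_and_center(chip)
```

After this step each chip has a mean of 0 and a standard deviation of 1, which is what makes the chips comparable to one another.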

Perform statistical analysis on the ratios

I inserted a new worksheet and named it "statistics". I copied the ID column from the "scaled_centered" worksheet and pasted it into the new worksheet, then went back and copied all columns designated "scaled_centered" and pasted them as values into the new worksheet (starting in cell B1). I deleted rows 2 and 3 (average and standard deviation) and added a new column on the right with the header "Avg_LogFC_A", followed by two more columns of the same kind for B and C. In each of these columns I computed the average log fold change for that patient. Then, I computed the average of the patients' averages in the next column, called "Avg_LogFC_all". I inserted a new column next to that one and called it Tstat, into which I entered the equation =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)), and I copied this into all rows of the column. In this case, the number of replicates was 3. The next column I labeled "Pvalue". I entered the equation =TDIST(ABS(R2),2,2), where the first 2 is the degrees of freedom (number of replicates minus one) and the second 2 specifies a two-tailed test. I copied it into all rows of the column.
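The Tstat and Pvalue formulas can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method used in class: the function names and sample values are hypothetical, and the p value uses a closed-form expression that is valid only for a t distribution with exactly 2 degrees of freedom, p = 1 - |t| / sqrt(2 + t^2), which is the case handled by =TDIST(ABS(R2),2,2).

```python
import math
import statistics

def t_statistic(values):
    """One-sample t statistic against a mean of zero: mean / (stdev / sqrt(n))."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

def two_tailed_p_df2(t):
    """Two-tailed p value for a t distribution with exactly 2 degrees of freedom.

    Closed form that holds only for df = 2: p = 1 - |t| / sqrt(2 + t^2).
    """
    t = abs(t)
    return 1.0 - t / math.sqrt(2.0 + t * t)

# Hypothetical per-patient average log fold changes for one gene (illustration only)
patient_averages = [1.0, 2.0, 3.0]
t = t_statistic(patient_averages)
p = two_tailed_p_df2(t)
```

For a different number of replicates the closed form no longer applies, and a general t distribution routine from a statistics library would be needed instead.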

Calculating the Bonferroni p value Correction

Next, to calculate the Bonferroni p value, I labeled the next two columns to the right Bonferroni and Bonferroni_Pvalue. I typed the equation =S2*5221 into the Bonferroni column, where 5221 is the number of genes tested, and replaced any corrected p value greater than 1 with the number 1 by using the formula =IF(T2>1,1,T2) in the Bonferroni_Pvalue column.
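The two Bonferroni columns amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch, assuming the p values are in a list (the function name and sample values are hypothetical):

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]

# Hypothetical unadjusted p values (illustration only)
p_values = [0.0003, 0.02, 0.5]
corrected = bonferroni(p_values)
```

In the worksheet the multiplier is 5221 because that is the number of rows tested; in this sketch it is simply the length of the list.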

Calculate the Benjamini & Hochberg p value Correction

I inserted a new worksheet named B-H_Pvalue and copied the ID column from the statistics worksheet into it. I inserted a new column on the left and named it MasterIndex. I then typed a 1 in cell A2 and a 2 in cell A3 and selected both cells. By double-clicking the + sign at the bottom right of the selection, I filled the entire column with a series of numbers. I copied the unadjusted p values from the statistics worksheet and pasted them into column C. I selected columns A, B, and C and sorted them by the p values in column C in ascending order. I then typed Rank into cell D1 and ranked the rows from 1 to 5221 in this new column. To calculate the B-H p value, I typed B-H_Pvalue into cell E1 and the equation =(C2*5221)/D2 into cell E2. In cell F2, I typed the formula =IF(E2>1,1,E2) and copied it down all of column F under the column heading B-H_Pvalue. I then sorted columns A through F by MasterIndex in ascending order to restore the original gene order. I copied only column F, the B-H p value, and pasted the values into the next column on the right of the statistics worksheet.
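The worksheet steps above (sort by p value, rank, multiply by the number of genes, divide by the rank, cap at 1, restore the original order) can be sketched in Python. The function name and sample values are hypothetical. Note that the usual Benjamini and Hochberg adjustment additionally forces the adjusted values to be non-decreasing with rank; this sketch deliberately follows the worksheet formula and omits that step.

```python
def benjamini_hochberg_simple(p_values):
    """Return p * n / rank, capped at 1, in the original order of p_values."""
    n = len(p_values)
    # Indices sorted by ascending p value; the index plays the role of MasterIndex
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(p_values[i] * n / rank, 1.0)
    return adjusted

# Hypothetical unadjusted p values (illustration only)
p_values = [0.04, 0.001, 0.03, 0.5]
adjusted = benjamini_hochberg_simple(p_values)
```

Writing the results back by original index does the same job as re-sorting on MasterIndex in the spreadsheet.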

Prepare File for GenMAPP

I inserted a new worksheet and named it forGenMAPP. I selected all of the statistics worksheet, copied it, and pasted the values into the new worksheet. I selected columns B through Q and formatted them to 2 decimal places. I then selected all p value columns and formatted them to 4 decimal places. I deleted the left-most Bonferroni p value column and inserted a column to the right of the ID column, naming it SystemCode, and filled every cell in that column with the letter N.
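As a rough illustration of the file layout described above, the following Python sketch inserts a SystemCode column after the ID column and writes tab-delimited text. The helper name `to_genmapp_rows` and the reduced set of columns are assumptions made for brevity; the real worksheet has many more columns.

```python
import csv
import io

def to_genmapp_rows(header, rows, system_code="N"):
    """Insert a SystemCode column immediately after the ID column."""
    new_header = [header[0], "SystemCode"] + header[1:]
    new_rows = [[row[0], system_code] + row[1:] for row in rows]
    return [new_header] + new_rows

# Reduced, illustrative table: ID plus two of the statistics columns
header = ["ID", "Avg_LogFC_all", "Pvalue"]
rows = [["VC0028", "1.27", "0.0692"], ["VC0468", "-0.17", "0.3350"]]

# Write to an in-memory buffer; a real run would write to a .txt file instead
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(to_genmapp_rows(header, rows))
text = buffer.getvalue()
```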

I saved this as a txt file and as an Excel file in class. NOTE: Unfortunately, only my text file saved accurately. Because I used the txt file for GenMAPP, I was able to carry on with Part 2 of this assignment; however, I had to redo parts of my Excel file and re-upload it. Below are the text file that I used and the original Excel file, which did not save correctly, as well as the redone Excel spreadsheet.

File:Kzebrows microarray20151020.EX.txt

File:Kzebrows microarray20151020.xls

File:Kzebrows microarrayanalysis20151025.xls

Sanity Check: Number of genes significantly changed

In the Pvalue column, I filtered the data to show only genes with a p value less than 0.05.

How many genes have p value <0.05? What is the percentage? 948 genes or 18.16%
What about p<0.01? What is the percentage? 235 or 4.5%
What about p<0.001? What is the percentage? 24 or 0.46%
What about p<0.0001? What is the percentage? 2 or 0.04%

Then I filtered the data to determine the following:

How many genes are p<0.05 for the Bonferroni-corrected p-value? What is the percentage? 0, or 0%
How many genes are p<0.05 for the Benjamini and Hochberg-corrected p value? What is the percentage? 0, or 0%

I experimented with different filters for the Avg_LogFC_all column.

Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? 352, or 6.7%
Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? 596, or 11.4%
What about an average log fold change of >0.25 and p<0.05? 339, or 6.5%
Or an average log fold change of <-0.25 and p<0.05? 579, or 11.09%
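The filter counts above can be reproduced programmatically. The sketch below is a hypothetical Python illustration: the gene records, field names, and the helper `count_and_percent` are invented for the example and are not the real data.

```python
def count_and_percent(genes, predicate):
    """Count genes matching predicate and express the count as a percentage of all genes."""
    hits = sum(1 for gene in genes if predicate(gene))
    return hits, 100.0 * hits / len(genes)

# Hypothetical gene records (illustration only)
genes = [
    {"id": "g1", "avg_logfc": 0.80, "p": 0.010},
    {"id": "g2", "avg_logfc": -0.60, "p": 0.030},
    {"id": "g3", "avg_logfc": 0.10, "p": 0.040},
    {"id": "g4", "avg_logfc": 1.50, "p": 0.200},
]

significant = count_and_percent(genes, lambda g: g["p"] < 0.05)
increased = count_and_percent(genes, lambda g: g["p"] < 0.05 and g["avg_logfc"] > 0.25)
decreased = count_and_percent(genes, lambda g: g["p"] < 0.05 and g["avg_logfc"] < -0.25)
```

With the real data, the denominator would be the 5221 rows in the worksheet.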

What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method? Merrell et al. used a two-class SAM analysis (Significance Analysis of Microarrays). Genes were considered significantly changed if there was at least a twofold change in expression level. We used p values, that is, the probability that the observed changes in expression could be due to chance, to determine significance, while they used the magnitude of the change. We applied an average log fold change cut-off of ±0.25 together with p<0.05, which included more genes in our analysis (918 total), while Merrell's analysis was more stringent (they found only 237 genes that were statistically significant, indicating differential regulation).

Sanity Check: Compare individual genes with known data

In the next part of the sanity check, I looked up the following genes in my spreadsheet to compare them to the analysis done by Merrell et al. (2002), with the objective of finding their fold changes and p values and finding if they were significantly changed in my analysis. Fold change is column R and p value is column T.

VC0028 (two entries)
- Fold change: 1.27, 1.65
- P value: 0.0692, 0.0474
- The first entry is not significantly changed; the second is significantly changed

VC0941 (two entries)
- Fold change: -0.28, 0.09
- P value: 0.1636, 0.6759
- Not significantly changed

VC0869 (five entries)
- Fold change: 2.12, 1.50, 1.59, 1.95, 2.20
- P value: 0.02, 0.0174, 0.0463, 0.0227, 0.002
- All are significantly changed

VC0051 (two entries)
- Fold change: 1.89, 1.92
- P value: 0.016, 0.0139
- Both are significantly changed

VC0647 (three entries)
- Fold change: -1.11, -0.94, -1.05
- P value: 0.0003, 0.0125, 0.0051
- All are significantly changed

VC0468
- Fold change: -0.17
- P value: 0.3350
- Not significantly changed

VC2350
- Fold change:
- P value:

VCA0583
- Fold change:
- P value:

Part 2

I did 2009 and got 772 errors.
Mary did 2010 and got 121 errors.

List the top 10 gene ontology terms. Compare your list with your partner.

Cellular macromolecule metabolic process
Macromolecule metabolic process
Localization
Macromolecule biosynthetic process
Transporter activity
Biopolymer metabolic process
Cell projection organization
Cellular biopolymer metabolic process
Cellular biopolymer biosynthetic process
Cellular macromolecule biosynthetic process

Our results were completely different, with no overlap. Discoveries made in 2009 may have led to substantial changes in the gene annotations included in the 2010 upload.

List the GO terms associated with each of the genes. Are they the same as your partner's? Why or why not?

VC0028: No entries
VC0941: No entries
VC0869: No entries
VC0051: No entries
VC0647: mRNA catabolic process, RNA processing, cytoplasm, RNA binding, 3'-5' exoribonuclease activity, transferase activity, nucleotidyltransferase activity
VC0468: No entries
VC2350: No entries
VCA0583: Transport, outer membrane-bounded periplasmic space, transporter activity

List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.

I clicked on 3'-5' exoribonuclease activity. This protein is known as polyribonucleotide nucleotidyltransferase (PNP_VIBCH) and it is involved in degrading mRNA. To do this, it catalyzes the phosphorolysis of single-stranded polyribonucleotides in the 3' to 5' direction. Expression of this gene decreased significantly.

File:Kzebrows3’-5’-exoribonuclease activity.mapp

I set the following filters on the spreadsheet for a total of 22 results:

Z score greater than 2
PermuteP less than 0.05
Number changed greater than or equal to 5 and less than 100
Percent change greater than or equal to 21
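The four filter criteria above can be expressed as a single predicate. This is a hypothetical Python sketch; the dictionary keys and the sample rows are assumptions chosen for illustration and do not reproduce the actual MAPPFinder column names or results.

```python
def passes_filters(term):
    """Apply the four filter criteria to one GO term row."""
    return (
        term["z_score"] > 2
        and term["permute_p"] < 0.05
        and 5 <= term["number_changed"] < 100
        and term["percent_changed"] >= 21
    )

# Hypothetical GO term rows (illustration only)
terms = [
    {"name": "term A", "z_score": 3.1, "permute_p": 0.01, "number_changed": 12, "percent_changed": 30.0},
    {"name": "term B", "z_score": 2.5, "permute_p": 0.01, "number_changed": 3, "percent_changed": 50.0},
    {"name": "term C", "z_score": 1.8, "permute_p": 0.20, "number_changed": 40, "percent_changed": 25.0},
]

kept = [term["name"] for term in terms if passes_filters(term)]
```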

Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list? You can judge this by comparing your spreadsheet with the MAPPFinder browser. Highlight the terms that fit this relationship with the same color in your Excel spreadsheet. Upload your .xls file to your journal page. Yes, there were several related terms. Please see the filtered and highlighted Excel file for results.

Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae). You need to give a biological interpretation: what does each of the GO terms in your filtered list have to do with the pathogenicity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

There is one other file you need to save to your journal page. It has a .gmf extension and should be in the same folder as the .gex file that you created with the GenMAPP Expression Dataset Manager. You will need this file to re-open your results in MAPPFinder.

Files

File:Kzebrows microarray20151020.gex

