Kzebrows Week 8

Electronic Lab Notebook

Statistical Analysis of Vibrio cholerae Microarray Data

The instructions below are adapted from the Sample Microarray Analysis page and the Protocols page. This page is hosted by OpenWetWare.org.

Normalize the log ratios for the set of slides in the experiment

First, I scaled and centered the data. I inserted a new Worksheet into my excel file and named it "scaled_centered". I went back to the "compiled_raw_data" worksheet, selected All and Copy, and went to the new "scaled_centered" worksheet. I copy and pasted by clicking on the upper hand left cell (A1) and pasting. I then inserted two rows in between the top row of headers and the first data row and typed "Average" in cell A2 and "StdDev" in type A3. By using the equation =AVERAGE(B4:B5224) and pressing enter in cell B2 and typing STDEV(B4:B5224) in cell B3, I calculated the average and standard deviation of the log ratios for each chip. I then copy and pasted all columns to the right of the last column and renamed them all AI_scaled_centered, A2_scaled_centered, etc. In cell N4 I typed the equation =(B4-B$2)/(B$3) which is the log change minus the mean divided by the st. dev. The mean and standard deviation values both have dollar signs because we did not want to change the reference mean and standard deviation. I copy and pasted this into the entire column using ctrl + copy and double clicking on the corner to copy the formula across the column, a trick that Dr. Dahlquist taught us that I ended up using often in this assignment. I then used the same scaling and centering equation in all "scaled_centered" columns.

Perform statistical analysis on the ratios

I inserted a new worksheet and named it "statistics" and copied the first ID column from the scaling_centering worksheet. I pasted the data into the new Statistics worksheet and then went back and copied all columns designated "scaled_centered" and pasted them as values into the new worksheet (starting in cell B1). I deleted rows 1 and 2 (average and standard deviation) and added a new column on the right with the headers "Avg_LogFC_A" and then two more columns with the same formula for B and C. I computed the average log fold change for each patient in each of these columns. Then, I computed an average of the patients' averages in the next column called "Avg_LogFC_all" all. I then inserted a new column next to that column and called it Tstat, into which I entered the equation =Average(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and I copied and pasted this into all rows of the column. In this case, the number of replicates was 3. The next column I labeled "Pvalue". I entered the equation <code>=TDIST(ABS(R2),2,2) where the middle 2 is degrees of freedom. I copied and pasted it into all rows of the column.

Calculating the Bonferroni p value Correction

Next, to calculate the Bonferroni p value, I labeled the next two right columns Bonferroni and Bonferroni_Pvalue. I then typed the equation =S2 x 5221 into the column and replaced any corrected p value greater than 1 by the number 1 by using the formula =IF(T2>1,1,T2) in the Bonferroni_Pvalue column.

Calculate the Benjamini & Hochberg p value Correction

I inserted a new worksheet named B-H_Pvalue and copy and pasted the ID column from the statistics worksheet into this one. I inserted a new column on the left and named it MasterIndex. I then typed a 1 in cell A2 and a 2 in cell A3 and selected both cells. By double-clicking the + sign at the bottom right I filled the entire column with a series of numbers. I copied the unadjusted Pvalues from the previous worksheet and pasted them into Column C. I selected columns A, B, and C, and sorted them by ascending value. I ten typed Rank into cell D1 and ranked them from 1 to 5221 into this new column. To calculate the B-H p value I typed the equation =(C2*5221)/D2 into cell E1 and named the column B-H_Pvalue. In cell F2, I typed the formula =IF(E2>1,1,E2) and copied the equation in all of column F under the column heading B-H_Pvalue. I then sorted columns A through F in ascending order. I copied only column F, the B-H P value, and pasted the values into the next column on the right of the statistics worksheet.

Prepare File for GenMAPP

I inserted a new worksheet and named it forGenMAPP. I selected all from the statistics worksheet and copied the values and pasted them into the new worksheet. I selected columns B through Q and formatted them to 2 decimal places. I then selected all p value columns and formatted them to 4 decimal places. I deleted the left-most Bonferroni p value column and inserted a column to the right of the ID column, naming it SystemCode, and filled each column with the letter N.

I saved this as a txt file and as an Excel file in class. NOTE: Unfortunately, only my text file saved accurately. Because I used the txt file for GenMAPP, i was able to carry on for Part 2 of this assignment; however, I had to re-do parts of my Excel file and re-upload it. Below are the text file that I used and the original Excel file, which did not save correctly, as well as the re-done Excel spreadsheet.

File:Kzebrows microarray20151020.EX.txt

File:Kzebrows microarray20151020.xls

File:Kzebrows microarrayanalysis20151025.xls

File:Kzebrows microarray20151020.gex (Expression Dataset file)

Sanity Check: Number of genes significantly changed

In the p value column I filtered the data so the P value is less than 0.05

How many genes have p value <0.05? What is the percentage? 948 genes or 18.16%
What about p<0.01? What is the percentage? 235 or 4.5%
What about p<0.001? What is the percentage? 24 or 0.46%
What about p<0.0001? What is the percentage? 2 or 0.04%

Then I filtered the data to determine the following:

How many genes are p<0.05 for the Bonferroni-corrected p-value? What is the percentage? 0, or 0%
How many genes are p<0.05 for the Benjamini and Hochberg-corrected p value? What is the percentage? 0, or 0%

I experimented with different filters for the Avg_LogFC_all column.

Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? 352, or 6.7%
Keeping the unadjusted Pvalue filter at p<0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? 596, or 11.4%
What about an average log fold change of >0.25 and p<0.05? 339, or 6.5%
Or an average log fold change of <-0.25 and p<0.05? 579, or 11.01%

What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method? Merrell et al. used a two-class SAM analysis (Statistical Analysis for Microarrays). Genes were considered significantly changed if there was at least a twofold change in levels of expression. We used p-values to determine significant gene expression changes, or the probability that the changes in expression levels could be due to chance, while they used the actual levels of change to determine significance. We used the fold change cut-off with a range of 0.5 (-0.25 to +0.25) to include more genes in our analysis (918 total) while Merrell's analysis was more stringent (they only found 237 genes that were statistically significant, indicating differential regulation).

Sanity Check: Compare individual genes with known data

In the next part of the sanity check, I looked up the following genes in my spreadsheet to compare them to the analysis done by Merrell et al. (2002), with the objective of finding their fold changes and p values and finding if they were significantly changed in my analysis. Fold change is column R and p value is column T.

VC0028 Two entries
- Fold change: 1.27 for the first, 1.65 for the second
- P value: 0.0692 for the first, 0.0474 for the second
- Not significantly changed for the first, significant for the second.

VC0941 Two entries
- Fold change: 1) -0.28 and 2) 0.09
- P value: 1) 0.1636 and 2) 0.6759
- Not significantly changed
VC0869 Five entries
- Fold changes: 2.12, 1.50, 1.59, 1.95, 2.20
- P value: 0.02, 0.0174, 0.0463, 0.0227, 0.002
- All are significantly changed
VC0051
- Fold change: 1.89, 1.92
- P value: 0.016, 0.0139
- Both are significantly changed
VC0647 I have 3 entries for this ID.
- Fold change: -1.11, -0.94, -1.05
- P value: 0.0003, 0.0125, 0.0051
- All significantly changed
VC0468
- Fold change: -0.17
- P value: 0.3350
- Not significantly changed
VC2350
- Fold change:
- P value:
VCA0583
- Fold change:
- P value:

Part 2

Following instructions found on GenMAPP and MAPPFinder protocols I downloaded the gene database for V. cholerae for 2009. I saved the file into the folder C:\GenMAPP 2 Data\Gene Databases and unzipped it to extract the files. Next I launched the GenMAPP program and selected the Data Menu, from which I chose Expression Dataset Manager. I selected the tab-delimited file that I formatted for GenMAPP and allowed the Expression Dataset Manager to convert the data. A messaged appeared letting me know how many errors occurred, meaning that the manager could not convert those lines of data. I got 772 errors, while my partner Mary got 121 errors for the 2010 file. In the text file, the errors appeared as "Gene not found in OrderedLocusNames or any related system." It's possible that because my database was older than Mary's I would have more errors. The 2010 file was updated more recently and so names could have been changed, information added, or faulty information deleted since my file had been uploaded.

Next I customized the new Expression Dataset by creating new Color Sets that contained instructions to GenMAPP for displaying data. I created this by filling in different fields in the color set area of the Data Set manager. I named the color set "LogFoldChange" and created two criteria, increased and decreased. The Gene Value was "Avg_LogFC_all" for the dataset. The criteria were established using the formula [Avg_LogFC_All] > 0.25 AND [Pvalue] < 0.05 and [Avg_LogFC_All] < -0.25 AND [Pvalue] < 0.05. I coded "increased" as pink and "decreased" as purple. I then saved the data set and exited the Expression Dataset.

MappFinder Procedure

Mary and I signed up for the "increased" criterion.

I launched the MAPPFinder program and checked that the right name of the Gene Database was appearing at the bottom of the window. I clicked "Calculate New Results" and "Find File." I chose my Expression Dataset file and clicked OK. I chose the "increased" criteria and checked the "gene ontology" and "p value" boxes and clicked Browse to create a filename for my results.

I then ran MAPPFinder, which led me to a browser showing me all of my results. I clicked "Show Ranked List" to see the most significant Gene Ontology terms.

List the top 10 gene ontology terms. Compare your list with your partner.

Cellular macromolecule metabolic process
Macromolecule metabolic process
Localization
Macromolecule biosynthetic process
Transporter activity
Biopolymer metabolic process
Cell projection organization
Cellular biopolymer metabolic process
Cellular biopolymer biosynthetic process
Cellular macromolecule biosynthetic process

Our databases were completely different with no overlap. There may have been significant discoveries in 2009 that resulted in significant gene changes for the 2010 upload.

Next I needed to find the Gene Ontology terms with which a certain gene was associated. I clicked "Collapse the Tree" and searched for the genes mentioned by Merrell et. al (2002), VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583. I typed the identifier for the first gene, VC0028, and came up with no entries for that gene. I went through the list until I found one with entries, VC0647.

List the GO terms associated with each of the genes. Are they the same as your partner's? Why or why not?

VC0028: No entries
VC0941: No entries
VC0869: No entries
VC0051: No entries
VC0647: mRNA catabolic process, RNA processing, cytoplasm, RNA binding, 3'-5' exoribonuclease activity, transferase activity, nucleotidyltransferase activity
VC0468: No entries
VC2350: No entries
VCA0583: Transport, outer membrane-bounded periplasmic space, transporter activity

STILL NEED TO DO

List in your journal entry the name of the GO term you clicked on and whether the expression of the gene you were looking for changed significantly in the experiment.

I clicked on 3'-5' exoribonuclease activity. This protein is known as polyribonucleotide nucleotidyltransferase (PNP_VIBCH) and it is involved in degrading mRNA. To do this, it catalyzes the phosphorolysis of polyribonucleotides that are single-stranded from 3' to 5'. Expression of this gene decreased significantly.

File:Kzebrows3’-5’-exoribonuclease activity.mapp

File:Kzebrows microarray20151020.gmf

I set the following filters on the spreadsheet for a total of 22 results:

Z score greater than 2
PermuteP less than 0.05
Number changed greater than or equal to 5 and less than 100
Percent change greater than or equal to 21

Are any of your filtered GO terms closely related to one another, meaning are they a direct child or parent to another term in the list? You can judge this by comparing your spreadsheet with the MAPPFinder browser. Highlight the terms that fit this relationship with the same color in your Excel spreadsheet. Upload your .xls file to your journal page. Yes, there were several related files.

File:Kzebrows microarray20151020-Criterion0-GO.txt (original file)

File:Kzebrows microarray20151020-Criterion0-GO - Copy.xlsx (filtered and highlighted)

Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae. You need to give a biological interpretation of what do each of these GO terms in your filtered list have to to with the pathogenecity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

There is one other file you need to save to your journal page. It has a .gmf extension and should be in the same fold as the .gex file that you created with the GenMAPP Expression Dataset Manager. You will need this file to re-open your results in MAPPFinder.