New Wiki Page
From LMU BioDB 2013
Revision as of 00:26, 19 May 2015 by Kevinmcgee (Talk | contribs)
Contents |
5/18/2015
Microarray Data analysis Workflow
- Set browser to send downloads to Desktop
- Followed the Protocal found on OpenWetWare:
Installing R 3.1.0 and the limma package
The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).
- The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 (link to download site) and and version 3.20.1 of the limma package ( direct link to download zipped file) on the Windows 7 platform.
- Note that using other versions of R or the limma package might give different results.
- Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10-13 or 10-14 decimal place. The Dahlquist Lab is standardizing on using the 64-bit version of R.
- To install R for the first time, download and run the installer from the link above, accepting the default installation.
- To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program. If you accept the default location, that will be C:\Program Files\R\R-3.1.0\library (this will be different on the computers in S120 since you do not have administrator rights).
Running the Normalization Scripts
- Create a folder on your Desktop to store your files for the microarray analysis procedure.
- Download the zipped file that contains the
.gpr
files and save it to this folder (or move it if it saved in a different folder).- Unzip this file using 7-zip. Right-click on the file and select the menu item, "7-zip > Extract Here".
- Download the GCAT_Targets.csv file and Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv files and save them to this folder (or move them if they saved to a different folder).
- Download the Ontario_Chip_Within-Array_Normalization_modified_20150514.R script and save (or move) it to this folder.
- Download the Within-Array_Normalization_GCAT_and_Merged_Ontario-GCAT_Between-Chip_Normalization_modified_20150514.R script and save (or move) it to this folder.
Within Array Normalization for the Ontario Chips
- Launch R x64 3.1.0 (make sure you are using the 64-bit version).
- Change the directory to the folder containing the targets file and the GPR files for the Ontario chips by selecting the menu item File > Change dir... and clicking on the appropriate directory. You will need to click on the + sign to drill down to the right directory. Once you have selected it, click OK.
- In R, select the menu item File > Source R code..., and select the Ontario_Chip_Within-Array_Normalization_modified_20150514.R script.
- You will be prompted by an Open dialog for the Ontario targets file. Select the file Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv and click Open.
- Wait while R processes your files.
Within Array Normalization for the GCAT Chips and Between Array Normalization for All Chips
- These instructions assume that you have just completed the Within Array Normalization for the Ontario Chips in the section above.
- In R, select the menu item File > Source R code..., and select the Within-Array_Normalization_GCAT_and_Merged_Ontario-GCAT_Between-Chip_Normalization_modified_20150514.R script.
- You will be prompted by an Open dialog for the GCAT targets file. Select the file GCAT_Targets.csv and click Open.
- Wait while R processes your files.
- When the processing has finished, you will find two files called GCAT_and_Ontario_Within_Array_Normalization.csv and GCAT_and_Ontario_Final_Normalized_Data.csv in the same folder.
- Save these files to LionShare and/or to a flash drive.
Visualizing the Normalized Data
Create MA Plots and Box Plots for the GCAT Chips
Input the following code, line by line, into the main R window. Press the enter key after each block of code.
GCAT.GeneList<-RGG$genes$ID
lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))
- If you get a message saying "NaNs produced" this is OK, proceed to the next step.
r0<-length(lg[1,]) rx<-tapply(lg[,1],as.factor(GCAT.GeneList),mean) r1<-length(rx) MM<-matrix(nrow=r1,ncol=r0)
for(i in 1:r0) {MM[,i]<-tapply(lg[,i],as.factor(GCAT.GeneList),mean)}
MC<-matrix(nrow=r1,ncol=r0)
for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}
MCD<-as.data.frame(MC) colnames(MCD)<-chips rownames(MCD)<-gcatID
la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
- If you get these Warning messages, it's OK:
- 1: In (RGG$R - RGG$Rb) * (RGG$G - RGG$Gb) :
- NAs produced by integer overflow
- 2: NaNs produced
r2<-length(la[1,]) ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean) r3<-length(ri) AG<-matrix(nrow=r3,ncol=r2)
for(i in 1:r2) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}
par(mfrow=c(3,3))
for(i in 1:r2) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))} browser()
- Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, press Enter.
- To make sure that you save the clearest image, do not scroll in the window because a grey bar will appear if you do so.
- The next set of code is for the generation of the GCAT boxplots for the wild-type data.
x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean) y0<-length(MAG$A[1,]) x1<-length(x0) AAG<-matrix(nrow=x1,ncol=y0)
for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}
par(mfrow=c(3,3))
for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))} browser()
- Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, press Enter.
par(mfrow=c(1,3))
boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1,at=xy.coords(chips)$x,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1, at=xy.coords(chips)$x,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
- Maximize the window in which the plots have appeared. You may not want to actually maximize them because you might lose the labels on the x axis, but make them as large as you can. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
Create MA Plots and Box Plots for the Ontario Chips
Input the following code, line by line, into the main R window. Press the enter key after each block of code.
Ontario.GeneList<-RGO$genes$Name
lr<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))
- Warning message: "NaNs produced" is OK.
z0<-length(lr[1,]) v0<-tapply(lr[,1],as.factor(Ontario.GeneList),mean) z1<-length(v0) MT<-matrix(nrow=z1,ncol=z0)
for(i in 1:z0) {MT[,i]<-tapply(lr[,i],as.factor(Ontario.GeneList),mean)}
MI<-matrix(nrow=z1,ncol=z0)
for(i in 1:z0) {MI[,i]<-ds[i]*MT[,i]}
MID<-as.data.frame(MI) colnames(MID)<-headers rownames(MID)<-ontID
ln<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))
- Warning messages are OK:
- 1: In (RGO$R - RGO$Rb) * (RGO$G - RGO$Gb) :
- NAs produced by integer overflow
- 2: NaNs produced
z2<-length(ln[1,]) zi<-tapply(ln[,1],as.factor(Ontario.GeneList),mean) z3<-length(zi) AO<-matrix(nrow=z3,ncol=z2)
for(i in 1:z0) {AO[,i]<-tapply(ln[,i],as.factor(Ontario.GeneList),mean)}
strains<-c('wt','dCIN5','dGLN3','dHAP4','dHMO1','dSWI4','dZAP1','Spar')
- After entering the call browser() below, maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
- The last graph to appear will be the spar graphs.
- The graphs generated from this code are the before Ontario chips
- Be sure to save the 9 graphs before moving on to the next step
for (i in 1:length(strains)) { st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { par(mfrow=c(3,5)) } else { par(mfrow=c(4,5)) } for (i in lt) { plot(AO[,i],MI[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15)) } browser() }
- To continue generating plots, press enter.
j0<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean) k0<-length(MAO$A[1,]) j1<-length(j0) AAO<-matrix(nrow=j1,ncol=k0)
for(i in 1:k0) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}
- Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
- Again, the last graphs to appear will be the spar graphs.
- These graphs that are produced are for the after Ontario chips
- Again, be sure to save 9 graphs before moving on to the next part of the code.
for (i in 1:length(strains)) { st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { par(mfrow=c(3,5)) } else { par(mfrow=c(4,5)) } for (i in lt) { plot(AAO[,i],MD2[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15)) } browser() }
- To continue generating plots, press enter.
for (i in 1:length(strains)) { par(mfrow=c(1,3)) st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { xcoord<-xy.coords(lt)$x-1 fsize<-0.9 } else { xcoord<-xy.coords(lt)$x-1.7 fsize<-0.8 } boxplot(MID[,lt],main='Before Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) boxplot(MD2[,lt],main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) ft<-Otargets$MasterList[which(Otargets$Strain %in% st)] boxplot(MAD[,ft],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) browser() }
- To continue generating the box plots, press enter.
- You will have to save 9 plots before you have completed the procedure. The last box plot is for spar.
- Warnings are OK.
- Zip the files of the plots together and upload to LionShare and/or save to a flash drive.
Statistical Analysis
- Added the standard name and the master index for all the terms.
- Saved Compiled_Normalized_Data sheet and made a new sheet called Rounded_Normalized_Data
- ran computation
=ROUND(Compiled_Normalized_Data!D2,4)
for all data.
- Created Master Sheet, which has all knew data free of any computational functions
- Copied and pasted numbers with special paste: values
- In Master Sheet:
- Replaced #VALUE with a blank cell. There were 477 replacements
- Created a new worksheet and named it"(dgln3_ANOVA)
- Copied all of the data from the "Master_Sheet" worksheet for your strain and pasted it into the new worksheet.
- At the top of the first column to the right of Spar_LogFC_t120-4 (FD), five column headers were created of the form dgln3_AvgLogFC_(TIME) where (TIME) is 15, 30, 60, 90, 120.
- In the cell below the dgln3_AvgLogFC_t15 header, I typed
=AVERAGE(
- highlighted all the data in row 2 associated with dgln3_LogFC_t15 (AU2:AX2), press the closing paren key (shift 0),and press the "enter" key.
- This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
- Clicked on this cell and position your cursor at the bottom right corner. Double clicked, and the formula was copied to the entire column of 6188 other genes.
- Repeated steps (4) through (8) with the t30, t60, t90, and the t120 data.
- Create the column header dgln3_ss_HO in cell FJ1.
- In FJ2, I typed =SUMSQ(AU2:BN2)
- In FK1, create the column headers dgln3_ss_(TIME) as in (3).
- Make a note of how many data points you have at each time point for your strain.
- 15:4
- 30:4
- 60:4
- 90:4
- 120:4
- In FK2, type
=SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2
and hit enter.- The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
- The phrase <number of data points> should be replaced by the number of data points for that timepoint (either 3, 4, or 5).
- The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
- Actual Computation:=SUMSQ(AU2:AX2)-4*FE2^2
- Upon completion of this single computation, copy the formula throughout the column.
- Repeated this computation for the t30 through t120 data points.
- In FP1, create the column header dgln3_SS_full.
- In the first row below this header, type
=sum(<range of cells containing "ss" for each timepoint>)
and hit enter.- Actual Computation: =SUM(FK2:FO2)
- In the next two columns to the right, create the headers dgln3_Fstat and dgln3_p-value.
- Recall the number of data points from (13): call that total n.
- In the first cell of the dgln3_Fstat column, type
=((n-5)/5)*(dgln3_ss_HO-dgln3_SS_full)/dgln3_SS_full
and hit enter.- =((20-5)/5)*(FJ2-FP2)/FP2
- Copy to the whole column.
- In the first cell below the dgln3_p-value header, type
=FDIST(<(dgln3)_Fstat>,5,n-5)
Calculate the Bonferroni and p value Correction
- Labeled FS1 and FT1 dgln3_Bonferroni_p-value.
- Type the equation
=<dgln3_p-value>*6189
, Upon completion of this single computation, copy the formula throughout the column. - Replaced any corrected p value that is greater than 1 by the number 1 by typing the following formula into FT2
=IF(r2>1,1,r2)
. Copy the formula throughout the column.
Calculate the Benjamini & Hochberg p value Correction
- Insert a new worksheet named "dgln3_B&H".
- Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first two columns of the new worksheet.
- Copied unadjusted p values from ANOVA worksheet and pasted it into Column D.
- Selected all of columns A, B, C, and D. Sorted by ascending values on Column D. Clicked the sort button from A to Z on the toolbar, sorted by column C, smallest to largest.
- Typeed the header "Rank" in cell E1. Stretched this down to 6190.
- Calculated the Benjamini and Hochberg p value correction. Typed dgln3_B-H_p-value in cell F1. Typed the following formula in cell F2:
=(D2*6189)/E2
and copied that equation to the entire column. - Typed "dgln3_B-H_p-value" into cell G1.
- Typed the following formula into cell G2:
=IF(F2>1,1,F2)
Copied that equation to the entire column. - Selected columns A through G. Sorted them by your MasterIndex in Column A in ascending order.
- Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.
- Upload the .xlsx file that you have just created to LionShare. Send Dr. Dahlquist an e-mail with the link to the file (e-mail kdahlquist at lmu dot edu).
Sanity Check
- Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- Click on the drop-down arrow on dgln3_p-value. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
- p<0.05=1856 (
- p<0.01=1007
- p<0.001=398
- p<0.0001=121
- Bonderroni = 20
- B&H = 889