Laurmagee: Week 8
From LMU BioDB 2013
Part 1
- First download the following data file: File:Merrell Compiled Raw Data Vibrio.xls
- Insert a new spreadsheet in the label it "scaled_centered".
- Select all the data in your previous worksheet and copy it into the new spreadsheet
- Insert two rows in between the top row of headers and the first data row.
- In cell A2, type "Average" and in cell A3, type "StdDev".
- Compute the Average log ratio for each chip (each column of data). In cell B2, type the following equation: =AVERAGE(B4:B5224)
- Compute the Standard Deviation of the log ratios on each chip (each column of data). In cell B3, type the following equation: =STDEV(B4:B5224)
- Drag these lines into the rest of the columns, because Excel will automatically change the equation to match your specific column.
- Copy the column headings for all of your data columns and then paste them to the right of the last data column so that you have a second set of headers above blank colums of cells. Edit the names of the columns so that they now read: A1_scaled_centered, A2_scaled_centered, etc.
- In cell N4, type the following equation: =(B4-B$2)/B$3 The dollar signs are important because we will be using the same mean (B2) and standard deviation (B3) in all the succeeding columns values as well.
- Copy and paste the scaling and centering equation for each of the columns of data with the "_scaled_centered" column header. Be sure that your equation is correct for the column you are calculating (i.e. change the values to match the cells with the correct average and standard deviation).
- Insert a new worksheet and name it "statistics".
- Go back to the "scaling_centering" worksheet and copy the first column ("ID"). And paste these values into the first column of your new spreadsheet.
- Go back to the "scaling_centering" worksheet and copy the columns that are designated "_scaled_centered". Click on the B1 cell and select "Paste Special" from the Edit menu. A window will open: click on the radio button for "Values" and click OK.
- Go to a new column on the right of your worksheet. Type the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns.
- Compute the average log fold change for the replicates for each patient by typing the equation: =AVERAGE(B2:E2) into cell N2. Copy this equation and paste it into the rest of the column.
- Create the equation for patients B and C and paste it into their respective columns.
- Now you will compute the average of the averages. Type the header "Avg_LogFC_all" into the first cell in the next empty column.
- Create the equation that will compute the average of the three previous averages you calculated and paste it into this entire column.
- Insert a new column labeled "Tstat".
- Enter the equation: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(3)) and paste it into all rows in that column.
- Label the top cell in the next column "Pvalue" and enter the equation: =TDIST(ABS(R2),2,2) paste it into all rows in that column.
- Insert a new worksheet and name it "forGenMAPP".
- Go back to the "statistics" worksheet and Select All and Copy.
- Go to your new sheet and click on cell A1 and select Paste Special, click on the Values radio button, and click OK.
- Select Columns B through Q (all the fold changes). Select the menu item Format > Cells. Under the number tab, select 2 decimal places. Click OK.
- Select Columns R and S. Select the menu item Format > Cells. Under the number tab, select 4 decimal places. Click OK.
- Select Columns N through S and Cut. Select Column B by left-clicking on the "B" at the top of the column. Then right-click on the Column B header and select "Insert Cut Cells". This will insert the data without writing over your existing columns.
- Insert a column to the right of the "ID" column. Type the header "SystemCode" into the top cell of this column. Fill the entire column (each cell) with the letter "N".
- Select the menu item File > Save As, and choose "Text (Tab-delimited) (*.txt)" from the file type drop-down menu.
- To make sure that we performed our data analysis correctly, first open your spreadsheet and go to the "forGenMAPP" tab.
- Click on cell A1 and select the menu item Data > Filter > Autofilter.
- Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
- How many genes have p value < 0.05?
- 5221 genes.
- What about p < 0.01?
- 169 genes.
- What about p < 0.001?
- 19 genes.
- What about p < 0.0001?
- 1 gene.
- Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there?
- 300 genes
- Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there?
- 454 genes
- What about an average log fold change of > 0.25 and p < 0.05?
- 281 genes
- Or an average log fold change of < -0.25 and p < 0.05? (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)
- 442 genes
- What criteria did Merrell et al. (2002) use to determine a significant gene expression change? How does it compare to our method?
- Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?