Difference between revisions of "Vpachec3 Week 14"
Revision as of 20:06, 5 December 2015
Thursday, December 3
- Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
- Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
- Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like
=AVERAGE(C2,M2,W2,AG2)
We then dragged the small black square at the bottom corner of the cell down the column so that the function repeats and adjusts for the remaining rows.
- Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
- Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
- You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
- Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.
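The averaging and log-ratio steps above can also be sketched outside Excel. The following is a minimal illustration in Python, not the spreadsheet we actually used. The function names and all numbers are hypothetical, chosen only to show the arithmetic: average the four spots per chip, average the chips within each treatment, then subtract the biofilm average from the tobramycin average because the values are in log space.

```python
from statistics import mean


def average_technical_replicates(spots):
    """Average the log values of the replicate spots for one gene on one chip."""
    return mean(spots)


def log_ratio(tobramycin_values, biofilm_values):
    """Tobramycin-over-biofilm ratio in log space.

    Dividing two quantities corresponds to subtracting their logs, so the
    ratio is the tobramycin average minus the biofilm average.
    """
    return mean(tobramycin_values) - mean(biofilm_values)


# Hypothetical log values for one gene spotted four times on one chip.
spots = [1.0, 1.2, 0.8, 1.0]
chip_average = average_technical_replicates(spots)

# Hypothetical per-chip averages: 5 biofilm chips and 3 tobramycin chips.
biofilm = [0.5, 0.7, 0.6, 0.4, 0.8]
tobramycin = [1.1, 0.9, 1.0]
ratio = log_ratio(tobramycin, biofilm)

print(chip_average, ratio)
```

The two treatments are kept in separate lists throughout, which mirrors the point that biofilm and tobramycin values should never be averaged together.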
Sanity Check: Number of genes significantly changed
Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results of Merrell et al. (2002).
- Open your spreadsheet and go to the "forGenMAPP" tab.
- Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
- How many genes have p value < 0.05? and what is the percentage (out of 7251)?
- 4318 genes which is 60%
- What about p < 0.01? and what is the percentage (out of 7251)?
- 2971 genes which is 41%
- What about p < 0.001? and what is the percentage (out of 7251)?
- 1460 genes which is 20%
- What about p < 0.0001? and what is the percentage (out of 7251)?
- 645 genes which is 9%
- When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
- We have just performed 7251 t tests for significance, one for each gene counted above. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our t tests, or about 363 times, by chance alone. (Test your understanding: http://xkcd.com/882/.) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?
- 179 genes which is 2.5%
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?
- 605 genes which is 8.3%
- In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
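To make the difference in stringency concrete, here is a minimal sketch of the two corrections in Python. It is an illustration under stated assumptions, not the spreadsheet formulas we used: the function names and p values are hypothetical, and the Benjamini-Hochberg version shown is the common "adjusted p value" form (p times the number of tests divided by the rank, then made monotone), which may differ in detail from the Excel steps of the Vibrio exercise.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger raw p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_below(values, cutoff=0.05):
    """Number of values strictly below the cut-off."""
    return sum(1 for v in values if v < cutoff)


# Hypothetical unadjusted p values for five genes.
p = [0.001, 0.008, 0.02, 0.03, 0.5]

n_unadjusted = count_below(p)
n_bonferroni = count_below(bonferroni(p))
n_bh = count_below(benjamini_hochberg(p))

print(n_unadjusted, n_bonferroni, n_bh)
```

With these made-up values, fewer genes pass the Bonferroni cut-off than the Benjamini-Hochberg cut-off, which is the same ordering we saw in our own counts.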
- The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
How many are there? (and %)
- 3279 genes which is 45%
- Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
How many are there? (and %)
- 3127 genes which is 43%
- What about an average log fold change of > 0.25 and p < 0.05? (and %)
- 1613 genes which is 22%
- Or an average log fold change of < -0.25 and p < 0.05? (and %) (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
- 1519 genes which is 21%
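The filtering and percentage questions above can be sketched in Python as follows. This is only an illustration: the rows, gene names, and function name are hypothetical, and the conversion at the end assumes the log ratios are base 2, which is consistent with a cut-off of 0.25 corresponding to roughly a 20% change.

```python
def count_and_percent(rows, predicate, total=None):
    """Count rows passing a filter and express the count as a percentage.

    `total` defaults to the number of rows; pass it explicitly to use a
    fixed denominator such as the total number of genes.
    """
    total = len(rows) if total is None else total
    n = sum(1 for row in rows if predicate(row))
    return n, 100.0 * n / total


# Hypothetical rows standing in for the "forGenMAPP" sheet.
rows = [
    {"id": "gene1", "log_fc": 0.80, "p": 0.010},
    {"id": "gene2", "log_fc": 0.10, "p": 0.030},
    {"id": "gene3", "log_fc": -0.50, "p": 0.020},
    {"id": "gene4", "log_fc": 0.40, "p": 0.200},
]

# Same idea as stacking Excel autofilters: both conditions must hold.
up = count_and_percent(rows, lambda r: r["p"] < 0.05 and r["log_fc"] > 0.25)
down = count_and_percent(rows, lambda r: r["p"] < 0.05 and r["log_fc"] < -0.25)

# Assuming base-2 logs, a log fold change of 0.25 is a fold change of 2**0.25.
fold_change_at_cutoff = 2 ** 0.25

print(up, down, round(fold_change_at_cutoff, 2))
```

Note that both conditions are applied to the same row, so a gene is counted only if it passes the p value filter and the fold change filter together.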
Sanity Check: Compare individual genes with known data
- Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. What are their fold changes and p values? Are they significantly changed in our analysis?
VC0028
Fold Change: 1.65, 1.27
P-Value: 0.0474, 0.0692
Significance: statistically significant, not statistically significant
VC0941
Fold Change: 0.09, -0.28
P-Value: 0.6759, 0.1636
Significance: not statistically significant, not statistically significant
VC0869
Fold Change: 1.59, 1.95, 2.20, 1.50, 2.12
P-Value: 0.0463, 0.0227, 0.0020, 0.0174, 0.0200
Significance: statistically significant, statistically significant, statistically significant, statistically significant, statistically significant
VC0051
Fold Change: 1.92, 1.89
P-Value: 0.0139, 0.0160
Significance: statistically significant, statistically significant
VC0468
Fold Change: -0.17
P-Value: 0.3350
Significance: not statistically significant
VC2350
Fold Change: -2.40
P-Value: 0.0130
Significance: statistically significant
VCA0583
Fold Change: 1.06
P-Value: 0.1011
Significance: not statistically significant