Difference between revisions of "Kevin Wyllie Week 14"
From LMU BioDB 2015
(Added my good practice links.) |
(Added more protocol.) |
||
Line 3: | Line 3: | ||
=== Statistical Analysis and Formatting === | === Statistical Analysis and Formatting === | ||
− | '''We looked at Dr. Dahlquist's comments on our prior worksheet to make a second attempt at processing and formatting for GenMAPP. | + | '''We looked at Dr. Dahlquist's comments on our prior worksheet to make a second attempt at processing and formatting for GenMAPP. Our biggest error in the the previous attempt was forgetting to account for the fact that each gene was spotted in (technical) quadruplicates.''' |
# Dr. Dahlquist prepared a sheet for us to start with, which separated each quadruplicate of the genome across columns. So the ID column for the first technical replicate was in column A, followed by fold change data for one of the four technical replicates within each of the biological replicates, for the (untreated) biofilm and tobramycin-treated biofilm samples. Next was the second ID column (identical to the first) in column J, followed by the second set of technical replicates within each biological replicate. These sheet was named "'''quadruplicated_spots_separated'''". | # Dr. Dahlquist prepared a sheet for us to start with, which separated each quadruplicate of the genome across columns. So the ID column for the first technical replicate was in column A, followed by fold change data for one of the four technical replicates within each of the biological replicates, for the (untreated) biofilm and tobramycin-treated biofilm samples. Next was the second ID column (identical to the first) in column J, followed by the second set of technical replicates within each biological replicate. These sheet was named "'''quadruplicated_spots_separated'''". | ||
Line 10: | Line 10: | ||
#* Because all five untreated and all three treated samples will be averaged later in the protocol, the word "consolidated" (instead of "average") was used in the headings for these columns. For example, the column heading for the first biological untreated sample was "consolidated_Biofilm_1_scaled_centered_4". | #* Because all five untreated and all three treated samples will be averaged later in the protocol, the word "consolidated" (instead of "average") was used in the headings for these columns. For example, the column heading for the first biological untreated sample was "consolidated_Biofilm_1_scaled_centered_4". | ||
# Next, a new sheet was created, named "'''statistics'''". Pasted into this sheet (using the "Paste Values" function) were the "consolidated" columns (preceded by the ID column). Thus, the ID's were in column A while the consolidated fold changes for each biological replicate were in columns B through J. The average across the five untreated samples were calculated in column K (using the previously mentioned command), with the header "AVG_Biofilm_scaled_centered". In column L, the same was done using the three treated samples. | # Next, a new sheet was created, named "'''statistics'''". Pasted into this sheet (using the "Paste Values" function) were the "consolidated" columns (preceded by the ID column). Thus, the ID's were in column A while the consolidated fold changes for each biological replicate were in columns B through J. The average across the five untreated samples were calculated in column K (using the previously mentioned command), with the header "AVG_Biofilm_scaled_centered". In column L, the same was done using the three treated samples. | ||
− | # In column M, the ratios between the averages for treated and untreated samples were calculated. However, because these fold changes are in log space the fold changes for the treated samples were ''subtracted'' from that of the untreated samples. | + | # In column M, the ratios between the averages for treated and untreated samples were calculated. However, because these fold changes are in log space the fold changes for the treated samples were ''subtracted'' from that of the untreated samples. The header for this column was "Biofilm_Tobramycin_ratio". |
#* An example command for the gene in 2, is <code>=L2-K2</code>. | #* An example command for the gene in 2, is <code>=L2-K2</code>. | ||
− | # Next, the p-value's for each fold change ratio was calculated. A type-3, 2-tailed T-test was used. | + | # Next, the p-value's for each fold change ratio was calculated in column M. A type-3, 2-tailed T-test was used. |
#* The command for the first gene (row 2) is <code>=TTEST(C2:G2,H2:J2,2,3)</code>. As before, this was pasted through all genes. | #* The command for the first gene (row 2) is <code>=TTEST(C2:G2,H2:J2,2,3)</code>. As before, this was pasted through all genes. | ||
− | # The next sheet was named "'''bonferroni_pval'''". | + | # The next sheet was named "'''bonferroni_pval'''". Pasted into this sheet were the gene ID's (column A), the biofilm-tobramycin fold change ratios (column B) and the previously calculated P-values (column C). |
− | + | # The header for column D was "bonferroni_p_value," and in this column were the Bonferroni adjusted P-values. To calculate this value, each previous P-value was multiplied by 7,251 (the number of genes on the sheet). | |
+ | #* For example, the command for the first gene (cell D2) was <code>=C2*7251</code>. | ||
+ | # Because this yields many P-values which are over 1 (which statistically speaking, should be impossible), a second "bonferroni_p_value" column was added next to the previous one (column E). In this column, P-values over 1 were replaced with a 1. | ||
+ | #* The command for cell E2 was <code>=IF(D2>1,1,D2)</code>. | ||
+ | # The next sheet was named "'''BH_pval'''". | ||
+ | # First, the same information pasted into the "bonferroni_p_value" sheet was pasted into this new sheet. | ||
+ | # A column was then inserted between columns A and B. The header for this new column (now column B), was "Master Index." With the gene ID's in numerical order, an ascending count was added to the Master Index column. This was done by adding values 1 and 2 to cells B2 and B3 respectively, highlighting these two cells, and right-clicking on the black square at the bottom-right of the highlighted region. This will automatically fill in the ascending count. | ||
+ | # A filter was then added to the P-value column (these are the unadjusted P-values, not the previously calculated Bonferroni-adjusted ones), and was used to sort this column in ascending order. | ||
=== Sanity Check === | === Sanity Check === |
Revision as of 19:58, 7 December 2015
Electronic Lab Notebook
Statistical Analysis and Formatting
We looked at Dr. Dahlquist's comments on our prior worksheet to make a second attempt at processing and formatting for GenMAPP. Our biggest error in the the previous attempt was forgetting to account for the fact that each gene was spotted in (technical) quadruplicates.
- Dr. Dahlquist prepared a sheet for us to start with, which separated each quadruplicate of the genome across columns. So the ID column for the first technical replicate was in column A, followed by fold change data for one of the four technical replicates within each of the biological replicates, for the (untreated) biofilm and tobramycin-treated biofilm samples. Next was the second ID column (identical to the first) in column J, followed by the second set of technical replicates within each biological replicate. These sheet was named "quadruplicated_spots_separated".
- After all four sets of technical replicates, yet another ID column (column AK) was added, following by the averaged fold change values across each technical replicate (for each biological replicate - columns AL through AS).
- For these averages the
=AVERAGE(*four technical replicates*)
command was used and pasted through the entirety of each column. The *four technical replicates* is a placeholder for the cell coordinates corresponding to each technical replicate within a biological replicate. - Because all five untreated and all three treated samples will be averaged later in the protocol, the word "consolidated" (instead of "average") was used in the headings for these columns. For example, the column heading for the first biological untreated sample was "consolidated_Biofilm_1_scaled_centered_4".
- For these averages the
- Next, a new sheet was created, named "statistics". Pasted into this sheet (using the "Paste Values" function) were the "consolidated" columns (preceded by the ID column). Thus, the ID's were in column A while the consolidated fold changes for each biological replicate were in columns B through J. The average across the five untreated samples were calculated in column K (using the previously mentioned command), with the header "AVG_Biofilm_scaled_centered". In column L, the same was done using the three treated samples.
- In column M, the ratios between the averages for treated and untreated samples were calculated. However, because these fold changes are in log space the fold changes for the treated samples were subtracted from that of the untreated samples. The header for this column was "Biofilm_Tobramycin_ratio".
- An example command for the gene in 2, is
=L2-K2
.
- An example command for the gene in 2, is
- Next, the p-value's for each fold change ratio was calculated in column M. A type-3, 2-tailed T-test was used.
- The command for the first gene (row 2) is
=TTEST(C2:G2,H2:J2,2,3)
. As before, this was pasted through all genes.
- The command for the first gene (row 2) is
- The next sheet was named "bonferroni_pval". Pasted into this sheet were the gene ID's (column A), the biofilm-tobramycin fold change ratios (column B) and the previously calculated P-values (column C).
- The header for column D was "bonferroni_p_value," and in this column were the Bonferroni adjusted P-values. To calculate this value, each previous P-value was multiplied by 7,251 (the number of genes on the sheet).
- For example, the command for the first gene (cell D2) was
=C2*7251
.
- For example, the command for the first gene (cell D2) was
- Because this yields many P-values which are over 1 (which statistically speaking, should be impossible), a second "bonferroni_p_value" column was added next to the previous one (column E). In this column, P-values over 1 were replaced with a 1.
- The command for cell E2 was
=IF(D2>1,1,D2)
.
- The command for cell E2 was
- The next sheet was named "BH_pval".
- First, the same information pasted into the "bonferroni_p_value" sheet was pasted into this new sheet.
- A column was then inserted between columns A and B. The header for this new column (now column B), was "Master Index." With the gene ID's in numerical order, an ascending count was added to the Master Index column. This was done by adding values 1 and 2 to cells B2 and B3 respectively, highlighting these two cells, and right-clicking on the black square at the bottom-right of the highlighted region. This will automatically fill in the ascending count.
- A filter was then added to the P-value column (these are the unadjusted P-values, not the previously calculated Bonferroni-adjusted ones), and was used to sort this column in ascending order.
Sanity Check
There are 7251 genes in the sheet.
- How many genes have p value < 0.05? And what is the percentage?
- 4318, ~60%
- What about p < 0.01?
- 2971, ~41%
- What about p < 0.001?
- 1460, ~20%
- What about p < 0.0001?
- 645, ~9%
- How many genes are p < 0.05 for the Bonferroni-corrected p value?
- 179, ~2.4%
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value?
- 609, ~8.4%
- "Pvalue" < 0.05, and "Biofilm_Tobramycin_ratio" > 0.
- "Pvalue" < 0.05, and "Biofilm_Tobramycin_ratio" > 0.
This is a more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology:
- "Biofilm_Tobramycin_ratio" > 0.25 and "Pvalue" < 0.05.
- "Biofilm_tobramycin_ratio" < -0.25 and "Pvalue" < 0.05.
Links
Weekly Group Assignments | Shared Group Journals | Project Links | Team Members |
---|---|---|---|
|
|
|
|
- Kevin Wyllie Week 2 (See the original assignment and class journal.)
- Kevin Wyllie Week 3 (See the original assignment and class journal.)
- Kevin Wyllie Week 4 (See the original assignment and class journal.)
- Kevin Wyllie Week 5 (See the original assignment and class journal.)
- Kevin Wyllie Week 6 (See the original assignment and class journal.)
- Kevin Wyllie Week 7 (See the original assignment and class journal.)
- Kevin Wyllie Week 8 (See the original assignment and class journal.)
- Kevin Wyllie Week 9 (See the original assignment and class journal.)
- Kevin Wyllie Week 10 (See the original assignment.)
- Kevin Wyllie Week 11 (See the original assignment.)
- Kevin Wyllie Week 12 (See the original assignment.)
- Kevin Wyllie Week 14 (See the original assignment.)
- Kevin Wyllie Week 15 (See the original assignment.)