Difference between revisions of "Data Analysis"

Latest revision as of 09:53, 26 April 2024

Final Project Links
Overview	Deliverables	Guilds	Project Manager	Quality Assurance	Data Analysis	Coder/Designer
Overview	Deliverables	Team	Yeast Beasts

The role of the Data Analyst will be to apply the data analysis pipeline that you learned by analyzing the Dahlquist Lab microarray dataset to complete the analysis of a different published yeast timecourse microarray dataset. The Data Analysts are the end-users of the project, ultimately determining whether the work of the coder/designer and quality assurance members is useful to them.

Guild Members

Katie & Charlotte

Milestones

The milestones do not necessarily correspond to particular days/weeks; instead they are sets of tasks grouped together.

Data Analysts can have a shared individual journal entry. Both students will be given the same grade and are expected to contribute equally to the electronic lab notebook.
Take detailed notes throughout, consistent with reproducible research, so that they can contribute to the final deliverables.

Milestone 1: Journal Club Presentation

The Data Analysts will work with their teams to create and deliver a Journal Club presentation about their team's assigned paper.

Milestone 2: Getting the data ready for analysis

Download and examine the microarray dataset, comparing it to the samples and experiment described in your journal club article.
- Link to processed dataset from SGD.
- For your reference, this is the link to the dataset at the GEO Database: GSE26169. However, we will use the dataset processed by SGD.
Along with the QAs, make a "sample-data relationship table" that lists all of the samples (microarray chips), noting the treatment, time point, and replicate number.
- Come up with consistent column headers that summarize this information
  - For example, the Dahlquist Lab microarray data used strain_LogFC_timepoint-replicate number, as in wt_LogFC_t15-1.
  - Do not use any special characters except for "-" or "_" (e.g., no commas, etc.)
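The naming rule above can be sketched in Python as a quick check before the headers go into the workbook. This is only an illustration, not part of the course protocol: the function names, the treatment labels, and the timepoints in the demo are hypothetical, and the pattern simply follows the wt_LogFC_t15-1 example.

```python
import re

# Only letters, digits, "-" and "_" are allowed in a header.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def make_header(treatment, timepoint, replicate):
    """Build a header such as 'wt_LogFC_t15-1' (hypothetical helper)."""
    header = f"{treatment}_LogFC_t{timepoint}-{replicate}"
    if not ALLOWED.fullmatch(header):
        raise ValueError(f"header contains special characters: {header!r}")
    return header


def make_headers(treatments, timepoints, replicates):
    """Group by treatment, then by increasing timepoint, then by replicate."""
    return [
        make_header(treatment, timepoint, replicate)
        for treatment in treatments
        for timepoint in sorted(timepoints)
        for replicate in range(1, replicates + 1)
    ]


if __name__ == "__main__":
    # Made-up design: two treatments, three timepoints, two replicates.
    for header in make_headers(["control", "CHP"], [0, 15, 30], 2):
        print(header)
```

Generating the headers from the sample-data relationship table, rather than typing them by hand, keeps treatments and replicates grouped in the same order every time.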
Organize the data in a worksheet in an Excel workbook so that:
- ID (SGD systematic name) is in the first column
- Data columns are to the right, in increasing chronological order, using the column header pattern you created
- Treatments are grouped together
- Replicates are grouped together
- Delete the "EWEIGHT" row and "GWEIGHT" column.
These data are from an Affymetrix single-color chip, so the values are intensity values rather than ratios. We need to convert the data into Log₂ fold changes (LogFC).
- The intensity values in the file have already been log-transformed, which we need to undo before we calculate the ratios.
- Create new column headers and then transform all of the data with the formula =2^<cell designation>
- Compute the average value of the t0 timepoint for the control and CHP-treated data.
- Calculate the fold change by dividing each value for each timepoint by the average t0 value for the respective treatment (control or CHP-treated).
- Log₂ transform the fold changes.
- The data are now ready for the next step.
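For readers who want to check the spreadsheet arithmetic, the conversion above can be sketched for a single gene in Python. This is a minimal sketch under stated assumptions: the function name and the column headers in the usage example are hypothetical, and the input is assumed to hold log₂ intensities for one treatment only (control or CHP-treated).

```python
import math


def to_log_fold_changes(log2_intensities, t0_columns):
    """Convert one gene's log2 intensities to log2 fold changes versus mean t0.

    log2_intensities: dict mapping column header -> log2 intensity
    t0_columns: headers of the t0 replicates for the same treatment
    """
    # Undo the log transform (the spreadsheet step =2^<cell designation>).
    raw = {col: 2 ** value for col, value in log2_intensities.items()}
    # Average the t0 replicates for this treatment.
    t0_mean = sum(raw[col] for col in t0_columns) / len(t0_columns)
    # Divide each value by the t0 average, then log2 transform the ratio.
    return {col: math.log2(value / t0_mean) for col, value in raw.items()}
```

For example, with made-up values `{"CHP_t0-1": 8.0, "CHP_t0-2": 8.0, "CHP_t15-1": 9.0}` and t0 columns `["CHP_t0-1", "CHP_t0-2"]`, the t0 replicates come out at 0 and the t15 value at 1, since the raw intensity doubled relative to the t0 average.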

Milestone 3: ANOVA analysis

Perform an ANOVA analysis of the data (including the Bonferroni and Benjamini-Hochberg corrections), as you did on Week 9 for the Dahlquist lab data.
- Perform the ANOVA separately for the control data and for the CHP-treated data.
- Note that you will need to adjust your formulas to take into account the different number of timepoints and replicates in your article's dataset.
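As a cross-check on the spreadsheet, the two multiple-testing corrections can be sketched in Python. This is an illustration only: it implements the standard Bonferroni and Benjamini-Hochberg step-up adjustments, which may differ in detail from the Week 9 spreadsheet formulas (for example, in whether adjusted values are forced to be monotonic), and it assumes the unadjusted ANOVA p values have already been computed.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Standard Benjamini-Hochberg adjusted p values, in the input order."""
    n = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, so that each adjusted value
    # is never larger than the adjusted value ranked above it.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Run each function once on the control p values and once on the CHP-treated p values, since the two ANOVAs are performed separately.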
Perform a "Sanity Check" and create a table of p-value counts like you did for the Week 9 assignment. This time, do the Sanity Check on the Benjamini-Hochberg corrected p-values instead of the unadjusted p-values.
- You will need to determine a suitable p-value cut-off for the clustering analysis. A suitable cut-off will include about 25% of the total number of genes.
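The count table and the search for a cut-off can be sketched in Python as follows. This is a rough sketch, not the Week 9 procedure itself: the default thresholds and the candidate cut-offs are assumptions chosen for illustration, and the function names are hypothetical.

```python
def count_below(p_values, thresholds=(0.05, 0.01, 0.001, 0.0001)):
    """Count how many genes have a p value below each threshold."""
    return {t: sum(1 for p in p_values if p < t) for t in thresholds}


def fraction_below(p_values, cutoff):
    """Fraction of all genes whose p value is below the cut-off."""
    return sum(1 for p in p_values if p < cutoff) / len(p_values)


def pick_cutoff(p_values, candidates, target=0.25):
    """Return the candidate cut-off whose passing fraction is closest to target."""
    return min(candidates, key=lambda c: abs(fraction_below(p_values, c) - target))
```

Passing the Benjamini-Hochberg corrected p values to `count_below` gives the Sanity Check table, and `pick_cutoff` with `target=0.25` points to the candidate that keeps roughly a quarter of the genes for clustering.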

Milestone 4: Clustering with stem and YEASTRACT

Cluster the CHP-treated data with stem, as you did on Week 10. Collect the screenshots of the main clustering results and the plots of all significant clusters. What are the main patterns that you see? Are they similar to what was reported in the Sha et al. (2013) paper?
- Note that we will make some adjustments to the GO term analysis because stem was not providing GO term names. We are going to use the GO enrichment tool at GeneOntology.org instead.
1. Go to http://geneontology.org/.
2. Select two clusters with different patterns that have a reasonably large number of genes. For each cluster you want to analyze, open the gene list and copy the list of genes.
3. Paste the list of genes into the "GO Enrichment Analysis" box on the right-hand side of the GeneOntology.org page.
4. Select "Saccharomyces cerevisiae" from the species drop-down menu.
5. Click the "Launch" button.
6. Near the bottom of the results page, click on the button to Export "Table".
7. This will prompt you to save a .txt file that can be opened in Excel to view your results.
Use YEASTRACT to generate a list of candidate regulatory transcription factors, as you did in Week 10.

Milestone 5: Create a candidate gene regulatory network and input workbook for GRNmap using MS Access database

Create an input workbook for GRNmap using queries to the Microsoft Access database that the Coder/Designer and QAs make. The main worksheets you will need to create are as follows:
- production_rates
- degradation_rates
- wt_log2_expression (based on the CHP-treated data)
- network
Run GRNmap in Dr. Dahlquist's research lab (make an appointment) and interpret the results.
As the end-user of the Access database, the Data Analysts will provide feedback to the QAs and Coder/Designer about the usability of the database.

Using Microsoft Access Query Design

This is a loose set of instructions on how to use your Microsoft Access database to make the GRNmap input workbook.

Import into the database a table that lists the regulatory transcription factors that need to be included in the network (get this list from the Data Analysis team).
Go to the Query Design view and select the tables that you need for the query. (For example, the TF table you just imported and the production_rates table).
Link the ID fields that are equivalent.
Right-click on the line between the fields and set the join properties:
- Include all the records from the TF table, and only those records from the other table that match.
Select the fields from the tables that you want to be output in the query and drag them to the grids at the bottom of the window.
Choose the "Make Table" query type so that your results will be stored in a table.
Run the query.
Export the resulting table as a tab-delimited text file. Bring it into Excel.
Repeat as needed to create all of the worksheets you need.
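The join described above (all records from the TF table, and only the matching records from the other table) is a left outer join. The sketch below reproduces the same logic in Python with the standard-library sqlite3 module instead of Microsoft Access, purely to illustrate what the query does. The table names `tf` and `production_rates_out`, the column names `id` and `production_rate`, and all of the demo values are hypothetical; your database will use its own names.

```python
import csv
import sqlite3


def build_production_rates_sheet(conn, out_path):
    """Store the joined rows in a new table and export them as tab-delimited text."""
    # "Make Table" equivalent: keep every TF and attach matching rates (LEFT JOIN).
    conn.execute("DROP TABLE IF EXISTS production_rates_out")
    conn.execute(
        """
        CREATE TABLE production_rates_out AS
        SELECT tf.id AS id, pr.production_rate AS production_rate
        FROM tf
        LEFT JOIN production_rates AS pr ON tf.id = pr.id
        """
    )
    rows = conn.execute(
        "SELECT id, production_rate FROM production_rates_out ORDER BY id"
    ).fetchall()
    # Export as tab-delimited text, ready to open in Excel.
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "production_rate"])
        writer.writerows(rows)
    return rows


def demo_connection():
    """In-memory database with made-up placeholder IDs and rates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tf (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE production_rates (id TEXT PRIMARY KEY, production_rate REAL)"
    )
    conn.executemany("INSERT INTO tf VALUES (?)", [("TF_A",), ("TF_B",)])
    conn.executemany(
        "INSERT INTO production_rates VALUES (?, ?)",
        [("TF_A", 0.5), ("TF_C", 0.9)],
    )
    return conn


if __name__ == "__main__":
    for row in build_production_rates_sheet(demo_connection(), "production_rates.txt"):
        print(row)
```

In the demo data, TF_B has no matching rate, so it still appears in the output with an empty value, while TF_C is left out because it is not in the TF table. An empty cell like this is a useful signal that the database is missing a record the network needs.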


@@ Line 12: / Line 12: @@
 * Data Analysts can have a shared ''individual'' journal entry.  Both students will be given the same grade and are expected to contribute equally to the electronic lab notebook.
+* Detailed notes should be taken throughout consistent with reproducible research and contributing to the final deliverables.
 === Milestone 1: Journal Club Presentation ===
@@ Line 31: / Line 32: @@
 #* Treatments are grouped together
 #* Replicates are grouped together
+#* Delete the "EWEIGHT" row and "GWEIGHT" column.
+# These data are from an Affymetrix, single-color chip.  So, instead of ratios, the values are raw intensity values.  We need to convert the data into Log<sub>2</sub> fold changes (LogFC).
+#* The raw intensity values have been log-transformed, which we need to undo before we calculate the ratios.
+#* Create new column headers and then do the transformation of all the data with the equation <code>=2^<cell designation></code>
+#* Compute the average value of the t0 timepoint for the control and CHP-treated data.
+#* Calculate the fold change by dividing each value for each timepoint by the average t0 value for the respective treatment (control or CHP-treated).
+#* Log<sub>2</sub> transform the fold changes.
+#* The data are now ready for the next step.
 === Milestone 3:  ANOVA analysis ===
-# Perform an ANOVA analysis of the data, as you did on [[Week 9]] for the Dahlquist lab data.
+# Perform an ANOVA analysis of the data (including the Bonferroni and Benjamini and Hochberg corrections), as you did on [[Week 9]] for the Dahlquist lab data.
+#* Perform the ANOVA separately for the Control vs. the CHP-treated data.
 #* Note that you will need to adjust your formulas to take into account the different number of timepoints and replicates in your article's dataset.
-#* Also note that you will need to consult with Dr. Dahlquist on how to convert the Affymetrix data from the paper into Log2 fold changes.
+# Perform a "Sanity Check" and create a table of p value counts like you did for the [[Week 9]] assignment.  This time, do the Sanity Check on the Benjamini-Hochberg corrected p values instead of the unadjusted p values.
+#* You will need to determine a suitable p-value cut-off for the clustering analysis.  A suitable cut-off will include about 25% of the total number of genes.
 === Milestone 4:  Clustering with stem and YEASTRACT ===
-# Cluster the data with stem, as you did on [[Week 10]].
+# Cluster the CHP-treated data with stem, as you did on [[Week 10]].  Collect the screenshots of the main clustering results and the plots of all significant clusters.  What are the main patterns that you see?  Are they similar to what was reported in the Sha et al. (2013) paper?
 #* Note that we will make some adjustments to the GO term analysis because stem was not providing GO term names.  We are going to use the GO enrichment tool at GeneOntology.org instead.
 ## Go to [http://geneontology.org/ http://geneontology.org/].
-## For the cluster you want to analyze, open the gene list and copy the list of genes.
+## Select two clusters with different patterns that have a reasonably large number of genes.  For each cluster you want to analyze, open the gene list and copy the list of genes.
 ## Paste the list of genes into the "Go Enrichment Analysis" box on the right hand side of the GeneOntology.org page.
 ## Select "Saccharomyces cerevisiae" from the species drop-down menu.
@@ Line 49: / Line 60: @@
 ## Near the bottom of the results page, click on the button to Export "Table".
 ## This will prompt you to save a .txt file that can be opened in Excel to view your results.
-# Use YEASTRACT to generate a candidate gene regulatory network as you did on [[Week 10]].
+# Use YEASTRACT to generate a list of candidate regulatory transcription factors in [[Week 10]].
-=== Milestone 5:  Create an input workbook for GRNmap using MS Access database ===
+=== Milestone 5:  Create a candidate gene regulatory network and input workbook for GRNmap using MS Access database ===
-# Create an input workbook for GRNmap using GRNsight and the Microsoft Access database that the Coder/Designer and QA's make, protocol ''TBA''.
+# Create an input workbook for GRNmap using queries to the Microsoft Access database that the Coder/Designer and QA's make.  The main worksheets you will need to create are as follows:
+#* <code>production_rates</code>
+#* <code>degradation_rates</code>
+#* <code>wt_log2_expression</code> (based on the CHP-treated data)
+#* <code>network</code>
 # Run GRNmap in Dr. Dahlquist's research lab (make appointment) and interpret data.
 # As the end-user of the Access database, the Data Analysts will provide feedback to the QAs and Coder/Designer about the usability of database.
+==== Using Microsoft Access Query Design ====
+This is a loose set of instructions on how to use your Microsoft Access database to make the GRNmap input workbook.
+# Import a table into the database that is the list of regulatory transcription factors that need to be included in the network (get from the Data Analysis team).
+# Go to the Query Design view and select the tables that you need for the query.  (For example, the TF table you just imported and the production_rates table).
+# Link the ID fields that are equivalent.
+# Right-click on the line between the fields and set the join properties:
+#* Include all the records from the TF table, and only those records from the other table that match.
+# Select the fields from the tables that you want to be output in the query and drag them to the grids at the bottom of the window.
+# Choose "Make Table" query so that your results will be stored in a table.
+# Run the query.
+# Export the table created as tab-delimited text file.  Bring it into Excel.
+# Repeat as needed to create all of the worksheets you need.
 {{Final Project Links}}
 [[Category:Team Project]]
