Difference between revisions of "Data Analysts Week 13"

From LMU BioDB 2024
Jump to navigation Jump to search
(Charlotte and Katie's Data Analyst Journal: adding another detail)
(fixed template)
 
(19 intermediate revisions by 3 users not shown)
Line 6: Line 6:
 
===Milestone 2===
 
===Milestone 2===
 
#With Quality Assurance team member [[User:Hivanson| Hailey Ivanson]], we downloaded and examined the microarray dataset: [https://sgd-prod-upload.s3.amazonaws.com/S000204389/Sha_2013_PMID_24073228.zip SGD Processed Data].
 
#With Quality Assurance team member [[User:Hivanson| Hailey Ivanson]], we downloaded and examined the microarray dataset: [https://sgd-prod-upload.s3.amazonaws.com/S000204389/Sha_2013_PMID_24073228.zip SGD Processed Data].
#We made a sample-data relationship table in Excel labeled "reorganized" that lists all of the samples and at which time point they were collected, and their replicate number. We came up with consistent column headers that summarize this information. We named each column either Control_LogFC_timepoint-replicatenumber and CHP_LogFC_timepoint-replicatenumber, as in Control_LogFC_0-1 and CHP_LogFC_0-1. The timepoint for each is either 0, 3 , 6, 12, 20, 40, 70, or 120, and the replicate number either 1, 2, or 3. We organized the data in a worksheet in an Excel workbook so that:  
+
#We made a sample-data relationship table in Excel labeled "reorganized" that lists all of the samples and at which time point they were collected, and their replicate number. We came up with consistent column headers that summarize this information. We named each column either Control_LogFC_timepoint-replicatenumber or CHP_LogFC_timepoint-replicatenumber, as in Control_LogFC_0-1 and CHP_LogFC_0-1. The timepoint for each is either 0, 3 , 6, 12, 20, 40, 70, or 120, and the replicate number either 1, 2, or 3. We organized the data in a worksheet in an Excel workbook so that:  
 
#*ID is the first column header, and within it are all of the SGD systemic names
 
#*ID is the first column header, and within it are all of the SGD systemic names
#*Data columns are to the right, in increasing chronological order, using the column header pattern we created.
+
#*Data columns are to the right, in increasing chronological order, using the column header pattern we created
 
#*Treatments are grouped together
 
#*Treatments are grouped together
 
#*Replicates are grouped together
 
#*Replicates are grouped together
 
#*We deleted the "EWEIGHT" row and "GWEIGHT" column.
 
#*We deleted the "EWEIGHT" row and "GWEIGHT" column.
#*We then had to undo the log-transformed raw intensity values. We first created new columns for each respective trial in the formats Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber, as in Control_FC_0-1 and CHP_FC_0-1. We then transformed the data in the first cell of each column with the equation <code>=2^<cell designation></code>, where the cell designation is the first cell of the Control_LogFC_timepoint-replicatenumber and CHP_LogFC_timepoint-replicatenumber columns, and then applied this command to the remaining cells of the column.  
+
#*We then had to undo the log-transformed raw intensity values. We first created new columns for each respective trial in the formats Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber, as in Control_FC_0-1 and CHP_FC_0-1. We then transformed the data in the first cell of each column with the equation <code>=2^<cell designation></code>, where the cell designation is the first cell of the respective Control_LogFC_timepoint-replicatenumber and CHP_LogFC_timepoint-replicatenumber columns, and then applied this command to the remaining cells of the column.  
#*We then created a new worksheet labeled "with_averages", in which we copy and special pasted the values of the columns with headers Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber, as in Control_FC_0-1 and CHP_FC_0-1.
+
#*We then created a new worksheet labeled "with_averages", in which we copy and special pasted the values of the columns with headers Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber.
 
#*We then created new columns called Control_FC_0-avg and CHP_FC_0-avg to the right of their respective t0 timepoint trials, and then within them computed the average value of the t0 timepoint trials for the control and CHP-treated data. In the first cell below the column headed Control_FC_0-avg, we used the Excel command <code>=AVG(B2:D2)</code> and then applied this command to all cells in the column. In the first cell below the column headed CHP_FC_0-avg column, we used the command <code>=AVG(F2:H2)</code> and then applied this command to all cells in the column.  
 
#*We then created new columns called Control_FC_0-avg and CHP_FC_0-avg to the right of their respective t0 timepoint trials, and then within them computed the average value of the t0 timepoint trials for the control and CHP-treated data. In the first cell below the column headed Control_FC_0-avg, we used the Excel command <code>=AVG(B2:D2)</code> and then applied this command to all cells in the column. In the first cell below the column headed CHP_FC_0-avg column, we used the command <code>=AVG(F2:H2)</code> and then applied this command to all cells in the column.  
#*We then created new columns to the right of each treatment with a column header either Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Fold_Change_3-1 or CHP_Fold_Change_3-1. We then calculated the fold change by dividing the first cell of each trial by the average t0 value for the respective treatment (control or CHP-treated), and then applying this throughout the column. To the right of each new column, we also created created columns with a column header either Control_Log2_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Log2_Fold_Change_3-1 or CHP_Log2_Fold_Change_3-1. We then Log2 transformed the fold changes by using <code>=LOG(cell designation, 2)</code>, where the cell designation is the first cell of the respective Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber columns. We then applied this throughout all of the cells in the Log2 column.  
+
#*We then created new columns to the right of each treatment with a column header either Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Fold_Change_3-1 or CHP_Fold_Change_3-1. We then calculated the fold change by dividing the first cell of each trial by the average t0 value for the respective treatment (control or CHP-treated), and then applying this throughout the column. To the right of each new column, we also created created columns with a column header either Control_Log2_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Log2_Fold_Change_3-1 or CHP_Log2_Fold_Change_3-1. In each Log2 column, We then log2 transformed the fold changes by using <code>=LOG(cell designation, 2)</code>, where the cell designation is the first cell of the respective Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber columns. We then applied this throughout all of the cells in each Log2 column.
  
 
===Milestone 3===
 
===Milestone 3===
 
#We created a new worksheet, naming it "CHP_ANOVA".  
 
#We created a new worksheet, naming it "CHP_ANOVA".  
 
#We copied all the Control_Log2_Fold_Change_timepoint-replicatenumber and CHP_Log2_Fold_Change_timepoint-replicatenumber columns and special pasted only the values into the new worksheet.  
 
#We copied all the Control_Log2_Fold_Change_timepoint-replicatenumber and CHP_Log2_Fold_Change_timepoint-replicatenumber columns and special pasted only the values into the new worksheet.  
#To the right of each group of either the Control or CHP trials at one timepoint, we created columns with headers in the form Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint, where timepoint is either 3, 6, 12, 20, 40, 70, or 120.  
+
#To the right of each group of either the Control or CHP replicates at one timepoint, we created columns with headers in the form Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint, where timepoint is either 3, 6, 12, 20, 40, 70, or 120.  
# In the cell below the Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint headers, we typed <code>=AVERAGE()</code>, and then applied this throughout the column.
+
# In the cell below the Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint headers, we typed =AVG(replicate 1 cell designation:replicate 3 cell designation), where the cell designations are the first cells of the replicate 1 at one timepoint and the replicate 3 at that same time point. This computed the average of replicate values, and we then applied this throughout the remaining cells in the column.  
# Repeat steps (4) through (8) with the t30, t60, t90, and the t120 data.
 
 
# To the right, we then copy and special pasted the values of each Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint columns. To the right of these, we created columns with the the headers Control_ss_HO and CHP_ss_HO.  
 
# To the right, we then copy and special pasted the values of each Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint columns. To the right of these, we created columns with the the headers Control_ss_HO and CHP_ss_HO.  
#In the first cell below Avg_Control_Log_FC_timepoint, we typed <code>=SUMSQ(B2,C2,D2,J2,K2,L2,R2,S2,T2,Z2,AA2,AB2,AH2,AI2,AJ2,AP2,AQ2,AR2,AX2,AY2,AZ2)</code>, and below Avg_CHP_Log_FC_timepoint, we typed <code>=SUMSQ(F2,G2,H2,N2,O2,P2,V2,W2,X2,AD2,AE2,AF2,AL2,AM2,AN2,AT2,AU2,AV2,BB2,BC2,BD2)</code>.
+
#In the first cell below Control_ss_HO, we typed <code>=SUMSQ(B2,C2,D2,J2,K2,L2,R2,S2,T2,Z2,AA2,AB2,AH2,AI2,AJ2,AP2,AQ2,AR2,AX2,AY2,AZ2)</code> and clicked enter, and below CHP_ss_HO, we typed <code>=SUMSQ(F2,G2,H2,N2,O2,P2,V2,W2,X2,AD2,AE2,AF2,AL2,AM2,AN2,AT2,AU2,AV2,BB2,BC2,BD2)</code> and clicked enter.
 +
#In the first cell below this header, we typed =SUMSQ for each of the time points.
 +
#After this, we highlighted all of the LogFC data in row 2- not the AvgLogFC, and we pressed the enter key.
 +
#We created the column headers Control_Log_FC_Time and CHP_Log_FC_Time.
 +
#We used the equation =SUMSQ(<range of cells for logFC_t0-2>)-COUNTA(<range of cells for logFC_t0-2>)*<AvgLogFC_t0-2>^2 and hit enter for each of the rows that were not the average.  
 +
#The amount of data points we have is: 21. Number of time points we have is: 7.
 +
#We replaced <range of cells for logFC_0-2> with the data range associated with t15 and so on for each time point.
 +
#We replaced <AvgCHPLogFC_0-2> and <AvgControlLogFC_0-2> with the cell number that we computed the AvgCHPLogFC and AvgControlLogFC for 0-2, and the "^2" squares that value. This was repeated for each of time points across our data set.
 +
#We created the column headers CHP_SS_full and Control_SS_full.
 +
#Below this, we used the formula: =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
 +
#In the next two columns to the right, we created the headers CHP_Fstat and Control_Fstat as well as CHP_p-value and Control_p-value.
 +
#Since n=21 and t=7, in the first cell of each Fstat column, we used the formula =((21-7)/7)*(<CHP_ss_HO>-<CHP_SS_full>)/<CHP_SS_full>) and hit enter. This same process was repeated for the control.
 +
#We replaced the phrase CHP_ss_HO with the cell designation and replaced the phrase <CHP_SS_full> with the cell designation. This was also repeated for the control.
 +
#Below the CHP_p-value header, we used the formula =FDIST(<CHP_Fstat>,7,21-7) replacing the phrase <CHP_Fstat> with the cell designation. We used this same formula for the Control_p-value column.
 +
#We performed a sanity check by filtering our data so that the p value has to be less than 0.05.
 +
#Before continuing, we undid any filters before we calculated the Bonferroni and p value corrections.
 +
#We created two columns to the right called CHP_Bonferroni_p-value and Control_Bonferroni_p-value.
 +
#We used the equations =<CHP_p-value>*4697 and =<Control_p-value>*4697
 +
#
 +
 
  
 
==Acknowledgements==
 
==Acknowledgements==
This procedure was adapted from the Data Analysis page Milestone protocols, linked here: [[Data Analysis]]
+
This procedure was adapted from the Data Analysis page Milestone 1, 2, and 3 protocols, linked here: [[Data Analysis]]
 +
The procedure for Milestone 3 was also adapted from the steps outlined in the Week 9 assignment page.
 +
 
 +
Except for what is noted above, this individual journal entry was completed by Katie and Charlotte and not copied from another source.
 +
 
 +
[[User:Kmill104|Kmill104]] ([[User talk:Kmill104|talk]]) 22:50, 17 April 2024 (PDT)
 +
[[User:Ckapla12|Ckapla12]] ([[User talk:Ckapla12|talk]]) 13:59, 18 April 2024 (PDT)
  
 
==References==
 
==References==
 
LMU BioDB 2024. (2024). Week 13. Retrieved April 17, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_13
 
LMU BioDB 2024. (2024). Week 13. Retrieved April 17, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_13
 +
 +
LMU BioDB 2024. (2024). Data Analysis. Retrieved April 23, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Data_Analysis
 +
 +
LMU BioDB 2024. (2024). Week 9. Retrieved April 23, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_9
 +
 +
{{Yeast Beasts}}
 +
[[Category:Journal Entry]]
 +
[[Category:Team Project]]

Latest revision as of 12:30, 1 May 2024

Charlotte and Katie's Data Analyst Journal

Milestone 1

Completed as of April 11th when we gave our Journal Club Presentation with Hailey Ivanson

Milestone 2

  1. With Quality Assurance team member Hailey Ivanson, we downloaded and examined the microarray dataset: SGD Processed Data.
  2. We made a sample-data relationship table in Excel labeled "reorganized" that lists all of the samples and at which time point they were collected, and their replicate number. We came up with consistent column headers that summarize this information. We named each column either Control_LogFC_timepoint-replicatenumber or CHP_LogFC_timepoint-replicatenumber, as in Control_LogFC_0-1 and CHP_LogFC_0-1. The timepoint for each is either 0, 3 , 6, 12, 20, 40, 70, or 120, and the replicate number either 1, 2, or 3. We organized the data in a worksheet in an Excel workbook so that:
    • ID is the first column header, and within it are all of the SGD systemic names
    • Data columns are to the right, in increasing chronological order, using the column header pattern we created
    • Treatments are grouped together
    • Replicates are grouped together
    • We deleted the "EWEIGHT" row and "GWEIGHT" column.
    • We then had to undo the log-transformed raw intensity values. We first created new columns for each respective trial in the formats Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber, as in Control_FC_0-1 and CHP_FC_0-1. We then transformed the data in the first cell of each column with the equation =2^<cell designation>, where the cell designation is the first cell of the respective Control_LogFC_timepoint-replicatenumber and CHP_LogFC_timepoint-replicatenumber columns, and then applied this command to the remaining cells of the column.
    • We then created a new worksheet labeled "with_averages", in which we copy and special pasted the values of the columns with headers Control_FC_timepoint-replicatenumber and CHP_FC_timepoint-replicatenumber.
    • We then created new columns called Control_FC_0-avg and CHP_FC_0-avg to the right of their respective t0 timepoint trials, and then within them computed the average value of the t0 timepoint trials for the control and CHP-treated data. In the first cell below the column headed Control_FC_0-avg, we used the Excel command =AVG(B2:D2) and then applied this command to all cells in the column. In the first cell below the column headed CHP_FC_0-avg column, we used the command =AVG(F2:H2) and then applied this command to all cells in the column.
    • We then created new columns to the right of each treatment with a column header either Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Fold_Change_3-1 or CHP_Fold_Change_3-1. We then calculated the fold change by dividing the first cell of each trial by the average t0 value for the respective treatment (control or CHP-treated), and then applying this throughout the column. To the right of each new column, we also created created columns with a column header either Control_Log2_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber, as in Control_Log2_Fold_Change_3-1 or CHP_Log2_Fold_Change_3-1. In each Log2 column, We then log2 transformed the fold changes by using =LOG(cell designation, 2), where the cell designation is the first cell of the respective Control_Fold_Change_timepoint-replicatenumber or CHP_Fold_Change_timepoint-replicatenumber columns. We then applied this throughout all of the cells in each Log2 column.

Milestone 3

  1. We created a new worksheet, naming it "CHP_ANOVA".
  2. We copied all the Control_Log2_Fold_Change_timepoint-replicatenumber and CHP_Log2_Fold_Change_timepoint-replicatenumber columns and special pasted only the values into the new worksheet.
  3. To the right of each group of either the Control or CHP replicates at one timepoint, we created columns with headers in the form Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint, where timepoint is either 3, 6, 12, 20, 40, 70, or 120.
  4. In the cell below the Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint headers, we typed =AVG(replicate 1 cell designation:replicate 3 cell designation), where the cell designations are the first cells of the replicate 1 at one timepoint and the replicate 3 at that same time point. This computed the average of replicate values, and we then applied this throughout the remaining cells in the column.
  5. To the right, we then copy and special pasted the values of each Avg_Control_Log_FC_timepoint or Avg_CHP_Log_FC_timepoint columns. To the right of these, we created columns with the the headers Control_ss_HO and CHP_ss_HO.
  6. In the first cell below Control_ss_HO, we typed =SUMSQ(B2,C2,D2,J2,K2,L2,R2,S2,T2,Z2,AA2,AB2,AH2,AI2,AJ2,AP2,AQ2,AR2,AX2,AY2,AZ2) and clicked enter, and below CHP_ss_HO, we typed =SUMSQ(F2,G2,H2,N2,O2,P2,V2,W2,X2,AD2,AE2,AF2,AL2,AM2,AN2,AT2,AU2,AV2,BB2,BC2,BD2) and clicked enter.
  7. In the first cell below this header, we typed =SUMSQ for each of the time points.
  8. After this, we highlighted all of the LogFC data in row 2- not the AvgLogFC, and we pressed the enter key.
  9. We created the column headers Control_Log_FC_Time and CHP_Log_FC_Time.
  10. We used the equation =SUMSQ(<range of cells for logFC_t0-2>)-COUNTA(<range of cells for logFC_t0-2>)*<AvgLogFC_t0-2>^2 and hit enter for each of the rows that were not the average.
  11. The amount of data points we have is: 21. Number of time points we have is: 7.
  12. We replaced <range of cells for logFC_0-2> with the data range associated with t15 and so on for each time point.
  13. We replaced <AvgCHPLogFC_0-2> and <AvgControlLogFC_0-2> with the cell number that we computed the AvgCHPLogFC and AvgControlLogFC for 0-2, and the "^2" squares that value. This was repeated for each of time points across our data set.
  14. We created the column headers CHP_SS_full and Control_SS_full.
  15. Below this, we used the formula: =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
  16. In the next two columns to the right, we created the headers CHP_Fstat and Control_Fstat as well as CHP_p-value and Control_p-value.
  17. Since n=21 and t=7, in the first cell of each Fstat column, we used the formula =((21-7)/7)*(<CHP_ss_HO>-<CHP_SS_full>)/<CHP_SS_full>) and hit enter. This same process was repeated for the control.
  18. We replaced the phrase CHP_ss_HO with the cell designation and replaced the phrase <CHP_SS_full> with the cell designation. This was also repeated for the control.
  19. Below the CHP_p-value header, we used the formula =FDIST(<CHP_Fstat>,7,21-7) replacing the phrase <CHP_Fstat> with the cell designation. We used this same formula for the Control_p-value column.
  20. We performed a sanity check by filtering our data so that the p value has to be less than 0.05.
  21. Before continuing, we undid any filters before we calculated the Bonferroni and p value corrections.
  22. We created two columns to the right called CHP_Bonferroni_p-value and Control_Bonferroni_p-value.
  23. We used the equations =<CHP_p-value>*4697 and =<Control_p-value>*4697


Acknowledgements

This procedure was adapted from the Data Analysis page Milestone 1, 2, and 3 protocols, linked here: Data Analysis The procedure for Milestone 3 was also adapted from the steps outlined in the Week 9 assignment page.

Except for what is noted above, this individual journal entry was completed by Katie and Charlotte and not copied from another source.

Kmill104 (talk) 22:50, 17 April 2024 (PDT) Ckapla12 (talk) 13:59, 18 April 2024 (PDT)

References

LMU BioDB 2024. (2024). Week 13. Retrieved April 17, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_13

LMU BioDB 2024. (2024). Data Analysis. Retrieved April 23, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Data_Analysis

LMU BioDB 2024. (2024). Week 9. Retrieved April 23, 2024 from https://xmlpipedb.cs.lmu.edu/biodb/spring2024/index.php/Week_9

Team Pages

Role Pages