Knguye66 Eyoung20 Week 12/13
Combined Individual Journals for Kaitlyn Nguyen and Emma Young (Data Analysts).
Purpose
The purpose of this assignment is to record our progress toward the FunGals group deliverables as the Data Analysts for this week and the weeks to come. The purpose of week 12 specifically was to download the data and adapt it to the format we need for analysis, and then to begin the analysis with ANOVA and the preparations for STEM.
Methods and Results: Progress
Progress 11/21/19
- Our group decided to have the ANOVA, sanity check, and STEM set-up done BEFORE class on Thursday, 11/21/19
- First created a worksheet and labeled it according to the format the Coder's Guild decided on
- Finished the steps on Week 8 for Statistical Analysis Part I: ANOVA on Microsoft Excel for the new MicroArray Data found on the Data Analysis page
- For the questions asked about the p-value, used "out of 4467" genes instead of "out of 6189" to adjust for this data
- Following the ANOVA: Part I, Bonferroni, Benjamini & Hochberg, and p-value correction, a quick sanity check was performed for the p-value dataset.
- Created a new worksheet, naming it "S288C_ANOVA"
- Copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for our strain and pasted it into a new worksheet. Copied the columns containing the data for our strain and pasted it into the new worksheet
- Standard Name was not in the original downloaded data, so a third column (C) was added with the Standard Names of the genes
- Coder/Project Manager confirmed with their guild on the format of the column names (STRAIN)_(NAME)_(CONCENTRATION)_LogFC_(TIME)-(REPLICATES), e.g. S288C_thiuram_75uM_LogFC_t15m-3
- Data was copied over, with times and repeats of the experiment starting from least to greatest
- At the top of the first column to the right of the data, three column headers were created of the form (STRAIN)_(NAME)_(CONCENTRATION)_AvgLogFC_(TIME), where (STRAIN) is our strain designation and (TIME) is 15, 30, or 120, e.g. S288C_thiuram_75uM_AvgLogFC_t15m
- In the cell below the S288C_thiuram_75uM_AvgLogFC_(TIME) header, typed
=AVERAGE(
- Then highlighted all the data in row 2 associated with t15m, pressed the closing parenthesis key (Shift+0), and pressed the "enter" key.
- This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
- Clicked on this cell and positioned the cursor at the bottom right corner. We saw our cursor change to a thin black plus sign (not a chubby white one). When it did, we double clicked, and the formula copied to the entire column of 4467 other genes.
- Repeated steps (4) through (8) with the t30m, and t120m data.
- Now in the first empty column to the right of the S288C_thiuram_75uM_AvgLogFC_t120m calculation, we created the column header S288C_thiuram_75uM_ss_HO.
- In the first cell below this header, typed
=SUMSQ(
- Highlighted all the LogFC data in row 2 (but not the AvgLogFC), pressed the closing parenthesis key (Shift+0), and pressed the "enter" key.
- In the next empty column to the right of S288C_thiuram_75uM_ss_HO, created the column headers S288C_thiuram_75uM_ss_(TIME) as in (3).
- Made a note of how many data points we had at each time point for our strain. Ours had 3 replicates each. Also, made a note of the total number of data points (4468).
- In the first cell below the header S288C_thiuram_75uM_ss_t15, typed
=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2
and hit enter.
- The COUNTA function counted the number of cells in the specified range that had data in them (i.e., did not count cells with missing values).
- The phrase <range of cells for logFC_t15> was replaced by the data range associated with t15m.
- The phrase <AvgLogFC_t15> was replaced by the cell number in which we computed the AvgLogFC for t15m, and the "^2" squares that value.
- Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
- Repeated this computation for the t30m through t120m data points.
- In the first column to the right of S288C_thiuram_75uM_ss_t120m, created the column header S288C_thiuram_75uM_SS_full.
- In the first row below this header, typed
=sum(<range of cells containing "ss" for each timepoint>)
and hit enter.
- In the next two columns to the right, created the headers S288C_thiuram_75uM_Fstat and S288C_thiuram_75uM_p-value.
- Recalled the number of data points from (13): called that total n.
- In the first cell of the S288C_thiuram_75uM_Fstat column, typed
=((n-3)/3)*(<S288C_thiuram_75uM_ss_HO>-<S288C_thiuram_75uM_SS_full>)/<S288C_thiuram_75uM_SS_full>
and hit enter.
- Here n = 9, and "3" is the number of timepoints (i.e., t15m, t30m, t120m).
- Replaced the phrase <S288C_thiuram_75uM_ss_HO> with the cell designation.
- Replaced the phrase <S288C_thiuram_75uM_SS_full> with the cell designation.
- Copied to the whole column.
- In the first cell below the S288C_thiuram_75uM_p-value header, typed
=FDIST(<S288C_thiuram_75uM_Fstat>,3,9-3)
replacing the phrase <S288C_thiuram_75uM_Fstat> with the cell designation and the "n" with 9, the total number of data points.
- Before we moved on to the next step, we performed a quick sanity check to see if we did all of these computations correctly.
- Clicked on cell A1 and clicked on the Data tab. Selected the Filter icon (looks like a funnel). Little drop-down arrows appeared at the top of each column. This enabled us to filter the data according to criteria we set.
- Clicked on the drop-down arrow on our S288C_thiuram_75uM_p-value column. Selected "Number Filters". In the window that appeared, we set a criterion that filtered our data so that the p-value had to be less than 0.05.
- Before continuing the next steps, filters were undone.
- We performed adjustments to the p-value to correct for the multiple testing problem. Labeled the next two columns to the right with the same label, S288C_thiuram_75uM_Bonferroni_p-value.
- Typed the equation
=<S288C_thiuram_75uM_p-value>*4467
Upon completion of this single computation, used the Step (10) trick to copy the formula throughout the column.
- Replaced any corrected p-value that was greater than 1 with the number 1 by typing the following formula into the first cell below the second S288C_thiuram_75uM_Bonferroni_p-value header:
=IF(<S288C_thiuram_75uM_Bonferroni_p-value>>1,1,<S288C_thiuram_75uM_Bonferroni_p-value>)
where <S288C_thiuram_75uM_Bonferroni_p-value> refers to the cell in which the first Bonferroni p-value computation was made. Used the Step (10) trick to copy the formula throughout the column.
- Inserted a new worksheet named "S288C_thiuram_75uM_ANOVA_B-H".
- Copied and pasted the "MasterIndex", "ID", and "Standard Name" columns from our previous worksheet into the first three columns of the new worksheet.
- For the following, used Paste special > Paste values. Copied our unadjusted p-values from our ANOVA worksheet and pasted it into Column D.
- Selected all of columns A, B, C, and D. Clicked the sort button (A to Z) on the toolbar and, in the window that appeared, sorted by Column D, smallest to largest.
- Typed the header "Rank" in cell E1. We created a series of numbers in ascending order from 1 to 4467 in this column. This is the p-value rank, smallest to largest. Typed "1" into cell E2 and "2" into cell E3. Selected both cells E2 and E3. Double-clicked on the plus sign on the lower right-hand corner of the selection to fill the column with a series of numbers from 1 to 4467.
- Calculated the Benjamini and Hochberg p value correction by typing S288C_thiuram_75uM_B-H_p-value in cell F1. Typed the following formula in cell F2:
=(D2*4467)/E2
and pressed enter. Copied that equation to the entire column.
- Typed "S288C_thiuram_75uM_B-H_p-value" into cell G1.
- Typed the following formula into cell G2:
=IF(F2>1,1,F2)
and pressed enter. Copied that equation to the entire column.
- Selected columns A through G. Sorted them by our MasterIndex in Column A in ascending order.
- Copied column G and used Paste special > Paste values to paste it into the next column on the right of our ANOVA sheet.
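The spreadsheet steps above can also be sketched in Python as a cross-check. This is a minimal sketch, not the procedure we actually ran: the function names are hypothetical, it assumes missing values have already been dropped, and it stops at the F statistic (the p-value itself came from Excel's FDIST). The Benjamini and Hochberg function mirrors only the worksheet formula `=(D2*4467)/E2` followed by the cap at 1.

```python
def f_statistic(groups):
    """F statistic for one gene; groups holds one list of LogFC values per timepoint."""
    values = [v for g in groups for v in g]
    n = len(values)                      # 9 in this journal (3 timepoints x 3 replicates)
    k = len(groups)                      # 3 timepoints
    ss_h0 = sum(v * v for v in values)   # =SUMSQ(all LogFC cells in the row)
    ss_full = 0.0
    for g in groups:
        mean = sum(g) / len(g)           # =AVERAGE(...)
        # =SUMSQ(range)-COUNTA(range)*AvgLogFC^2
        ss_full += sum(v * v for v in g) - len(g) * mean ** 2
    # =((n-3)/3)*(ss_HO-SS_full)/SS_full; raises ZeroDivisionError if SS_full is 0
    return ((n - k) / k) * (ss_h0 - ss_full) / ss_full


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bh_spreadsheet(p_values):
    """p * m / rank, capped at 1, returned in the original (MasterIndex) order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # rank 1 = smallest p-value
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(1.0, p_values[i] * m / rank)
    return adjusted


# Made-up numbers for illustration only, not values from our dataset.
example_gene = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [0.0, 1.0, -1.0]]
print(f_statistic(example_gene))
print(bonferroni([0.04, 0.01, 0.5]))
print(bh_spreadsheet([0.04, 0.01, 0.5]))
```

Library implementations of the Benjamini and Hochberg correction usually also force the adjusted values to be non-decreasing with rank; the sketch leaves that step out because the worksheet does not perform it.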
Sanity Check Questions:
-Unadjusted p-value-
- How many genes have p<0.05? and what is the percentage (out of 4467)?
- 1662 : 37.2%
- How many genes have p<0.01? and what is the percentage (out of 4467)?
- 811 : 18.16%
- How many genes have p<0.001? and what is the percentage (out of 4467)?
- 225 : 5.04%
- How many genes have p<0.0001? and what is the percentage (out of 4467)?
- 39 : 0.87%
-Bonferroni & Benjamini and Hochberg p-value-
- How many genes are p<0.05 for the Bonferroni-corrected p-value? and what is the percentage (out of 4467)?
- 5 , 0.11%
- How many genes are p <0.05 for the Benjamini and Hochberg-corrected p-value? and what is the percentage (out of 4467)?
- 731, 16.36%
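The percentages in the sanity check are each count divided by the 4467 genes in this dataset. A small sketch that recomputes them from the counts reported above (the helper name `percent_of_total` is hypothetical):

```python
TOTAL_GENES = 4467

# Counts copied from the sanity check answers above.
counts = {
    "unadjusted p < 0.05": 1662,
    "unadjusted p < 0.01": 811,
    "unadjusted p < 0.001": 225,
    "unadjusted p < 0.0001": 39,
    "Bonferroni p < 0.05": 5,
    "B-H p < 0.05": 731,
}


def percent_of_total(count, total=TOTAL_GENES):
    """Percentage of genes passing a cutoff, rounded to two decimal places."""
    return round(100 * count / total, 2)


for label, count in counts.items():
    print(f"{label}: {count} ({percent_of_total(count)}%)")
```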
- Microarray data was prepared to be loaded into the STEM software
- A new worksheet was added into the Excel workbook, and named "Thiuram_stem".
- Then all of the data from our "Thiuram_ANOVA" worksheet was copied into the "Thiuram_stem" worksheet using Paste special > Paste values.
- The leftmost column had the column header "Master_Index". This was renamed to "SPOT".
- Column B that says "ID" was renamed to "Gene Symbol". There was no column for standard name present on the data given.
- The data was then filtered on the B-H corrected p-value to be > 0.05
- Once the data was filtered, we selected all of the rows (except for the header row) and deleted them by right-clicking and choosing "Delete Row" from the context menu. The filter was then undone. This ensured that we would cluster only the genes with a "significant" change in expression and not the noise.
- Deleted all of the data columns EXCEPT for the Average Log Fold change columns for each timepoint.
- Renamed the data columns with just the time and units (for example, 15m, 30m, etc.).
- Saved work.
- An error was found in the ANOVA results, so the process was repeated.
- There were too few results in the repeated STEM preparation: only 6 genes.
- The third try was successful and resulted in 731 gene entries that fit with the results of the Benjamini and Hochberg-corrected p-value sanity check results.
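The STEM preparation above amounts to a row filter plus a column rename. A minimal Python sketch of the same idea follows; the function names are hypothetical, the input column names are assumed to match the headers used in our ANOVA worksheet, the sample rows are made up, and the tab-delimited output is an assumption about the file we would hand to STEM.

```python
import csv
import io

BH_COLUMN = "S288C_thiuram_75uM_B-H_p-value"

# Old header -> new header; only these columns are kept.
RENAME = {
    "MasterIndex": "SPOT",
    "ID": "Gene Symbol",
    "S288C_thiuram_75uM_AvgLogFC_t15m": "15m",
    "S288C_thiuram_75uM_AvgLogFC_t30m": "30m",
    "S288C_thiuram_75uM_AvgLogFC_t120m": "120m",
}


def to_stem_rows(rows, cutoff=0.05):
    """Drop rows whose B-H p-value is above the cutoff, then keep and rename columns."""
    kept = []
    for row in rows:
        if float(row[BH_COLUMN]) > cutoff:   # same rows the "> 0.05" filter deleted
            continue
        kept.append({new: row[old] for old, new in RENAME.items()})
    return kept


def write_stem_table(rows):
    """Return the kept rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(RENAME.values()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration only.
sample = [
    {"MasterIndex": "1", "ID": "GENE_A", BH_COLUMN: "0.01",
     "S288C_thiuram_75uM_AvgLogFC_t15m": "0.5",
     "S288C_thiuram_75uM_AvgLogFC_t30m": "1.2",
     "S288C_thiuram_75uM_AvgLogFC_t120m": "-0.3"},
    {"MasterIndex": "2", "ID": "GENE_B", BH_COLUMN: "0.40",
     "S288C_thiuram_75uM_AvgLogFC_t15m": "0.1",
     "S288C_thiuram_75uM_AvgLogFC_t30m": "0.0",
     "S288C_thiuram_75uM_AvgLogFC_t120m": "0.2"},
]
print(write_stem_table(to_stem_rows(sample)))
```

Counting the rows returned by `to_stem_rows` gives a quick way to compare against the B-H sanity check count before loading the file into STEM.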
Conclusion
This final project was very time-consuming and quite tough, but it has been fun to analyze the microarray data and pick out faults and errors that the original experiment failed to account for.
Data and files
Acknowledgements
This section acknowledges my partner Kaitlyn Nguyen (User:knguye66), as well as Michael Armas (User:Marmas), Iliana Crespin (User:Icrespin), and Emma Young (User:eyoung20). We would also like to acknowledge Dr. Dahlquist (User:KDahlquist) for introducing and teaching the topic and direction of this assignment. We also acknowledge that this is a shared electronic notebook between Kaitlyn Nguyen and Emma Young.
"Except for what is noted above, this individual journal entry was completed by me and not copied from another source." Knguye66 (talk) 18:49, 20 November 2019 (PST)
"Except for what is noted above, this individual journal entry was completed by me and not copied from another source." Eyoung20 (talk) 16:40, 25 November 2019 (PST)
References
- Dahlquist, K. (2019, November 19). Data Analysis. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis
- Dahlquist, K. (2019, November 20). Final Project Deliverables. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Final_Project_Deliverables
- Dahlquist, K. (2019, November 19). Week 12/13. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_12/13
- Dahlquist, K. (2019, October 17). Week 8. In Wikipedia, Biological Databases. Retrieved 6:30, October 21, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_8
User Page
User:knguye66
Template Page
Template:knguye66