Difference between revisions of "Knguye66 Eyoung20 Week 12/13"

Revision as of 16:38, 21 November 2019

Combined Individual Journals for Kaitlyn Nguyen and Emma Young (Data Analysts).

Purpose

The purpose of this assignment is to record our progress towards the FunGals group deliverables as the Data Analysts for this week and the future weeks to come.

Methods and Results: Progress

Progress 11/21/19

Our group decided to have the ANOVA, sanity check, and STEM set-up done BEFORE class on Thursday, 11/21/19
First created a worksheet and labeled accordingly based on the format the Coder's Guild decided
Finished the steps on Week 8 for Statistical Analysis Part I: ANOVA on Microsoft Excel for the new MicroArray Data found on the Data Analysis page
- For questions asked on the p-value (use: "out of 4468" genes instead of "out of 6189" to adjust for this data)
Following the ANOVA: Part I, Bonferroni, Benjamini & Hochberg, and p-value correction, a quick sanity check was performed for the p-value dataset.
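The two p-value corrections named above can be sketched in Python. This is a minimal sketch of the standard Bonferroni and Benjamini & Hochberg procedures, not a copy of the Week 8 spreadsheet formulas, which may differ in detail (for example, in whether the B-H values are forced to be monotonic). The function names and the example p-values are made up for illustration.

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Standard B-H adjustment: p * m / rank, made monotonic from the largest p down."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (not taken from our data)
example_pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(example_pvals)
bh = benjamini_hochberg(example_pvals)
print(bonf)
print(bh)
```

Bonferroni is the stricter of the two, which is consistent with it leaving far fewer genes below 0.05 than the B-H correction does.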

Created a new worksheet named "S288C_ANOVA".
Copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for our strain and pasted it into a new worksheet. Copied the columns containing the data for our strain and pasted it into the new worksheet
- Standard Name was not in the original downloaded data, so a third column (C) was added with the Standard Names of the genes
- Coder/Project Manager confirmed with their guild on the format of the column names (STRAIN)_(NAME)_(CONCENTRATION)_LogFC_(TIME)-(REPLICATES), e.g. S288C_thiuram_75uM_LogFC_t15m-3
  - Data was copied over, with times and repeats of the experiment starting from least to greatest
At the top of the first column to the right of the data, three column headers were created of the form (STRAIN)_(NAME)_(CONCENTRATION)_AvgLogFC_(TIME) where STRAIN is your strain designation and (TIME) is 15, 30, 120.

e.g. S288C_thiuram_75uM_AvgLogFC_t15m

In the cell below the S288C_thiuram_75uM_AvgLogFC_(TIME) header, typed =AVERAGE(
Then highlighted all the data in row 2 associated with t15m, pressed the closing paren key (shift 0), and pressed the "enter" key.
This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
Clicked on this cell and positioned the cursor at the bottom right corner. We saw our cursor change to a thin black plus sign (not a chubby white one). When it did, we double clicked, and the formula was copied down the entire column for all 4468 genes.
Repeated steps (4) through (8) with the t30m and t120m data.
Now in the first empty column to the right of the S288C_thiuram_75uM_AvgLogFC_t120m calculation, we created the column header S288C_thiuram_75uM_ss_HO.
In the first cell below this header, typed =SUMSQ(
Highlighted all the LogFC data in row 2 (but not the AvgLogFC), pressed the closing paren key (shift 0), and pressed the "enter" key.
In the next empty column to the right of S288C_thiuram_75uM_ss_HO, created the column headers S288C_thiuram_75uM_ss_(TIME) as in (3).
Made a note of how many data points we had at each time point for our strain. Ours had 3 replicates at each of the 3 time points, for a total of 9 data points per gene (the n used below). The dataset contains 4468 genes.
In the first cell below the header S288C_thiuram_75uM_ss_t15, typed =SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2 and hit enter.
- The COUNTA function counted the number of cells in the specified range that had data in them (i.e., did not count cells with missing values).
- The phrase <range of cells for logFC_t15> was replaced by the data range associated with t15m.
- The phrase <AvgLogFC_t15> was replaced by the cell number in which we computed the AvgLogFC for t15m, and the "^2" squares that value.
- Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
Repeated this computation for the t30m and t120m data points.
In the first column to the right of S288C_thiuram_75uM_ss_t120m, created the column header S288C_thiuram_75uM_SS_full.
In the first row below this header, typed =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
In the next two columns to the right, created the headers S288C_thiuram_75uM_Fstat and S288C_thiuram_75uM_p-value.
Recalled the number of data points from (13): called that total n.
In the first cell of the S288C_thiuram_75uM_Fstat column, typed =((n-3)/3)*(<S288C_thiuram_75uM_ss_HO>-<S288C_thiuram_75uM_SS_full>)/<S288C_thiuram_75uM_SS_full> and hit enter.
- n = 9, and "3" is the number of timepoints (i.e., t15m, t30m, t120m).
- Replaced the phrase <S288C_thiuram_75uM_ss_HO> with the cell designation.
- Replaced the phrase <S288C_thiuram_75uM_SS_full> with the cell designation.
- Copied to the whole column.
In the first cell below the S288C_thiuram_75uM_p-value header, typed =FDIST(<S288C_thiuram_75uM_Fstat>,3,9-3), replacing the phrase <S288C_thiuram_75uM_Fstat> with the cell designation. Here "3" is the number of timepoints and "9" is n, the total number of data points from (13).
Before moving on to the next step, we performed a quick sanity check to see whether we had done all of these computations correctly.
- Clicked on cell A1 and then on the Data tab. Selected the Filter icon (looks like a funnel). Little drop-down arrows appeared at the top of each column, which let us filter the data according to criteria we set.
- Clicked on the drop-down arrow on the S288C_thiuram_75uM_p-value column. Selected "Number Filters". In the window that appeared, set a criterion to filter the data so that the p value had to be less than 0.05.
- Excel then displayed only the rows that met that filtering criterion. A number appeared in the lower left-hand corner of the window giving the number of rows that met the criterion. We checked our results with each other to make sure that the computations were performed correctly.
- Undid any filters that had been applied before making any additional calculations.
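The spreadsheet computation above can be sketched for a single gene in Python. This is a minimal sketch assuming no missing values (so `len` stands in for COUNTA); the function name `anova_row` and the LogFC values are hypothetical and not taken from our data. It stops at the F statistic, which the worksheet then passes to FDIST with 3 and n-3 degrees of freedom to get the p-value.

```python
def anova_row(replicates_by_time):
    """Compute ss_HO, SS_full and the F statistic for one gene.

    replicates_by_time maps a timepoint label to that gene's LogFC replicates.
    """
    all_values = [v for vals in replicates_by_time.values() for v in vals]
    n = len(all_values)           # total data points for the gene (9 in our case)
    k = len(replicates_by_time)   # number of timepoints (3 in our case)

    # =SUMSQ(<all LogFC cells in the row>)
    ss_h0 = sum(v * v for v in all_values)

    # Sum over timepoints of =SUMSQ(range)-COUNTA(range)*AvgLogFC^2
    ss_full = 0.0
    for vals in replicates_by_time.values():
        avg = sum(vals) / len(vals)                        # =AVERAGE(range)
        ss_full += sum(v * v for v in vals) - len(vals) * avg ** 2

    # =((n-3)/3)*(ss_HO-SS_full)/SS_full, with 3 being the number of timepoints
    f_stat = ((n - k) / k) * (ss_h0 - ss_full) / ss_full
    return ss_h0, ss_full, f_stat


# Hypothetical LogFC replicates for one gene
gene = {
    "t15m": [1.0, 2.0, 3.0],
    "t30m": [0.0, 1.0, 2.0],
    "t120m": [-1.0, 0.0, 1.0],
}
ss_h0, ss_full, f_stat = anova_row(gene)
print(ss_h0, ss_full, f_stat)
```

Because ss_HO is a sum of squares about zero, a large F statistic indicates that the timepoint averages differ from zero by more than the replicate scatter would explain.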

Sanity Check Questions:

-Unadjusted p-value-

How many genes have p<0.05? and what is the percentage (out of 4468)?
- 1196, 26.8%
How many genes have p<0.01? and what is the percentage (out of 4468)?
- 731, 16.4%
How many genes have p<0.001? and what is the percentage (out of 4468)?
- 380, 8.5%
How many genes have p<0.0001? and what is the percentage (out of 4468)?
- 190, 4.3%

-Bonferroni & Benjamini and Hochberg p-value-

How many genes are p<0.05 for the Bonferroni-corrected p-value? and what is the percentage (out of 4468)?
- 103, 2.3%
How many genes are p <0.05 for the Benjamini and Hochberg-corrected p-value? and what is the percentage (out of 4468)?
- 677, 15.2%
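The percentages asked for in the sanity check questions follow directly from the recorded counts and the 4468 genes in the dataset. A small Python sketch of that arithmetic, using the counts listed above (the dictionary labels and function name are our own shorthand):

```python
TOTAL_GENES = 4468

# Counts of genes passing each cutoff, as recorded in the sanity check
counts = {
    "unadjusted p<0.05": 1196,
    "unadjusted p<0.01": 731,
    "unadjusted p<0.001": 380,
    "unadjusted p<0.0001": 190,
    "Bonferroni p<0.05": 103,
    "B-H p<0.05": 677,
}


def percent_of_genes(count, total=TOTAL_GENES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


percentages = {label: percent_of_genes(c) for label, c in counts.items()}
for label, pct in percentages.items():
    print(f"{label}: {counts[label]} genes, {pct}%")
```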

Microarray data was prepared to be loaded into the STEM software
A new worksheet was added into the Excel workbook, and named "Thiuram_stem".
Then all of the data from the "Thiuram_ANOVA" worksheet was copied and pasted into the "Thiuram_stem" worksheet using Paste Special > Paste Values.
- The leftmost column had the column header "Master_Index". This was renamed to "SPOT".
- Column B that says "ID" was renamed to "Gene Symbol". There was no column for standard name present on the data given.
The data was then filtered on the B-H corrected p-value to be > 0.05
- Once the data was filtered, we selected all of the rows (except for the header row) and deleted them by right-clicking and choosing "Delete Row" from the context menu. The filter was then undone. This ensured that we would cluster only the genes with a "significant" change in expression and not the noise.
Deleted all of the data columns EXCEPT for the Average Log Fold change columns for each timepoint.
- Renamed the data columns with just the time and units (for example, 15m, 30m, etc.).
  - Saved work.
An error was found in the ANOVA results, so the process is being repeated.
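The STEM preparation steps above can be sketched in Python as a filter-and-rename over worksheet rows. This is a minimal sketch, assuming each row is a dictionary keyed by column header; the B-H column name `S288C_thiuram_75uM_BH_p-value`, the function name, and the example rows are hypothetical, since the journal does not record them.

```python
def prepare_stem_rows(rows, bh_column, avg_columns, cutoff=0.05):
    """Keep genes whose B-H corrected p-value is not above the cutoff.

    rows: list of dicts keyed by worksheet column header.
    avg_columns: maps each AvgLogFC header to its short STEM name (e.g. "15m").
    """
    stem_rows = []
    for row in rows:
        if row[bh_column] > cutoff:
            continue  # these are the rows that were deleted in the worksheet
        # Rename the index and ID columns to the names used for STEM
        stem_row = {"SPOT": row["MasterIndex"], "Gene Symbol": row["ID"]}
        # Keep only the average log fold change columns, renamed to time and units
        for old_name, new_name in avg_columns.items():
            stem_row[new_name] = row[old_name]
        stem_rows.append(stem_row)
    return stem_rows


BH = "S288C_thiuram_75uM_BH_p-value"  # hypothetical header name
AVG = {
    "S288C_thiuram_75uM_AvgLogFC_t15m": "15m",
    "S288C_thiuram_75uM_AvgLogFC_t30m": "30m",
    "S288C_thiuram_75uM_AvgLogFC_t120m": "120m",
}
# Two made-up genes: one passes the cutoff, one does not
rows = [
    {"MasterIndex": 1, "ID": "GENE_A", BH: 0.01,
     "S288C_thiuram_75uM_AvgLogFC_t15m": 0.5,
     "S288C_thiuram_75uM_AvgLogFC_t30m": 1.2,
     "S288C_thiuram_75uM_AvgLogFC_t120m": -0.3},
    {"MasterIndex": 2, "ID": "GENE_B", BH: 0.40,
     "S288C_thiuram_75uM_AvgLogFC_t15m": 0.1,
     "S288C_thiuram_75uM_AvgLogFC_t30m": 0.0,
     "S288C_thiuram_75uM_AvgLogFC_t120m": 0.2},
]
stem_rows = prepare_stem_rows(rows, BH, AVG)
print(stem_rows)
```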

Conclusion

The first stage of our group's project was completed by referencing Week 8 and using Microsoft Excel to complete the tasks. The Excel file will be located on the FunGals page for viewing and download.

Acknowledgements

This section acknowledges partner Kaitlyn Nguyen (User:knguye66), Michael Armas (User:Marmas), Iliana Crespin (User:Icrespin), and Emma Young (User:eyoung20). We would also like to acknowledge Dr. Dahlquist (User:KDahlquist) for introducing and teaching the topic and direction of this assignment.

"Except for what is noted above, this individual journal entry was completed by me and not copied from another source." Knguye66 (talk) 18:49, 20 November 2019 (PST)

References

Dahlquist, K. (2019, November 19). Data Analysis. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis
Dahlquist, K. (2019, November 20). Final Project Deliverables. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Final_Project_Deliverables
Dahlquist, K. (2019, November 19). Week 12/13. In Wikipedia, Biological Databases. Retrieved 6:25, November 20, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_12/13
Dahlquist, K. (2019, October 17). Week 8. In Wikipedia, Biological Databases. Retrieved 6:30, October 21, 2019, from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_8

User Page

User:knguye66

Template Page

Template:knguye66

Table of all assignments and journal entries for BIO-367-01

Week	Individual Journal Entry	Shared Journal
Week 1	-	Class Journal Week 1
Week 2	knguye66 Week 2	Class Journal Week 2
Week 3	ILT1/YDR090C Week 3	Class Journal Week 3
Week 4	knguye66 Week 4	Class Journal Week 4
Week 5	DrugCentral Week 5	Class Journal Week 5
Week 6	knguye66 Week 6	Class Journal Week 6
Week 7	knguye66 Week 7	Class Journal Week 7
Week 8	knguye66 Week 8	Class Journal Week 8
Week 9	knguye66 Week 9	Class Journal Week 9
Week 10	knguye66 Week 10	Class Journal Week 10
Week 11	knguye66 Week 11	FunGals
Week 12/13	knguye66 Eyoung20 Week 12/13	FunGals
Week 15	knguye66 Eyoung20 Week 15	Class Journal Week 15

Eyoung20 user page

Assignment pages	Individual Journal	Class Journal
week 1	Eyoung20 journal week 1	Class Journal Week 1
week 2	Eyoung20 journal week 2	Class Journal Week 2
week 3	ASP1/YDR321W Week 3	Class Journal Week 3
week 4	Eyoung20 journal week 4	Class Journal Week 4
week 5	Ancient mtDNA Week 5	Class Journal Week 5
week 6	Eyoung20 journal week 6	Class Journal Week 6
week 7	Eyoung20 journal week 7	Class Journal Week 7
week 8	Eyoung20 journal week 8	Class Journal Week 8
week 9	Eyoung20 journal week 9	Class Journal Week 9
week 10	Eyoung20 journal week 10	Class Journal Week 10
week 11	Eyoung20 journal week 11	FunGals
week 12/13	Knguye66 Eyoung20 Week 12/13	FunGals
week 15	Knguye66 Eyoung20 Week 15	FunGals

@@ Line 16: / Line 16: @@
 # Copied the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for our strain and pasted it into a new worksheet.  Copied the columns containing the data for our strain and pasted it into the new worksheet
 #* Standard Name was not in the original downloaded data, so it a third column (C) was added with the Standard Names to the genes
-#* Coder/Project Manager confirmed with their guild on the format of the column names (STRAIN)_(NAME)_(CONCENTRATION)LogFC_(TIME)-(REPLICATES), ie. S288C_thiuram_75uM_LogFC_t15m-3
+#* Coder/Project Manager confirmed with their guild on the format of the column names (STRAIN)_(NAME)_(CONCENTRATION)_LogFC_(TIME)-(REPLICATES), ie. S288C_thiuram_75uM_LogFC_t15m-3
 #** Data was copied over, with times and repeats of the experiment starting from least to greatest
-# At the top of the first column to the right of the data, five column headers were created of the form (STRAIN)_(NAME)_(CONCENTRATION)AvgLogFC_(TIME) where STRAIN is your strain designation and (TIME) is 15, 30, etc. S288C_thiuram_75uM_LogFC_t15m-3
+# At the top of the first column to the right of the data, three column headers were created of the form (STRAIN)_(NAME)_(CONCENTRATION)_AvgLogFC_(TIME) where STRAIN is your strain designation and (TIME) is 15, 30, 120.
-# In the cell below the (STRAIN)_AvgLogFC_t15 header, type <code>=AVERAGE(</code>
+ie. S288C_thiuram_75uM_AvgLogFC_t15m
-# Then highlight all the data in row 2 associated with t15, press the closing paren key (shift 0),and press the "enter" key.
+# In the cell below the S288C_thiuram_75uM_AvgLogFC_(TIME) header, typed <code>=AVERAGE(</code>
-# This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
+# Then highlighted all the data in row 2 associated with t15m, pressed the closing parent key (shift 0),and pressed the "enter" key.
-# Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
+# This cell now contained the average of the log fold change data from the first gene at t=15 minutes.
-# Repeat steps (4) through (8) with the t30, t60, t90, and the t120 data.
+# Clicked on this cell and positioned the cursor at the bottom right corner. We saw our cursor change to a thin black plus sign (not a chubby white one). When it did, we double clicked, and the formula copied to the entire column of 4468 other genes.
-# Now in the first empty column to the right of the (STRAIN)_AvgLogFC_t120 calculation, create the column header (STRAIN)_ss_HO.
+# Repeated steps (4) through (8) with the t30m, and t120m data.
-# In the first cell below this header, type <code>=SUMSQ(</code>
+# Now in the first empty column to the right of the S288C_thiuram_75uM_AvgLogFC_t120m calculation, we created the column header S288C_thiuram_75uM_ss_HO.
-# Highlight all the LogFC data in row 2 (but not the AvgLogFC), press the closing paren key (shift 0),and press the "enter" key.
+# In the first cell below this header, typed <code>=SUMSQ(</code>
-# In the next empty column to the right of (STRAIN)_ss_HO, create the column headers (STRAIN)_ss_(TIME) as in (3).
+# Highlighted all the LogFC data in row 2 (but not the AvgLogFC), pressed the closing parent key (shift 0),and pressed the "enter" key.
-# Make a note of how many data points you have at each time point for your strain.  For most of the strains, it will be 4, but for dHAP4 t90 or t120, it will be "3", and for the wild type it will be "4" or "5".  Count carefully. Also, make a note of the total number of data points. Again, for most strains, this will be 20, but for example, dHAP4, this number will be 18, and for wt it should be 23 (double-check).
+# In the next empty column to the right of S288C_thiuram_75uM_ss_HO, created the column headers S288C_thiuram_75uM_ss_(TIME) as in (3).
-# In the first cell below the header (STRAIN)_ss_t15, type <code>=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2</code> and hit enter.
+# Made a note of how many data points we had at each time point for our strain.  Ours had 3 replicates each. Also, made a note of the total number of data points (4468).
-#* The <code>COUNTA</code> function counts the number of cells in the specified range that have data in them (i.e., does not count cells with missing values).
+# In the first cell below the header S288C_thiuram_75uM_ss_t15, typed <code>=SUMSQ(<range of cells for logFC_t15>)-COUNTA(<range of cells for logFC_t15>)*<AvgLogFC_t15>^2</code> and hit enter.
-#* The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
+#* The <code>COUNTA</code> function counted the number of cells in the specified range that had data in them (i.e., did not count cells with missing values).
-#* The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
+#* The phrase <range of cells for logFC_t15> was replaced by the data range associated with t15m.
-#* Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
+#* The phrase <AvgLogFC_t15> was replaced by the cell number in which we computed the AvgLogFC for t15m, and the "^2" squares that value.
-# Repeat this computation for the t30 through t120 data points.  Again, be sure to get the data for each time point, type the right number of data points, and get the average from the appropriate cell for each time point, and copy the formula to the whole column for each computation.
+#* Upon completion of this single computation, used the Step (7) trick to copy the formula throughout the column.
-# In the first column to the right of (STRAIN)_ss_t120, create the column header (STRAIN)_SS_full.
+# Repeated this computation for the t30m through t120m data points.
-# In the first row below this header, type <code>=sum(<range of cells containing "ss" for each timepoint>)</code> and hit enter.
+# In the first column to the right of S288C_thiuram_75uM_ss_t120m, created the column header S288C_thiuram_75uM_SS_full.
-# In the next two columns to the right, create the headers (STRAIN)_Fstat and (STRAIN)_p-value.
+# In the first row below this header, typed <code>=sum(<range of cells containing "ss" for each timepoint>)</code> and hit enter.
-# Recall the number of data points from (13): call that total n.
+# In the next two columns to the right, created the headers S288C_thiuram_75uM_Fstat and S288C_thiuram_75uM_p-value.
-# In the first cell of the (STRAIN)_Fstat column, type <code>=((n-5)/5)*(<(STRAIN)_ss_HO>-<(STRAIN)_SS_full>)/<(STRAIN)_SS_full></code> and hit enter.
+# Recalled the number of data points from (13): called that total n.
-#* Don't actually type the n but instead use the number from (13). Also note that "5" is the number of timepoints.<!-- and the dSWI4 strain has 4 timepoints (it is missing t15).-->
+# In the first cell of the S288C_thiuram_75uM_Fstat column, typed <code>=((n-3)/3)*(<(S288C_thiuram_75uM_ss_HO>-<(S288C_thiuram_75uM_SS_full>)/<(S288C_thiuram_75uM_SS_full></code> and hit enter.
-#* Replace the phrase (STRAIN)_ss_HO with the cell designation.
+#* n =9. "3" is the number of timepoints (ie. t15m, t30m, t120m)
-#* Replace the phrase <(STRAIN)_SS_full> with the cell designation.
+#* Replaced the phrase S288C_thiuram_75uM_ss_HO with the cell designation.
-#* Copy to the whole column.
+#* Replaced the phrase <S288C_thiuram_75uM_SS_full> with the cell designation.
-# In the first cell below the (STRAIN)_p-value header, type <code>=FDIST(<(STRAIN)_Fstat>,5,n-5)</code> replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. <!--(Again, note that the number of timepoints is actually "4" for the dSWI4 strain)-->.  Copy to the whole column.
+#* Copied to the whole column.
-# Before we move on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
+# In the first cell below the S288C_thiuram_75uM_p-value header, typed <code>=FDIST(<(S288C_thiuram_75uM_Fstat>,3,9-3)</code> replacing the phrase <(STRAIN)_Fstat> with the cell designation and the "n" as in (13) with the number of data points total.
+# Before we moved on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
 #*  Click on cell A1 and click on the Data tab.  Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
 #* Click on the drop-down arrow on your (STRAIN)_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
