Difference between revisions of "Kzebrows Week 14"
(Running GenMAPP and MAPPFinder) |
(→Electronic Lab Notebook: upload of links to files.) |
||
Line 78: | Line 78: | ||
*I selected columns A through F and sorted them by MasterIndex Column A in ascending order. | *I selected columns A through F and sorted them by MasterIndex Column A in ascending order. | ||
*I copied column F and pasted values into the right column of the Stats sheet. | *I copied column F and pasted values into the right column of the Stats sheet. | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015123.xlsx | RX Compiled Raw Data 12/3]] | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015123.txt | RX .txt format GenMAPP ready 12/3]] | ||
====Running GenMAPP and MAPPFinder==== | ====Running GenMAPP and MAPPFinder==== | ||
Line 92: | Line 96: | ||
I then went back to my Compiled Raw Data file to make the necessary changes and perform a re-export for GenMAPP. In the "ForGenMAPP" file I replaced all periods in the headings with the word "pt" for 21 total replacements. Then, I replaced all "#DIV/0!" with a space character which yielded 187 replacements. I saved this file as an Excel file and then also as a tab-delimited (.txt) file. I chose this new dataset in the Expression Dataset Manager and clicked "OK" to convert the raw expression data and generate the exceptions file. 416 errors were detected in the raw data. | I then went back to my Compiled Raw Data file to make the necessary changes and perform a re-export for GenMAPP. In the "ForGenMAPP" file I replaced all periods in the headings with the word "pt" for 21 total replacements. Then, I replaced all "#DIV/0!" with a space character which yielded 187 replacements. I saved this file as an Excel file and then also as a tab-delimited (.txt) file. I chose this new dataset in the Expression Dataset Manager and clicked "OK" to convert the raw expression data and generate the exceptions file. 416 errors were detected in the raw data. | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015126.xlsx | RX Compiled Raw Data as of 12/6]] | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015126.txt | RX .txt format updated as of 12/6]] | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015126.EX.txt | RX Exceptions file]] | ||
+ | |||
+ | [[Media:CompiledRaw data RPRX IDLR EYKZ2015126.gex | RX .gex file]] | ||
+ | |||
+ | |||
====Instructions from Dr. Dahlquist==== | ====Instructions from Dr. Dahlquist==== |
Revision as of 23:32, 6 December 2015
Contents
- 1 Electronic Lab Notebook
- 1.1 Corrections from talk page
- 1.2 Normalizing the log ratios for the set of slides in the experiment
- 1.3 Perform statistical analysis on the ratios
- 1.4 Calculate Bonferroni p value correction
- 1.5 Analysis corrections
- 1.6 Prepare file for GenMAPP
- 1.7 Sanity Check
- 1.8 Benjamini & Hochberg p value correction
- 1.9 Running GenMAPP and MAPPFinder
- 1.10 Instructions from Dr. Dahlquist
Electronic Lab Notebook
Corrections from talk page
- Downloaded the most recent version of the file from the OTS Files page
- Renamed all columns by replacing LR with LogFC
- Re-named Sheet 1 "CompiledRawData" and copied all of the data from it and pasted it into Sheet 2, which I re-named "MasterSheet":
- Deleted all ID columns except for column A, which I re-named "ID"
- Inserted a column to the left of column B and re-named it "MasterIndex"
- Typed "1" and cell B2 and 2 and Cell B3 and selected both cells. I double-clicked on the + sign to fill the entire column with numbers 1-4208.
- Selected the data and sorted A-->Z on the "ID" column
- Deleted rows that have "Empty" or "Blank####" ID. This left me with 3,926 files (3,927 minus 1 header row).
- Sorted by MasterIndex column to put IDs back in original order, smallest to largest.
- Replaced "Error" with nothing and got 585 replacements.
- Copy and pasted data into Sheet 3, which I re-named "ScalingCentering".
The next portion of the assignment was done by following the instructions in Sample Microarray Analysis Vibrio cholerae.
Normalizing the log ratios for the set of slides in the experiment
- I inserted a new worksheet and named it Scaled_Centered
- I copied all data from the MasterSheet and pasted it into cell A1 of the Scaled_Centered sheet
- I inserted two rows in between the top row of headers and first data row. I named cell A2 "Average" and named cell A3 "StdDev".
- In cell C2 I typed =AVERAGE(C4:C3929) and in cell C3 I typed =STDEV(C4:C3929). I pressed enter and copied this equation across the rest of the columns through column AL.
- I then copied the column headings for all data columns and pasted them to the right of the last column. Using the copy/paste tool I renamed each column with "_Scaled_Centered" at the end.
- In cell AM4 I typed =(C4-C$2)/C$3 indicating that I wanted data in cell C4 to have the average subtracted from it and then to divide it by the standard deviation. I used the "$" sign to indicate that I did not want the average and standard deviation values to change even when the equation was pasted for the entire column of genes.
- I then copy and pasted that equation across the entire column by clicking on the original cell and double-clicking on the black plus sign. I copy and pasted this equation for each column of the data.
Perform statistical analysis on the ratios
- I inserted a new worksheet and named it "Statistics"
- I copied and pasted the ID column from the ScalingCentering worksheet into the first column of the new worksheet
- I copied all Scaled_Centered columns from the ScalingCentering worksheet and pasted the values into column B1 of the new sheet
- Deleted "Average" and "StDev" columns
- Decided to divide and conquer with the rest of the statistical testing. I did RX and Erich did RP. I deleted Erich's RP columns to make my calculations easier.
- Inserted column to the right and typed headers Avg_LogFC RX-0.5-10, Avg_LogFC RX-0.5-30, Avg_LogFC RX-0.5-60, Avg_LogFC RX-1-10, Avg_LogFC RX-1-30, and Avg_LogFC RX-1-60 into the top cells of the next 6 columns.
- Computed the average log fold change per treatment and time period by typing the following equations: =AVERAGE(B2:D2), =AVERAGE(E2:G2), =AVERAGE(H2:J2), =AVERAGE(K2:M2), =AVERAGE(N2:P2), and =AVERAGE(Q2:S2) below each corresponding Avg_LogFC column. I copied and pasted these equations for the whole column.
- I then inserted a new column to the right and named it "Avg_LogFC_all" and typed the equation =AVERAGE(T2:Y2).
- I inserted a new column next to Avg_LogFC_all and labeled it "Tstat". Into column AA2 I typed =AVERAGE(T2:Y2)/(STDEV(T2:Y2)/SQRT(6)), indicating 6 replicates. I copied this equation into the whole column.
- I inserted a new column and called it "Pvalue". I entered the equation =TDIST(ABS(AA2),5,2) for AA2, the T-stat column and 5 degrees of freedom.
Calculate Bonferroni p value correction
- I inserted a new column to the right with the label Bonferroni_Pvalue (did this twice)
- In the first Bonferroni_Pvalue column I typed =AB2*3926 and copied this equation for the entire column.
- In order to replace any corrected P value that was greater than 1 with the number 1, I typed the equation =IF(AC2>1,1,AC2) and pasted it through the entire second Bonferroni_Pvalue column.
- I then saved the most updated version of the file to my ThawSpace to resume working on it on Thursday.
Analysis corrections
After uploading my 12/1 file to the OTS Files page and consulting with Dr. Dahlquist, I realized that I needed to make several corrections. Originally, I calculated the T-stat, Pvalue, Bonferroni P value, and adjusted Bonferroni P value using a LogFC_all column, which was incorrect. Because this experiment involved multiple treatments (of which I was analyzing 6 for the RX samples) I needed to calculate those four things for all 6 treatments (RX-0.5-10, RX-0.5-30, RX-0.5-60, RX-1-10, RX-1-30, and RX-1-60). I inserted columns for all four items for all 6 treatments I was looking at, and following the same procedure as before, I calculated the T-stat, P value, Bonferroni P value, and adjusted Bonferroni P value for each treatment. Here are some sample equations for the RX-0.5-10 treatment:
- Column Z named Tstat_LogFC RX-0.5-10: =AVERAGE(B2:D2)/(STDEV(B2:D2)/SQRT(3)), denoting that I want a T statistic representative of all three RX-0.5-10 replicates.
- Column AA named Pvalue_LogFC RX-0.5-10: =TDIST(ABS(Z2),2,2), indicating that i wanted the P value calculated from the T statistic column.
- Column AB named Bonferroni_Pvalue_LogFC RX-0.5-10: =AA2*3926
- Column AC named Bonferroni_Pvalue_LogFC RX-0.5-10: =IF(AB2>1,1,AB2) in reference to the first Bonferroni P value column.
Dr Dahlquist also made a change to the file where she replaced all of the nothing boxes with a space/character/space. The amount of replacements was still the same as before but this was done to insure that the analysis was as accurate as possible. I proceeded to do a sanity check before moving on to the Benjamini-Hochberg corrections.
Prepare file for GenMAPP
- I inserted a new worksheet and named it "forGenMAPP."
- I selected all from the Statistics worksheet and pasted the values on the new sheet.
- I selected all fold changes and changed them to 2 decimal places by selecting Format > Cells.
- I selected all columns with p values including the T stat column and changed them to 4 decimal places by selecting Format > Cells.
- I deleted the left-most Bonferroni p value columns (the ones without the "IF" statements) from each treatment.
- I inserted a column to the right of the ID column and named it "SystemCode". I filled the whole column with the letter N by typing N in the first cell after the heading and double-clicking on the little black cross.
- I selected File > Save As "Text (Tab-delimited) (*.txt). I clicked through the different warnings and uploaded both files to the team's Wiki page.
Sanity Check
Next I performed a sanity check to ensure that I performed the data analysis for RX-treated samples correctly. I opened the "forGenMAPP" tab and selected "Custom Filter" from the filter options. I then performed the following procedure for all 6 RX treatments.
- Filtered P value column by less than 0.05, 0.01, and 0.001. Recorded results and percentages out of 3,926.
- Filtered Bonferroni p value column by less than 0.05 and recorded results and percentages out of 3,926. The table can be located on my updated excel file from 12/3 found on the team Wiki.
From the check I found that RX-1-30 had the most significantly changed genes at 66%, which was at least 7% higher than the treatment with the next most significantly changed genes (59%). I chose this set of p values for the Benjamini & Hochberg p value correction.
Benjamini & Hochberg p value correction
- I inserted a new worksheet named "B-H_Pvalue"
- Copy and pasted the ID column from the previous worksheet into the first column of the B-H_Pvalue worksheet.
- Inserted a column on the far left and named it "MasterIndex". I typed 1 into Cell A2 and 2 into Cell A3 and selected both, filling the entire column with a series of numbers from 1 to 3926.
- I pasted the unadjusted p values from column AN on the Statistics worksheet for Pvalue_LogFC RX-1-30.
- I selected columns A, B, and C and sorted by ascending values on Column C (Sort & Filter > Custom Sort).
- Typed "Rank" into cell D1 and typed 1 into cell D2 and 2 into cell D3. I selected both and filled the column with numbers 1 to 3926.
- Typed B-H_Pvalue in cell E1 and typed the formula =(C2*3926)/D2 into cell E2. I copied the equation for the whole column.
- Typed B-H_Pvalue into cell F1 and typed =IF(E2>1,1,E2) into cell F2. Copied to entire column.
- I selected columns A through F and sorted them by MasterIndex Column A in ascending order.
- I copied column F and pasted values into the right column of the Stats sheet.
RX .txt format GenMAPP ready 12/3
Running GenMAPP and MAPPFinder
The following portion of the assignment was based off instructions found at GenMAPP and MAPPFinder Protocols.
I downloaded the .gdb file from Trixie's page and opened GenMAPP. I selected my Thaw Space as the root folder and went to Choose Gene Database > Sf-Std_20151201.gdb. Then, I went to the Expression Dataset Manager and attempted to choose my dataset, the CompiledRawData .txt file from 12/3. I received the following error message:
At that point I went to speak to Dr. Dahlquist. She recommended three things:
- Replace the periods in the headers with a word
- Replace all #DIV/0! with a space character
- Communicate all gene IDs from the exceptions file listed as "not found" with the QA and Coder
I then went back to my Compiled Raw Data file to make the necessary changes and perform a re-export for GenMAPP. In the "ForGenMAPP" file I replaced all periods in the headings with the word "pt" for 21 total replacements. Then, I replaced all "#DIV/0!" with a space character which yielded 187 replacements. I saved this file as an Excel file and then also as a tab-delimited (.txt) file. I chose this new dataset in the Expression Dataset Manager and clicked "OK" to convert the raw expression data and generate the exceptions file. 416 errors were detected in the raw data.
RX Compiled Raw Data as of 12/6
RX .txt format updated as of 12/6
Instructions from Dr. Dahlquist
- Increased LogFC>0.25 and p<0.05
- Decreased LogFC<-0.25 and p<0.05
- Do Benjamini-Hochberg correction for the treatment with the most significant genes out of all 6