Electronic Lab Notebook

All Files for Week 8 Journal

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved a copy with my initials (can be found in the Week 8 Zip File)

Normalize the log ratios for the set of slides in the experiment

I created a new Worksheet in my Excel file and named it "scaled_centered"
I transferred all of the data in the "compiled_raw_data" worksheet into the "scaled_centered" worksheet
I then created two new rows and titled them "Average" and "StdDev"
I computed the Average log ratio for each column of data (where each column represents one chip) using =AVERAGE(B4:B5224), and copied the formula into the rest of the row
Then, I computed the Standard Deviation of the log ratios using the equation =STDEV(B4:B5224), and then copied this formula into the rest of the row
I then added names to the columns to the right of the raw data, following the format "A1_scaled_centered", "A2_scaled_centered", etc.
In cell N4, I typed the following equation: =(B4-B$2)/B$3 and copied this formula into the rest of the column
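- This step is important because it centers the log ratios on each chip at an average of 0 and scales them to a standard deviation of 1, so that the chips can be compared with one another.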
I repeated this step for all of the "_scaled_centered" columns
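
As a cross-check outside of Excel, the scaling and centering step above could be sketched in Python roughly as follows. This is only a sketch: the function name and the example values are made up for illustration, and it assumes the column has no missing values.

```python
import statistics

def scale_and_center(log_ratios):
    """Return scaled and centered values for one chip (one spreadsheet column)."""
    mean = statistics.mean(log_ratios)    # Excel: =AVERAGE(B4:B5224)
    stdev = statistics.stdev(log_ratios)  # Excel: =STDEV(B4:B5224), sample standard deviation
    # Excel: =(B4-B$2)/B$3, applied to every row of the column
    return [(x - mean) / stdev for x in log_ratios]

# Hypothetical log ratios for five genes on one chip (not real data).
chip_a1 = [0.5, -1.2, 2.0, 0.3, -0.6]
scaled = scale_and_center(chip_a1)
```

After this step the values for a chip should have an average of 0 and a standard deviation of 1, which is a quick way to confirm the formula was copied down correctly.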

Perform statistical analysis on the ratios

After creating a new worksheet called "statistics", I copied and pasted the "ID" column from the "scaled_centered" worksheet
I then copied over all of the "_scaled_centered" columns from the "scaled_centered" sheet into the "statistics" sheet using Paste Special>Values, and deleted the rows titled "Average" and "StdDev"
I then titled three new columns "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"
Under each of these columns, I calculated the average log fold change with the equation =AVERAGE(B2:E2) where the designated cells selected correspond to the scaled and centered data for that patient ("Avg_LogFC_A" has cells from all of the "A#_scaled_centered" columns)
The previous step was modified and repeated for each of the other "Avg_LogFC" columns
I then computed the averages of the averages in the column titled "Avg_LogFC_all"
I created a new header for the column next to "Avg_LogFC_all" and computed the Tstat with the formula =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
I then created another header for the column adjacent to the "Tstat" column and named it "Pval"
In this column, I calculated the P value using the equation: =TDIST(ABS(R2),degrees of freedom,2)
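
The Tstat and Pval calculations above could be reproduced in Python along these lines. This is a sketch under stated assumptions: the function names and example values are hypothetical, the degrees of freedom are taken to be the number of replicates minus 1 (the usual choice for a one-sample t test, which the formulas above leave unspecified), and the p value is obtained by numerically integrating the t density rather than by calling a statistics library.

```python
import math
import statistics

def t_statistic(values):
    """Excel: =AVERAGE(range)/(STDEV(range)/SQRT(number of replicates))."""
    n = len(values)
    # Note: this divides by zero if all replicates are identical (stdev of 0).
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value, analogous to =TDIST(ABS(t), df, 2).

    Uses Simpson's rule on [0, |t|]; steps must be even. Accuracy is
    adequate for moderate |t| but degrades for very large values.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1 - 2 * area)

# Hypothetical average log fold changes for one gene (not real data).
avg_log_fc = [1.0, 2.0, 3.0]
t = t_statistic(avg_log_fc)
p = two_tailed_p(t, df=len(avg_log_fc) - 1)
```

With three values the test has only 2 degrees of freedom, so even a fairly large t statistic gives a modest p value.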

Calculate the Bonferroni p value correction

I created two columns to the right of the "Pval" column and named both of them "Bonferroni_Pvalue"
In the first column, I wrote the equation =S2*5221 in the first row and then copied the formula throughout the column
In the second column, I created a condition that replaced any p value greater than 1 with a 1 using the equation =IF(T2>1,1,T2)
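
The two Bonferroni columns above amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch (the function name and example p values are made up for illustration):

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p value by the number of tests and cap at 1.

    Mirrors =S2*5221 followed by =IF(T2>1,1,T2).
    """
    if n_tests is None:
        n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]

# Hypothetical raw p values (not real data); 5221 is the number of genes tested.
raw = [0.00001, 0.0004, 0.03]
corrected = bonferroni(raw, n_tests=5221)
```

Only very small raw p values stay below 1 after the correction, which is why so few genes tend to pass this cutoff.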

Calculate the Benjamini & Hochberg p value Correction

I created a new worksheet called "B-H_Pvalue"
- This is important because we will be reordering the data and we want to prevent the rest of the sheet from being affected by the calculations required for the B-H p value
I copied the ID column from the previous worksheets into this sheet, and created a "MasterIndex" column to the left of the ID column
In the MasterIndex column, I typed 1 and 2 into cells A2 and A3, selected these two cells, and then double clicked on the black + sign on the corner of the selection to fill every cell in this column that has corresponding data in the "ID" column
I then copied the p values column from the previous sheet into Column C using Paste Special>Values
I selected all three columns with data in them and sorted them by ascending values in column C.
I then created a new column titled "Rank" and numbered each of the rows in that column using the same strategy as the "MasterIndex" column
Two new columns were titled "B-H_Pvalue"
In the first "B-H_Pvalue" column, I entered the equation =(C2*5221)/D2 and copied that formula into the rest of the column
In the second "B-H_Pvalue" column I created a condition where any B-H p value that is greater than 1 is replaced with the number 1 using the formula =IF(E2>1,1,E2), and copied the formula to the rest of the column
I then selected columns A through F, sorted them in ascending order by the MasterIndex column, and copied the last "B-H_Pvalue" column into my "statistics" sheet
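
The sort, rank, and re-sort steps above could be written in Python roughly as follows. This sketch follows the spreadsheet procedure exactly as described (p value times number of genes, divided by rank, capped at 1); the function name and example values are hypothetical. Statistics packages usually add one more step that forces the adjusted values to be non-decreasing with rank, which is not part of the steps above and is not included here.

```python
def benjamini_hochberg(p_values):
    """Return B-H corrected p values in the original row order."""
    n = len(p_values)
    # Sorting the row positions by p value plays the role of sorting the sheet;
    # keeping the positions is what the MasterIndex column is for.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # Excel: =(C2*5221)/D2 followed by =IF(E2>1,1,E2)
        adjusted[idx] = min(p_values[idx] * n / rank, 1.0)
    return adjusted

# Hypothetical raw p values for four genes (not real data).
raw = [0.04, 0.001, 0.03, 0.5]
bh = benjamini_hochberg(raw)
```

Writing each result back to its original position replaces the final re-sort by MasterIndex, so the output lines up with the ID column in the "statistics" sheet.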

Prepare file for GenMAPP

A new sheet was created and named "forGenMAPP"
Everything was copied from the statistics sheet into this sheet using Paste Special>Values
All of the values in the fold change columns were modified to only show 2 decimal places
All of the columns containing p values were modified to show only 4 decimal places
The left-most Bonferroni p value was removed, leaving behind only the column containing the "if" statement
A new column was added to the left of the "ID" column with the title "SystemCode" and was populated with the value "N" in every row
This worksheet was saved as a tab delimited *.txt file

Sanity Check: Number of genes significantly changed

Sanity Check: Compare individual genes with known data

MAPPFinder Analysis of Vibrio cholerae Microarray Data (Part 2)

Reminders

Finish the Sanity Check
I used the 2010 Vibrio cholerae database

When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio cholerae database.
- This difference in error counts is most likely due to how many gene IDs each version of the database covers. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version contained fewer genes than mine, so more of the IDs in his file could not be matched.
We produced our color set to indicate increased expression.
Top 10 Gene Ontology Results:
1. branched chain family amino acid metabolic process
2. branched chain family amino acid biosynthetic process
3. IMP metabolic process
4. IMP biosynthetic process
5. purine ribonucleoside monophosphate metabolic process
6. purine nucleoside monophosphate metabolic process
7. purine nucleoside monophosphate biosynthetic process
8. purine ribonucleoside monophosphate biosynthetic process
9. arginine metabolic process
10. cellular nitrogen compound biosynthetic process
Searching through MAPPFinder for published genes:
- VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity
- VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
- VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity
- VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity
- VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
- VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glutathione synthase activity
- VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
- VCA0583: outer membrane-bounded periplasmic space
The GO term I clicked on was outer membrane-bounded periplasmic space
- The gene I was looking for was VCA0583, which translated to Q9KM06 on UniProt
- This gene did not have a significant change in expression
- This gene is annotated with the GO terms transporter activity, transport, and outer membrane-bounded periplasmic space
TODO: Look at the top of the spreadsheet. There are rows of information that give you the background information on how MAPPFinder made the calculations. Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.
TODO: Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae). You need to give a biological interpretation of what each of these GO terms in your filtered list has to do with the pathogenicity of the bacterium. You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

Anuvarsh Week 8
