Anuvarsh Week 8

From LMU BioDB 2015
Revision as of 01:32, 24 October 2015 by Anuvarsh (Talk | contribs) (Prepare file for GenMAPP: finished the procedure)


Electronic Lab Notebook

Statistical Analysis of Vibrio cholerae Microarray Data (Part 1)

  • I downloaded the Merrell_Compiled_Raw_Data_Vibrio.xls file to my Desktop and saved a copy with my initials (can be found in the Week 8 Zip File)

Normalize the log ratios for the set of slides in the experiment

  • I created a new Worksheet in my Excel file and named it "scaled_centered"
  • I copied all of the data in the "compiled_raw_data" worksheet into the "scaled_centered" worksheet
  • I then created two new rows and titled them "Average" and "StdDev"
  • I computed the Average log ratio for each column of data (where each column represents one chip) using =AVERAGE(B4:B5224), and copied the formula into the rest of the row
  • Then, I computed the Standard Deviation of the log ratios using the equation =STDEV(B4:B5224), and then copied this formula into the rest of the row
  • I then added names to the columns to the right of the raw data, following the format "A1_scaled_centered", "A2_scaled_centered", etc.
  • In cell N4, I typed the following equation: =(B4-B$2)/B$3 and copied this formula into the rest of the column
    • This step is important because it centers the log ratios of each chip at a mean of 0 and scales them to a standard deviation of 1, which makes the chips comparable to one another.
  • I repeated this step for all of the "_scaled_centered" columns
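
As a cross-check outside of Excel, the scaling and centering above can be sketched in Python. This is only an illustration and not part of the recorded procedure: the function name `scale_and_center` is hypothetical, and treating blank cells as `None` is an assumption about how missing spots would be represented. `statistics.stdev` is the sample standard deviation, which matches Excel's STDEV.

```python
import statistics

def scale_and_center(log_ratios):
    """Subtract the column mean and divide by the column standard deviation.

    Mirrors =(B4-B$2)/B$3 applied down one chip's column.
    None entries (assumed to stand for blank cells) are skipped when
    computing the mean and standard deviation and are left as None.
    """
    present = [x for x in log_ratios if x is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # sample standard deviation, like Excel STDEV
    return [None if x is None else (x - mean) / sd for x in log_ratios]

# Hypothetical log ratios for one chip, with one blank spot
chip = [1.0, 2.0, None, 3.0]
scaled = scale_and_center(chip)
```

After this transformation the non-blank values of each chip have a mean of 0 and a standard deviation of 1.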

Perform statistical analysis on the ratios

  • After creating a new worksheet called "statistics", I copied and pasted the "ID" column from the "scaled_centered" worksheet
  • I then copied over all of the "_scaled_centered" columns from the "scaled_centered" sheet into the "statistics" sheet using Paste Special>Values, and deleted the rows titled "Average" and "StdDev"
  • I then titled three new columns "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"
  • Under each of these columns, I calculated the average log fold change with the equation =AVERAGE(B2:E2) where the designated cells selected correspond to the scaled and centered data for that patient ("Avg_LogFC_A" has cells from all of the "A#_scaled_centered" columns)
  • The previous step was modified and repeated for each of the other "Avg_LogFC" columns
  • I then computed the averages of the averages in the column titled "Avg_LogFC_all"
  • I created a new column next to "Avg_LogFC_all", titled it "Tstat", and computed the t statistic with the formula =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates))
  • I then created another header for the column adjacent to the "Tstat" column and named it "Pval"
  • In this column, I calculated the P value using the equation: =TDIST(ABS(R2),degrees of freedom,2)
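
The Tstat and Pval formulas above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the spreadsheet itself: the function names are hypothetical, the number of replicates is taken to be the number of values passed in, and the two-tailed p value is obtained by numerically integrating the Student t density with Simpson's rule (the standard library has no t distribution), so it is an approximation of what TDIST returns.

```python
import math
import statistics

def t_statistic(values):
    """Mean divided by its standard error, like
    =AVERAGE(range)/(STDEV(range)/SQRT(n)), with n = len(values)."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df):
    """Approximate two-tailed p value, analogous to =TDIST(ABS(t), df, 2).

    Integrates the density from 0 to |t| with Simpson's rule and
    returns 1 minus twice that area.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    steps = 2 * max(1000, math.ceil(upper * 50))  # even count, step size <= 0.01
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_area)

# Hypothetical averaged log fold changes for one gene
values = [1.0, 2.0, 3.0]
t = t_statistic(values)
p = two_tailed_p(t, df=len(values) - 1)
```

The degrees of freedom used in the example, one less than the number of values, is the usual choice for a one-sample t test and is an assumption here, since the notebook entry does not state the value that was entered.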

Calculate the Bonferroni p value correction

  • I created two columns to the right of the "Pval" column and named both of them "Bonferroni_Pvalue"
  • In the first column, I wrote the equation =S2*5221 in the first row and then copied the formula throughout the column
  • In the second column, I created a condition that replaced any p value greater than 1 with a 1 using the equation =IF(T2>1,1,T2)
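
The two Bonferroni columns above amount to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch of that, with a hypothetical function name and made-up p values, assuming 5221 tests as in the formula above:

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p value by the number of tests and cap at 1.

    Combines =S2*5221 and =IF(T2>1,1,T2) into one step.
    If n_tests is not given, the number of p values is used.
    """
    if n_tests is None:
        n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]

# Hypothetical p values for two genes, corrected for 5221 tests
corrected = bonferroni([0.00001, 0.5], n_tests=5221)
```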

Calculate the Benjamini & Hochberg p value Correction

  • I created a new worksheet called "B-H_Pvalue"
    • This is important because we will be reordering the data and we want to prevent the rest of the sheet from being affected by the calculations required for the B-H p value
  • I copied the ID column from the previous worksheets into this sheet, and created a "MasterIndex" column to the left of the ID column
  • In the MasterIndex column, I typed 1 and 2 into cells A2 and A3, selected these two cells, and then double-clicked the black + sign at the corner of the selection to fill the series down every row that has corresponding data in the "ID" column
  • I then copied the p values column from the previous sheet into Column C using Paste Special>Values
  • I selected all three columns with data in them and sorted them by ascending values in column C.
  • I then created a new column titled "Rank" and numbered each of the rows in that column using the same strategy as the "MasterIndex" column
  • Two new columns were titled "B-H_Pvalue"
  • In the first "B-H_Pvalue" column, I entered the equation =(C2*5221)/D2 and copied that formula into the rest of the column
  • In the second "B-H_Pvalue" column I created a condition where any B-H p value that is greater than 1 is replaced with the number 1 using the formula =IF(E2>1,1,E2), and copied the formula to the rest of the column
  • I then selected columns A through F, sorted them in ascending order by the MasterIndex column, and copied the last "B-H_Pvalue" column into my "statistics" sheet
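
The sort, rank, adjust, and re-sort steps above can be sketched in Python as follows. The function name is hypothetical and the p values are made up. The sketch mirrors the spreadsheet steps only: it computes p × n / rank and caps the result at 1. Some statistics packages additionally force the adjusted values to be non-decreasing with rank, which neither the steps above nor this sketch do.

```python
def benjamini_hochberg(p_values):
    """Spreadsheet-style B-H adjustment.

    Ranks the p values from smallest to largest, computes
    p * n / rank (like =(C2*5221)/D2), caps each result at 1
    (like =IF(E2>1,1,E2)), and returns the adjusted values in the
    original order, which is what sorting back by MasterIndex does.
    """
    n = len(p_values)
    # Positions sorted by ascending p value; position plays the role of MasterIndex
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [None] * n
    for rank, index in enumerate(order, start=1):
        adjusted[index] = min(p_values[index] * n / rank, 1.0)
    return adjusted

# Hypothetical p values for three genes
adjusted = benjamini_hochberg([0.04, 0.01, 0.03])
```

Writing each adjusted value back at its original position avoids the second sort and means the rest of the data never has to be reordered.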

Prepare file for GenMAPP

  • A new sheet was created and named "forGenMAPP"
  • Everything was copied from the statistics sheet into this sheet using Paste Special>Values
  • All of the values in the fold change columns were modified to only show 2 decimal places
  • All of the columns containing p values were modified to show only 4 decimal places
  • The left-most Bonferroni p value was removed, leaving behind only the column containing the "if" statement
  • A new column titled "SystemCode" was added to the left of the "ID" column and populated with the value "N" in every row
  • This worksheet was saved as a tab delimited *.txt file
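
For comparison, the formatting steps above could be sketched in Python. Everything here is an assumption made for illustration: the function name, the column names in the example, and the choice to represent each row as a dictionary. It only shows the shape of the output (a leading "SystemCode" column filled with "N", fold changes with 2 decimal places, p values with 4, and tab-delimited text) and is not a description of what GenMAPP itself requires beyond what is listed above.

```python
import csv
import io

def to_genmapp_text(rows, fold_change_columns, p_value_columns):
    """Return tab-delimited text with a leading SystemCode column.

    rows: list of dicts that all share the same keys, in column order.
    fold_change_columns: names of columns written with 2 decimal places.
    p_value_columns: names of columns written with 4 decimal places.
    """
    header = ["SystemCode"] + list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        out = ["N"]  # SystemCode value used in every row
        for name, value in row.items():
            if name in fold_change_columns:
                value = f"{value:.2f}"
            elif name in p_value_columns:
                value = f"{value:.4f}"
            out.append(value)
        writer.writerow(out)
    return buffer.getvalue()

# Hypothetical single-gene example with made-up numbers
rows = [{"ID": "VC0028", "Avg_LogFC_all": 1.23456, "Pval": 0.012345}]
text = to_genmapp_text(rows, {"Avg_LogFC_all"}, {"Pval"})
```

The returned string can then be written to a *.txt file.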

Sanity Check: Number of genes significantly changed

Sanity Check: Compare individual genes with known data

MAPPFinder Analysis of Vibrio cholerae Microarray Data (Part 2)

Reminders

  • Finish the Sanity Check
  • I used the 2010 Vibrio cholerae database


  • When uploading my tab delimited file to the GenMAPP software for conversion, 121 errors were detected in my raw data.
  • My partner, Brandon L., had 772 errors detected in his raw data conversion while using the 2009 Vibrio cholerae database.
    • This difference in errors is most likely due to how many gene IDs each database covers. Because Brandon was using an older version of the database while I was using a newer version, it is possible that his version contained fewer genes than mine, so more of the IDs in his file could not be matched.
  • We produced our color set to indicate increased expression.
  • Top 10 Gene Ontology Results:
    1. branched chain family amino acid metabolic process
    2. branched chain family amino acid biosynthetic process
    3. IMP metabolic process
    4. IMP biosynthetic process
    5. purine ribonucleoside monophosphate metabolic process
    6. purine nucleoside monophosphate metabolic process
    7. purine nucleoside monophosphate biosynthetic process
    8. purine ribonucleoside monophosphate biosynthetic process
    9. arginine metabolic process
    10. cellular nitrogen compound biosynthetic process
  • Searching through MAPPFinder for published genes:
    • VC0028: metal ion binding, iron-sulfur cluster binding, 4 iron 4 sulfur cluster binding, catalytic activity, lyase activity
    • VC0941: pyridoxal phosphate binding, catalytic activity, transferase activity, glycine hydroxymethyltransferase activity
    • VC0869: nucleotide binding, ATP binding, catalytic activity, lyase activity
    • VC0051: nucleotide binding, ATP binding, catalytic activity, lyase activity, carboxy-lyase activity
    • VC0647: 3'-5'-exoribonuclease activity, transferase activity, nucleotidyltransferase activity, polyribonucleotide nucleotidyltransferase activity
    • VC0468: metal ion binding, nucleotide binding, ATP binding, catalytic activity, ligase activity, glutathione synthase activity
    • VC2350: catalytic activity, lyase activity, deoxyribose-phosphate aldolase activity
    • VCA0583: outer membrane-bounded periplasmic space
  • The GO term I clicked on was outer membrane-bounded periplasmic space
    • The gene I was looking for was VCA0583, which corresponds to Q9KM06 on UniProt
    • This gene did not have a significant change in expression
    • This gene is annotated with the GO terms transporter activity, transport, and outer membrane-bounded periplasmic space
  • TODO: Look at the top of the spreadsheet. There are rows of information that give you the background information on how MAPPFinder made the calculations. Compare this information with your partner who used a different version of the Vibrio Gene Database. Which numbers are different? Why are they different? Record this information in your individual journal entry.
  • TODO: Interpret your results. Look up the definitions for any GO terms that are unfamiliar to you. The "official" definitions for GO terms can be found at http://www.geneontology.org. You can use one of the online biological dictionaries as a supplement, if needed. Write a paragraph relating the results of this GO analysis to the experiment performed (comparing laboratory-grown and patient-derived Vibrio cholerae). You need to give a biological interpretation: what does each of these GO terms in your filtered list have to do with the pathogenicity of the bacterium? You may consult with your partner on this, but your explanation on your individual journal page needs to be in your own words. This is where the real "brain power" comes in with interpreting DNA microarray data. Even experienced scientists struggle with this part. Use your creativity as a scientist to stretch your brain in this question.

Other Links

User Page: Anindita Varshneya
Class Page: BIOL/CMSI 367: Biological Databases, Fall 2015
Group Page: GÉNialOMICS

Assignment Pages

Week 1 Assignment
Week 2 Assignment
Week 3 Assignment
Week 4 Assignment
Week 5 Assignment
Week 6 Assignment
Week 7 Assignment
Week 8 Assignment
Week 9 Assignment
Week 10 Assignment
Week 11 Assignment
Week 12 Assignment
No Week 13 Assignment
Week 14 Assignment
Week 15 Assignment

Individual Journals

Individual Journal Week 2
Individual Journal Week 3
Individual Journal Week 4
Individual Journal Week 5
Individual Journal Week 6
Individual Journal Week 7
Individual Journal Week 8
Individual Journal Week 9
Individual Journal Week 10
Individual Journal Week 11
Individual Journal Week 12
Individual Journal Week 14
Individual Journal Week 15

Shared Journals

Class Journal Week 1
Class Journal Week 2
Class Journal Week 3
Class Journal Week 4
Class Journal Week 5
Class Journal Week 6
Class Journal Week 7
Class Journal Week 8
Class Journal Week 9
GÉNialOMICS Journal Week 10
GÉNialOMICS Journal Week 11
GÉNialOMICS Journal Week 12
GÉNialOMICS Journal Week 14
GÉNialOMICS Journal Week 15