LMU BioDB 2015 - User contributions [en]

GENialOMICS Deliverables

2015-12-18T23:49:18Z

Vpachec3: /* Individual Assessments and Reflections */ added reflection

[[Image: Genialomics-banner.jpg | center | 1055px]]
 

= Group Files and Datasets =

* [[Media:Bc-Std GEN Build4 20151204.zip|GenMAPP Gene Database for assigned species (''.gdb'') (compressed)]]
* [[Media:ReadMe_Bc-Std_GEN_Build4_20151214_final.pdf|ReadMe file to accompany the Gene Database (''.pdf'')]]
** [[Media: Genialomics-DatabaseSchema-20151211.pdf|Gene Database Schema Diagram]]
* [[media:GÉNialOMICS_Gene_Database_Testing_Report_(Build_4_Export)_-_LMU_BioDB_2015.pdf|Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)]]
* [[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')]]
* [[media:For_genMAPP_KWVP20151205.txt|Data file used for import into GenMAPP (''.txt'' or ''.csv'')]]
* [[media:KWVP20151205.gex|GenMAPP Expression Dataset file (''.gex'')]]
* [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file of data imported into GenMAPP (''.EX.txt'')]]
* Raw MAPPFinder results files (''-GO.txt'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|Increase]]
** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|Decrease]]
* [[media:KWVP20151205.gmf|''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.xlsx|Increase]]
** [[media:Vpkwmappfinder20151205-Criterion1-GO-decreased.xlsx|Decrease]]
*[[media:Oxphosmappkwvp20151212.mapp|Sample MAPP file of a relevant biological pathway for your species (''.mapp'')]]
* [[Media:BioDBFinalReport-Genialomics.pdf | Group Report]]
* [[Media:Genialomics-BioDBFinalPresentation.pdf|Final PowerPoint presentation]]

==Individual Assessments and Reflections==
*[[Media:ReflectionForFinalProject-AV.pdf | Anindita Varshneya]]
*[[Media:KWreflection.pdf | Kevin Wyllie]]
*[[Blitvak Individual Assessment and Reflection| Brandon Litvak]]
*[[Media:Veronica Pacheco.pdf|Veronica Pacheco]]

 

{{Template:GÉNialOMICS}}

File:Veronica Pacheco.pdf

2015-12-18T23:48:00Z

Vpachec3:

GENialOMICS

2015-12-15T07:32:16Z

Vpachec3: /* Individual Goals and Progress */ added mapp file to table

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* We understood the experimental design
* Made a chart for the microarray experiment
* Finished PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* We understood the experimental design
* Made a chart for the microarray experiment
* Finished PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, a positive integer in parentheses (starting from 1) will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2) for three different versions of the same file uploaded by Brandon Litvak during Week 11).
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
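
The naming convention above can be sketched in Python. This is only an illustration, not a tool the group used: the helper names <code>gen_filename</code> and <code>original_files_archive</code> are hypothetical, the example filename and date are made up, and keeping the original file extension at the end of the new name is an assumption (the convention itself does not say where the extension goes).

```python
from datetime import date
from pathlib import Path


def gen_filename(original: str, initials: str, week: int, day: date, version: int = 0) -> str:
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd for a file (hypothetical helper).

    `version` is 0 for the first file of the day; additional versions of the
    same file on the same day get "(1)", "(2)", ... appended.
    Assumption: the original extension is kept at the very end.
    """
    path = Path(original)
    name = f"{path.stem}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name + path.suffix


def original_files_archive(initials: str, week: int) -> str:
    """Name of the zip holding all unmodified originals: ORIG_GEN_(Initials)(Week#)."""
    return f"ORIG_GEN_{initials}{week}.zip"


if __name__ == "__main__":
    example_day = date(2015, 11, 17)  # made-up date, for illustration only
    for version in range(3):
        print(gen_filename("export.xml", "BL", 11, example_day, version))
    print(original_files_archive("BL", 11))
```

Generating the names from one function keeps the initials, week number, and date in the same order for every group member, which is the point of the convention.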
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing the prototype version of GenMAPP Builder is on the development machine.
** Set up Eclipse and Java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that needed to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized TallyEngine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared with Anu, assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has not worked, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, does not sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria, which result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. As for statistical significance, we should perhaps consider raising our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to show a significant change. But maybe this is not too low a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the actual values differed greatly.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double-checked, we sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason was that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that worked fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains is to synthesize the work done in a paper and presentation. As a group, we got little work done on the final deliverables (which should be the focus for this week), but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Week 15=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Fix any problems with the build
* Finish presentation
* Start paper
|
* Work with Anu to fix any issues with the final gene database
* Finish presentation
* Prepare for presentation
* Work on final paper
* Conduct final checks on last export (check testing report, make any adjustments)
|
''Work with Kevin Wyllie''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|
''Work with Veronica Pacheco''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|-
!scope="row"|'''Progress'''
|
* Created a new build to fix problem in TallyEngine.
* Contributed to final presentation and practiced speaking.
* Began drafting final report of findings from this project.
* Created ReadMe file.
* Modified Gene Database Schema for B. cenocepacia.
|
* Checked final export: everything appears in order
* Checked new build of GenMAPP builder
* Realized why the UniProt count of 6994 is so different from the one reported by our database (difference between protein entry/gene name)
* Finalized readme
* Helped in finalizing the slides
* Practiced presentation
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 15]]
|
*[[Blitvak Week 15]]
|
*[[Vpachec3 Week 15]]
|
*[[Kevin Wyllie Week 15]]
|-
!scope="row"|'''Files Used/Created'''
|
*[[Media:ReadMe_Bc-Std_GEN_Build4_20151204.doc.zip | readMe_Bc-Std_GEN_Build4_20151204.doc.zip]] 
*[[Media:gmbuilder-genialomics-20151210-build-5.zip | gmbuilder-genialomics-20151210-build-5.zip]] 
*[[Media:Genialomics-DatabaseSchema-20151211.pdf | Genialomics-DatabaseSchema-20151211.pdf]]
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
*[[Media:ReadMe Bc-Std GEN Build4 20151214 final.pdf| Final Readme]]
*[[Media:GÉNialOMICS Gene Database Testing Report (Build 4 Export) - LMU BioDB 2015.pdf| Final Testing Report, as PDF]]
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
|
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
*[[Media:Oxphosmappkwvp20151212.mapp| MAPP ]]
|
* Generated sanity check table.
* Uploaded deliverables.
*[[Media:BioDB_Final_Presentation_GEN_20151214.pdf| Final Presentation Slides (.pdf)]]
*[[Media:Oxphosmappkwvp20151212.mapp| MAPP]]
|}

==Other Progress==
* On December 14, 2015, GENialOMICS completed a presentation summarizing their methods and findings.
** [[Media:Genialomics-BioDBFinalPresentation.pdf | GENialOMICS Final Presentation]]

=Deliverables=
*See [[GENialOMICS Deliverables|GENialOMICS Deliverables]]
=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients. Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it (commercial for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.biomedcentral.com/ BioMedCentral], or predatory open access organization)? See the list of [http://oaspa.org/membership/members/ Open Access Scholarly Publishers Association members]. '''American Society for Microbiology, which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but viewing the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does the microarray data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Vpachec3 Week 15

2015-12-15T07:17:48Z

Vpachec3: /* MAPP Making */ fixed images

== MAPPFinder Procedure ==
# MAPPFinder was launched from the GenMAPP window: "Tools" > "MAPPFinder".
# Since the criterion for the Expression Dataset file had already been made, we were able to choose this file for the set criterion; our .gdb file was already located in GenMAPP.
# Under the "Select Color Set" field, "KWVP_20151205" was selected.
# After the color set had been selected, under the "Select Criteria to filter by" field, either increase or decrease was chosen. I ran the decreased criterion and Kevin ran the increased criterion.
# The boxes for "Gene Ontology" and "Click here to calculate p values..." were checked.
# The location to save the resulting .txt file was selected with the "Browse" option toward the bottom of the window.
# "Run MAPPFinder" was selected to generate the Gene Ontology results. Clicking on the tree for a given term should have generated a MAPP. However, the program was not responding, so we had to make our own MAPP.

'''BEFORE creating the MAPP, we wanted to filter the GO terms in Excel to narrow down which pathway we would want to map out.'''

== Filtered GO List ==
DECREASED INFO
*The GO term files above were opened in Excel, and the following filters were placed on the given columns:
** Z score: greater than 2
** PermuteP: less than 0.05
** Number changed: greater than or equal to 5 AND less than 100
** Percent changed: greater than or equal to 25
* Values in the following columns were recorded:
** Number Changed
** Number Measured
** Number in GO
** Percent Changed
** Percent Present
** PermuteP
** AdjustedP
* Some of the above filter criteria had to be adjusted to attain 16 non-redundant GO terms.
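The filters listed above were applied by hand in Excel. As a rough illustration only, the same four criteria could be sketched in Python as follows. The column names (<code>Z Score</code>, <code>PermuteP</code>, <code>Number Changed</code>, <code>Percent Changed</code>), the function names, and the example rows are assumptions made for this sketch; they are not taken from the actual MAPPFinder output files, which may use different headers and include extra header lines.

```python
import csv
import io

# Thresholds taken from the filter list above.
Z_MIN = 2.0
PERMUTE_P_MAX = 0.05
NUM_CHANGED_MIN = 5
NUM_CHANGED_MAX = 100
PERCENT_CHANGED_MIN = 25.0


def passes_filters(row):
    """Return True if one GO-term row meets all four criteria."""
    return (
        float(row["Z Score"]) > Z_MIN
        and float(row["PermuteP"]) < PERMUTE_P_MAX
        and NUM_CHANGED_MIN <= int(row["Number Changed"]) < NUM_CHANGED_MAX
        and float(row["Percent Changed"]) >= PERCENT_CHANGED_MIN
    )


def filter_go_terms(tsv_text):
    """Parse tab-delimited text and keep only the rows that pass the filters."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if passes_filters(row)]


# Made-up example rows (hypothetical, not real MAPPFinder results).
EXAMPLE = (
    "GO Name\tZ Score\tPermuteP\tNumber Changed\tPercent Changed\n"
    "term A\t3.1\t0.01\t12\t40.0\n"
    "term B\t1.5\t0.20\t30\t35.0\n"
    "term C\t2.8\t0.03\t3\t60.0\n"
)

kept = filter_go_terms(EXAMPLE)
print([row["GO Name"] for row in kept])  # prints ['term A']
```

In this made-up example, "term B" fails the Z score and PermuteP filters and "term C" has too few changed genes, so only "term A" is kept.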

==MAPP Making==
* We decided to map the oxidative phosphorylation pathway. We chose it because it was on the list of the top ten GO terms for decreased. Also, Van Acker et al. (2013) noted a decrease in the electron transport chain, and part of this correlation was with the oxidative phosphorylation pathway.

* Using the KEGG [http://www.genome.jp/kegg/pathway.html] website, we looked up our bacterium, ''Burkholderia cenocepacia J2315'', and found the oxidative phosphorylation pathway.
[[File:Screen Shot 2015-12-14 at 11.10.55 PM.png]]

We wanted to mirror the format of this page as closely as possible in GenMAPP to see whether the genes in this pathway were decreased. We predicted that the genes would be decreased because of the results reported in the paper and, more so, our GO terms.

[[File:Oxphosmapp.png]]

Our map shows that the genes were in fact decreased.

==Sanity Check Table Update==
A new sanity check table was also made to summarize the different p-values tested. We decided to use the BH p-value; the table also shows the other criteria we could have used. The table is on [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Kevin_Wyllie_Week_15 Kevin's Week 15 page].
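For readers unfamiliar with the BH (Benjamini-Hochberg) p-value, the following is a minimal Python sketch of the standard adjustment and of the kind of count a sanity check table reports. It is not the procedure we used (our calculations were done in Excel); the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_below(values, threshold):
    """Count how many values fall strictly below the threshold."""
    return sum(1 for v in values if v < threshold)


# Made-up p-values for four hypothetical genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
bh_p = benjamini_hochberg(raw_p)

print(count_below(raw_p, 0.05))  # unadjusted p < 0.05
print(count_below(bh_p, 0.05))   # BH-adjusted p < 0.05
```

With these made-up values, three genes pass the unadjusted cutoff but only one passes after the BH adjustment, which is why the choice of p-value matters for how many genes count as significantly changed.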

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]


Vpachec3 Week 15

2015-12-15T07:17:04Z

Vpachec3: added more content

== MAPPFinder Procedure ==
# MAPPFinder was launched from the GenMAPP window: "Tools" > "MAPPFinder".
# Since the criterion for the Expression Dataset file had already been made, we were able to choose this file for the set criterion, and our .gdb file was already loaded in GenMAPP.
# Under the "Select Color Set" field, "KWVP_20151205" was selected.
# After the color set was selected, under the "Select Criteria to filter by" field, either increase or decrease was chosen. I ran the decreased criterion and Kevin ran the increased criterion.
# The boxes for "Gene Ontology" and "Click here to calculate p values..." were checked.
# The location to save the resulting .txt file was selected with the "Browse" option toward the bottom of the window.
# "Run MAPPFinder" was selected to generate the gene ontology results. Clicking on the tree for a given term should have generated a MAPP. However, the program was not responding, so we had to make our own MAPP.

'''BEFORE creating the MAPP, we filtered the GO terms in Excel to narrow down which pathway we would map out.'''

== Filtered GO List ==
DECREASED INFO
*The GO term files above were opened in Excel, and the following filters were placed on the given columns:
** Z score: greater than 2
** PermuteP: less than 0.05
** Number changed: greater than or equal to 5 AND less than 100
** Percent changed: greater than or equal to 25
* Values in the following columns were recorded:
** Number Changed
** Number Measured
** Number in GO
** Percent Changed
** Percent Present
** PermuteP
** AdjustedP
* Some of the above filter criteria had to be adjusted to attain 16 non-redundant GO terms.

==MAPP Making==
* We decided to map the oxidative phosphorylation pathway. We chose it because it was on the list of the top ten GO terms for decreased. Also, Van Acker et al. (2013) noted a decrease in the electron transport chain, and part of this correlation was with the oxidative phosphorylation pathway.

* Using the KEGG [http://www.genome.jp/kegg/pathway.html] website, we looked up our bacterium, ''Burkholderia cenocepacia J2315'', and found the oxidative phosphorylation pathway.
[[File:Screen Shot 2015-12-14 at 11.10.55 PM.png]]

We wanted to mirror the format of this page as closely as possible in GenMAPP to see whether the genes in this pathway were decreased. We predicted that the genes would be decreased because of the results reported in the paper and, more so, our GO terms.

[[File:Oxphosmapp.png]]

Our map shows that the genes were in fact decreased.

==Sanity Check Table Update==
A new sanity check table was also made to summarize the different p-values tested. We decided to use the BH p-value; the table also shows the other criteria we could have used. The table is on [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Kevin_Wyllie_Week_15 Kevin's Week 15 page].

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]


File:Oxphosmapp.png

2015-12-15T07:12:39Z

Vpachec3:

File:Screen Shot 2015-12-14 at 11.10.55 PM.png

2015-12-15T07:07:08Z

Vpachec3:

Vpachec3 Week 15

2015-12-15T06:46:29Z

Vpachec3: /* MAPPFinder Procedure */ edited wording

=== MAPPFinder Procedure ===
# MAPPFinder was launched from the GenMAPP window: "Tools" > "MAPPFinder".
# Since the criterion for the Expression Dataset file had already been made, we were able to choose this file for the set criterion, and our .gdb file was already loaded in GenMAPP.
# Under the "Select Color Set" field, "KWVP_20151205" was selected.
# After the color set was selected, under the "Select Criteria to filter by" field, either increase or decrease was chosen. I ran the decreased criterion and Kevin ran the increased criterion.
# The boxes for "Gene Ontology" and "Click here to calculate p values..." were checked.
# The location to save the resulting .txt file was selected with the "Browse" option toward the bottom of the window.
# "Run MAPPFinder" was selected to generate the gene ontology results. Clicking on the tree for a given term should have generated a MAPP. However, the program was not responding, so we had to make our own MAPP.

=== Filtered GO List ===

*The GO term files above were opened in Excel, and the following filters were placed on the given columns:
** Z score: greater than 2
** PermuteP: less than 0.05
** Number changed: greater than or equal to 4 AND less than 100
** Percent changed: greater than or equal to 15
* Values in the following columns were recorded:
** Number Changed
** Number Measured
** Number in GO
** Percent Changed
** Percent Present
** PermuteP
** AdjustedP
* Some of the above filter criteria had to be adjusted to attain 16 non-redundant GO terms. A list of these terms is shown below. For those terms that did not exactly fit the initial criteria, the altered criteria are apparent in the values in the table.

==MAPP Making==
TO DO

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]


Vpachec3 Week 15

2015-12-15T00:25:03Z

Vpachec3: added new sections

=== MAPPFinder Procedure ===
# MAPPFinder was launched from the GenMAPP window: "Tools" > "MAPPFinder".
# "Calculate New Results" was selected from the MAPPFinder window.
# Since the .gex file was already generated last week, "Find File" was selected and the file was located in its directory.
# Under the "Select Color Set" field, "KWVP_20151205" was selected (the color set's name in this case is the same as the .gex file's name - and there's only one color set in the .gex file).
# After the color set has been selected, under the "Select Criteria to filter by" field, either increase or decrease was chosen.
# The boxes for "Gene Ontology" and "Click here to calculate p values..." were checked.
# The location to save the resulting .txt file was selected with the "Browse" option toward the bottom of the window.
# "Run MAPPFinder" was selected to generate the gene ontology tree (and the aforementioned .txt files).
#* The files generated can be found here:
#** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|For downregulated GO terms.]]
#** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|For upregulated GO terms.]]

=== Filtered GO List ===

*The GO term files above were opened in Excel, and the following filters were placed on the given columns:
** Z score: greater than 2
** PermuteP: less than 0.05
** Number changed: greater than or equal to 4 AND less than 100
** Percent changed: greater than or equal to 15
* Values in the following columns were recorded:
** Number Changed
** Number Measured
** Number in GO
** Percent Changed
** Percent Present
** PermuteP
** AdjustedP
* Some of the above filter criteria had to be adjusted to attain 16 non-redundant GO terms. A list of these terms is shown below. For those terms that did not exactly fit the initial criteria, the altered criteria are apparent in the values in the table.

==MAPP Making==
TO DO

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]


GENialOMICS

2015-12-15T00:02:49Z

Vpachec3: /* Week 15 */ added goals and progress

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files used in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, a positive integer in parentheses, starting from 1, will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2) for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
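
The naming rules above can be illustrated with a small Python sketch. The helper names (<code>project_filename</code>, <code>originals_archive_name</code>) and the example date are hypothetical and only show how the pattern is assembled; the group renamed files by hand, and the handling of file extensions is left out because the rules above do not specify it.

```python
from datetime import date


def project_filename(original, initials, week, day, version=0):
    """Build a name of the form XXXX_GEN_(Initials)(Week#)_yyyymmdd.

    version 0 is the first copy used that day; additional copies get
    a "(1)", "(2)", ... suffix.
    """
    if version < 0:
        raise ValueError("version must be zero or a positive integer")
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name


def originals_archive_name(initials, week):
    """Name of the zip of unmodified originals: ORIG_GEN_(Initials)(Week#)."""
    return f"ORIG_GEN_{initials}{week}"


# Hypothetical example: Brandon Litvak (BL), Week 11, made-up date.
example_day = date(2015, 11, 12)
print(project_filename("XXXX", "BL", 11, example_day))             # XXXX_GEN_BL11_20151112
print(project_filename("XXXX", "BL", 11, example_day, version=2))  # XXXX_GEN_BL11_20151112(2)
print(originals_archive_name("BL", 11))                            # ORIG_GEN_BL11
```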
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
* It was found that the gene names of interest existed under the label of "ORF" in the XML file (explained why they weren't captured by GenMAPP builder)
* Created a streamlined general expression that can capture all of the IDs of interest; shared with Anu, assisted in the creation of Build 3/4 of the modified GenMAPP builder
* Found that XMLPipeDB Match gave 6 extra counts using the general expression for the IDs; an analysis with the Excel MATCH function was conducted and the discrepant IDs were found
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. I think the only idea I have moving forward is a little bit more planning in regard to how we plan to attack the writing and presentation portion of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has not worked, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and if there is no source of error on our part, we continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason had to do with the fact that GenMAPP builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that happened to work fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains, is to synthesize the work done in a paper and presentation. As a group, we did get little work done on the final deliverables (which should be the focus, for this week) but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Week 15=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Fix any problems with the build
* Finish presentation
* Start paper
|
* Work with Anu to fix any issues with the final gene database
* Finish presentation
* Prepare for presentation
* Work on final paper
* Conduct final checks on last export (check testing report, make any adjustments)
|
''Work with Kevin Wyllie''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|
''Work with Veronica Pacheco''
* Finish example MAPPs (one for increase and decrease).
* Create sanity check table (with P-values).
* Upload deliverables.
* Begin PowerPoint.
* Begin report.
|-
!scope="row"|'''Progress'''
|
* Created a new build to fix problem in TallyEngine
|
* Checked final export: everything appears in order
* Checked new build of GenMAPP builder
* Realized why the UniProt count of 6994 is so different from the one reported by our database (difference between protein entry/gene name)
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|
* Finished example MAPP (decided to choose one pathway, oxidative phosphorylation).
* Created sanity check table (with P-values)
* Uploaded deliverables
* Began and finished PowerPoint.
* Began report.
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 15]]
|
*[[Blitvak Week 15]]
|
*[[Vpachec3 Week 15]]
|
*[[Kevin Wyllie Week 15]]
|-
!scope="row"|'''Files Used/Created'''
|
[[Media:ReadMe_Bc-Std_GEN_Build4_20151204.doc.zip | readMe_Bc-Std_GEN_Build4_20151204.doc.zip]] 
[[Media:gmbuilder-genialomics-20151210-build-5.zip | gmbuilder-genialomics-20151210-build-5.zip]] 
[[Media:Genialomics-DatabaseSchema-20151211.pdf | Genialomics-DatabaseSchema-20151211.pdf]]
|
|
|
* Generated sanity check table.
* Upload deliverables.
|}

==Other Progress==
* On December 14, 2015, GENialOMICS completed a presentation summarizing their methods and findings.
** [[Media:Genialomics-BioDBFinalPresentation.pdf | GENialOMICS Final Presentation]]

=Deliverables=
*See [[GENialOMICS Deliverables|GENialOMICS Deliverables]]
=== Group Deliverables ===
* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first accessed this article through Web of Science, which LMU pays for, but access through PubMed, PubMed Central, and the publisher web site was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GENialOMICS Deliverables

2015-12-13T01:23:58Z

Vpachec3: /* Group Files and Datasets */ added decrease filtered spreadsheet

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Include Gene Database Schema diagram in ReadMe
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* [[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')]]
* [[media:For_genMAPP_KWVP20151205.txt|Data file used for import into GenMAPP (''.txt'' or ''.csv'')]]
* [[media:KWVP20151205.gex|GenMAPP Expression Dataset file (''.gex'')]]
* [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file of data imported into GenMAPP (''.EX.txt'')]]
* Raw MAPPFinder results files (''-GO.txt'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|Increase]]
** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|Decrease]]
* [[media:KWVP20151205.gmf|''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.xlsx|Increase]]
** [[media:Vpkwmappfinder20151205-Criterion1-GO-decreased.xlsx|Decrease]]
*[[media:Oxphosmappkwvp20151212.mapp|Sample MAPP file of a relevant biological pathway for your species (''.mapp'')]]
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

File:Vpkwmappfinder20151205-Criterion1-GO-decreased.xlsx

2015-12-13T01:23:07Z

Vpachec3:

GENialOMICS Deliverables

2015-12-13T01:21:02Z

Vpachec3: /* Group Files and Datasets */ added .mapp file

== Group Files and Datasets ==

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Include Gene Database Schema diagram in ReadMe
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* [[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')]]
* [[media:For_genMAPP_KWVP20151205.txt|Data file used for import into GenMAPP (''.txt'' or ''.csv'')]]
* [[media:KWVP20151205.gex|GenMAPP Expression Dataset file (''.gex'')]]
* [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file of data imported into GenMAPP (''.EX.txt'')]]
* Raw MAPPFinder results files (''-GO.txt'')
** [[media:KWVP_MAPPfinder_20151208-decrease-GO.txt|Decrease]]
** [[media:KWVP_MAPPfinder_20151208-increase-GO.txt|Increase]]
* [[media:KWVP20151205.gmf|''.gmf'' file]]
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
** [[media:KWVP_MAPPfinder_20151208-increase-GO.xlsx|Increase]]
** Decrease
*[[media:Oxphosmappkwvp20151212.mapp|Sample MAPP file of a relevant biological pathway for your species (''.mapp'')]]
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

File:Oxphosmappkwvp20151212.mapp

2015-12-13T01:18:56Z

Vpachec3:

Vpachec3 Week 14

2015-12-08T22:35:49Z

Vpachec3: /* Sunday, December 6 */ fixed name

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process, because Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are considered technical replicates, and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the cell for each replicate of a gene individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

All of this is on the first worksheet, named "quadruplicate_spots_separated".
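
The averaging step above can be sketched outside of Excel as follows. This is a minimal illustration, not the spreadsheet itself: the function name and the four spot values are hypothetical, and the only behavior borrowed from Excel is that AVERAGE skips empty cells.

```python
from statistics import mean

def average_technical_replicates(spot_values):
    """Average the replicate spots for one gene on one chip.

    Missing spots (None) are skipped, as Excel's AVERAGE skips empty cells.
    """
    present = [v for v in spot_values if v is not None]
    return mean(present)

# Hypothetical values for one gene spotted four times on one chip.
spots = [0.50, 0.25, 0.75, 0.50]
consolidated = average_technical_replicates(spots)
print(consolidated)
```

Applying the same function to every gene on every chip would give one consolidated column per chip, matching the 5 biofilm and 3 tobramycin columns described above.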

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA, so we need the ratio of the tobramycin average to the biofilm (control) average. Subtracting the biofilm average from the tobramycin average gives this ratio because the values are in log space.
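
A small numeric check of why subtraction works in log space. The two intensities are hypothetical, and base 2 is an assumption, since the journal does not state which logarithm base the data use.

```python
import math

# Hypothetical averages on a log2 scale (the base is an assumption).
biofilm_avg = math.log2(8.0)       # control
tobramycin_avg = math.log2(16.0)   # treatment

# Subtracting logs is the same as taking the log of the ratio.
log_ratio = tobramycin_avg - biofilm_avg
fold_change = 2 ** log_ratio

print(log_ratio, fold_change)
```

A positive difference means the gene is higher in the tobramycin samples than in the control, and a negative difference means it is lower.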

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test with unequal variance (the last two arguments, 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.
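
For readers who want to see what sits behind that test, here is a sketch of the unequal-variance (Welch) t statistic and its degrees of freedom using only the Python standard library. The sample values are hypothetical. The p value itself is not computed, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se2 = va / na + vb / nb                          # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical values: 5 biofilm samples and 3 tobramycin samples for one gene.
biofilm = [1.0, 2.0, 3.0, 4.0, 5.0]
tobramycin = [6.0, 7.0, 8.0]
t_stat, dof = welch_t(biofilm, tobramycin)
print(t_stat, dof)
```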

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculate the p values with the Bonferroni adjustment.

Columns copied and pasted from previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column is different from the previous column (although it is named almost the same way) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1. Values of 1 or less keep their original value in the new column.
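
The two Bonferroni columns above can be combined into one step. This is a sketch with hypothetical p values. The multiplier 7252 is taken from the spreadsheet formula above, and the function name is an assumption.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p value by the number of tests and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical unadjusted p values; 7252 is the multiplier used in the worksheet.
unadjusted = [0.00001, 0.001, 0.5]
adjusted = bonferroni(unadjusted, 7252)
print(adjusted)
```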

*Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was filled down for the rest of the cells using the black box shortcut. This is the first of the new columns in this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)
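
For comparison, the sketch below shows the textbook Benjamini-Hochberg adjustment, in which each sorted p value is multiplied by the number of tests and divided by its rank, and larger ranks are never allowed to end up with a smaller adjusted value. This is an assumption about the intended calculation, not a transcription of the worksheet formulas above, and the p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p value down so each value is at most the next one.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values.
bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(bh)
```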

*The last worksheet we added was named "forGenMAPP". It was named this way so that GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. Between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to "N".
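
The SystemCode step can be sketched as a small export routine. The gene IDs, values, and the reduced set of columns are hypothetical placeholders, and writing tab-delimited text is an assumption based on the ''.txt'' import file listed in the deliverables.

```python
import csv
import io

def add_system_code(rows, code="N"):
    """Insert a SystemCode value right after the ID in each data row."""
    return [[row[0], code] + list(row[1:]) for row in rows]

# Hypothetical rows: ID, Biofilm_Tobramycin_ratio, Pvalue.
header = ["ID", "SystemCode", "Biofilm_Tobramycin_ratio", "Pvalue"]
rows = [["gene_A", 0.31, 0.02], ["gene_B", -0.12, 0.40]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(add_system_code(rows))
text = buffer.getvalue()
print(text)
```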

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
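
The counts and percentages above can be reproduced with a short helper. The counts and the total of 7251 are the ones reported above; the function names and the short list of p values are hypothetical.

```python
TOTAL_GENES = 7251  # total used for the percentages above

def count_below(p_values, cutoff):
    """Count how many p values fall strictly below the cutoff."""
    return sum(1 for p in p_values if p < cutoff)

def percent_of_total(count, total=TOTAL_GENES):
    """Express a count as a whole-number percentage of the total."""
    return round(100 * count / total)

# Counts reported above for each unadjusted p value cut-off.
reported_counts = {0.05: 4318, 0.01: 2971, 0.001: 1460, 0.0001: 645}
percentages = {cutoff: percent_of_total(n) for cutoff, n in reported_counts.items()}
print(percentages)

# Hypothetical p values, to show the filter itself.
print(count_below([0.2, 0.04, 0.009], 0.05))
```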

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because it represents about a 20% fold change which is about the level of detection of this technology.)
*1519 genes which is 21%
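
The combined filters used above (p value cut-off plus log fold change cut-off) can be sketched as follows. The gene records are hypothetical, and reading a log fold change of 0.25 as roughly a 20% change assumes the data are on a log2 scale.

```python
def significant_changes(genes, p_cutoff=0.05, fold_cutoff=0.25):
    """Split genes passing the p value cut-off into increased and decreased lists."""
    increased = [g for g in genes
                 if g["p"] < p_cutoff and g["log_ratio"] > fold_cutoff]
    decreased = [g for g in genes
                 if g["p"] < p_cutoff and g["log_ratio"] < -fold_cutoff]
    return increased, decreased

# Hypothetical gene records.
genes = [
    {"id": "gene_A", "log_ratio": 0.80, "p": 0.01},   # increased
    {"id": "gene_B", "log_ratio": -0.60, "p": 0.03},  # decreased
    {"id": "gene_C", "log_ratio": 0.10, "p": 0.01},   # change too small
    {"id": "gene_D", "log_ratio": 1.20, "p": 0.20},   # p value too large
]
increased, decreased = significant_changes(genes)
print([g["id"] for g in increased], [g["id"] for g in decreased])

# On a log2 scale, 0.25 corresponds to about a 1.19-fold change.
print(2 ** 0.25)
```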

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols] used in the Vibrio run we did in week 8.

*NOTE: I ran the data looking at the tree through the DECREASED perspective while Kevin was through the INCREASED perspective.

[[File:Screenshoterrorwindow12052015.png|800px]]

Caption: This screenshot shows the window of errors produced once the data was run through GenMAPP.

[[File:Expandedtree20151205.png|800px]]

Caption: This is the initial tree that comes up after putting in the criterion in GenMAPP.

[[File:Rankedgeneontologyresults20151205.png|800px]]

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

We took the data from the Van Acker et al. paper, which is presented in Table 5 [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943].

We transferred the table onto Excel.

We inserted a column between the "Van Acker et al" column and the "qPCR" column and called it "US".
NOTE: This spreadsheet is still being worked on, so the headers are not yet set to these names.

In the "US" column we inserted the "Biofilm_Tobramycin_ratio" values from the forGenMAPP worksheet for the corresponding gene.

The values we inserted were chosen at random, and the column is not yet complete. We stopped because Kevin and I noticed that there are large discrepancies between the values we have transformed and the ones reported. Although the directions seem to be almost complementary, the values themselves are very different.

At this point, we emailed Dr. Dahlquist and Dr. Dionisio for further instruction regarding our concern.

UPDATE: After we received a response to our conundrum, the sanity check is now complete thanks to the diligent work of [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Kwyllie Kevin Wyllie].

===Sanity Check: Compare individual genes with known data===
OLD VERSION: [[File:KWVP sanitycheck 20151206.xlsx]]

UPDATED VERSION: [[file:KWVP_Sanitycheck.png|center]]

In the spreadsheet, the values considered significant are marked in '''red'''.

==Links==

[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]

GENialOMICS

2015-12-08T22:35:10Z

Vpachec3: /* Veronica Pacheco */ fixed name

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; the MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to any additional versions (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
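
The naming convention above can be expressed as a small helper. This is only a sketch: the function names and the example date are hypothetical, and the pattern is taken directly from the rules listed above.

```python
from datetime import date

def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, adding (n) for later versions."""
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    return name if version == 0 else f"{name}({version})"

def original_files_archive(initials, week):
    """Build the name of the zip holding all original, unmodified files."""
    return f"ORIG_GEN_{initials}{week}"

# Hypothetical upload date during Week 11.
upload_day = date(2015, 11, 12)
first = project_filename("XXXX", "BL", 11, upload_day)
second = project_filename("XXXX", "BL", 11, upload_day, version=1)
archive = original_files_archive("BL", 11)
print(first, second, archive)
```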
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Create testing reports on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF ID's.
* Finished fourth build.
|
* Found that the gene names of interest were stored under the "ORF" label in the XML file, which explained why GenMAPP Builder had not captured them
* Created a streamlined general expression that captures all of the IDs of interest; shared it with Anu and assisted in the creation of Builds 3 and 4 of the modified GenMAPP Builder
* Found that XMLPipeDB Match gave 6 extra counts when using the general expression for the IDs; an analysis with the Excel MATCH function identified the discrepant IDs
* Completed gene database test reports for builds 2, 3, and 4
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki.
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded it to the wiki.
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
*[[GÉNialOMICS Gene Database Testing Report (Build 2 Export)|Build 2 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 3 Export)|Build 3 Export Testing Report]]
*[[GÉNialOMICS Gene Database Testing Report (Build 4 Export)|Build 4 Export Testing Report]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:GEN BL14 20151201.zip|Files from work done on 12/01/15]]
*[[Media:GEN BL14 20151203.zip|Files from work done on 12/03/15]]
*[[Media:GEN BL14 20151204.zip|Files from work done on 12/04/15]]
*[[Media:GEN BL14 20151207.zip|Files from work done on 12/07/15]]
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we will attack the writing and presentation portions of the project. I am not concerned about us getting it done on time and with good quality; I just want us to create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has not worked, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria that result in 30% of the genes seeing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. And for statistical significance, we potentially should reconsider heightening our BH P-value threshold above 0.05, as currently we're only considering about 8% of the genes to see a significant change. But maybe this is not too low of a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

===Brandon Litvak===
#What worked?
#*I think a lot of things worked this week. Teamwork and communication were a great help in getting the bulk of this week's work done. The initial exported database was not working as planned and, as a team, we discovered that the reason was that GenMAPP Builder was utilizing the wrong type of gene name; this knowledge allowed us to create builds this week that worked fairly well. These new builds covered the gene names of interest and led to a relatively small number of errors in GenMAPP. I think that, above all, the thing that worked best this week was my team. We were able to communicate and collaborate very well.
#What didn't work?
#*At the present moment, I can't really think of things that really did not work. With respect to the gene database project for J2315, everything appears to be on track; I would say that the major problems that were encountered in Week 14 were resolved. I feel that all of the major work for the project is complete; all that remains, is to synthesize the work done in a paper and presentation. As a group, we did get little work done on the final deliverables (which should be the focus, for this week) but we did get a lot of valuable work done for the project. We haven't managed to plan much regarding the final deliverables, either (but this is a minor issue).
#What will I do next to fix what didn't work?
#*We will need to meet as a group and discuss the state of the project. I personally feel really good about the work so far and it would be helpful to hear, with the bulk of the work done, how everyone else feels. Additionally, I think that we will need to plan out our approach for the final project as soon as possible. Once we have discussed the project and made a plan, I think that we should set aside some time to work on the group project, as a team. I will check in with the group members on Tuesday, share my major findings for the week, and discuss future courses of action (regarding the last bits of the project).

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''Well I first looked at this article through web of science which LMU does pay for but looking at the article through PubMed, PubMed central and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Vpachec3 Week 14

2015-12-08T06:34:55Z

Vpachec3: /* Sunday, December 6 */ added new saint check table

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the cell for each replicate of the gene individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"
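
The consolidation step can also be sketched outside of Excel. The following is a minimal Python illustration, assuming the four spot values for one gene on one chip have already been read into a list; the function name and the sample numbers are hypothetical and are not taken from our spreadsheet.

```python
from statistics import mean

def consolidate_spots(spot_values):
    """Average the technical-replicate spots for one gene on one chip,
    mirroring =AVERAGE(C2,M2,W2,AG2) in the worksheet."""
    return mean(spot_values)

# Hypothetical scaled-and-centered values for the four spots of one gene
spots = [0.50, 0.70, 0.60, 0.60]
consolidated = consolidate_spots(spots)
print(round(consolidated, 3))
```

Applying the same function to every gene and every chip would give the 5 consolidated biofilm columns and 3 consolidated tobramycin columns described above.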

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.
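
As a small worked example of why subtraction works in log space, the sketch below computes the difference of the two condition averages and converts it back to a fold change. It assumes the values are base-2 logs (the base is not stated above), and the numbers are made up for illustration.

```python
from statistics import mean

def log_ratio(control_logs, treatment_logs):
    """Treatment average minus control average, i.e. the log of the
    treatment/control ratio when the inputs are log values."""
    return mean(treatment_logs) - mean(control_logs)

# Hypothetical consolidated log values for one gene
biofilm = [1.0, 1.2, 0.8, 1.1, 0.9]   # 5 control (biofilm) samples
tobramycin = [1.5, 1.6, 1.4]          # 3 treated (tobramycin) samples

diff = log_ratio(biofilm, tobramycin)  # same as =L2-K2
fold = 2 ** diff                       # assumption: logs are base 2
print(round(diff, 3), round(fold, 3))
```

A positive difference means the gene is higher in the tobramycin-treated samples than in the control.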

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples.
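
In Excel's TTEST, the third argument (2) requests a two-tailed test and the fourth argument (3) requests the two-sample test with unequal variances, often called Welch's t test. The sketch below computes only the Welch t statistic and its approximate degrees of freedom in plain Python; turning these into a p value needs a t-distribution function, which the Python standard library does not provide. The sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)
    for two samples with possibly unequal variances."""
    va = variance(a) / len(a)   # squared standard error of sample a
    vb = variance(b) / len(b)   # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical values: 5 biofilm samples vs. 3 tobramycin samples
t_stat, dof = welch_t([1.0, 1.2, 0.8, 1.1, 0.9], [1.5, 1.6, 1.4])
print(round(t_stat, 2), round(dof, 2))
```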

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculate the p values adjusted with the Bonferroni correction.

Columns copied and pasted from previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous column (although it has nearly the same name) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1; values of 1 or less keep their original value.
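
The two Bonferroni columns can be combined into one step, as in this minimal Python sketch. It assumes the number of tests equals the number of p values supplied; the function name and the example p values are illustrative only.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests and cap the result at 1,
    combining the multiplication column and the IF(...>1,1,...) column."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Hypothetical unadjusted p values
adjusted = bonferroni([0.01, 0.2, 0.5, 0.0001])
print(adjusted)
```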

*Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns on this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)
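
For comparison, the textbook Benjamini-Hochberg adjustment scales each p value by the number of tests divided by its rank (p × m / rank), which is not the same as the p × rank formula recorded above. The Python sketch below shows the textbook version, including the usual step that keeps the adjusted values from decreasing as the rank increases; it is an illustration under those assumptions and not a transcription of our worksheet.

```python
def benjamini_hochberg(p_values):
    """Textbook Benjamini-Hochberg adjusted p values, returned in the
    same order as the input."""
    m = len(p_values)
    # indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps every adjusted value at 1
    # walk from the largest p value down so the results stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values
bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print([round(x, 4) for x in bh])
```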

*The last worksheet we added was named "forGenMAPP". It was specifically named this way so that GenMAPP can recognize and pull the data from this worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. In between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
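
The filter-and-count step above can be reproduced with a few lines of Python. This is a minimal sketch assuming the unadjusted p values have been loaded into a list; the short list of p values is made up, while the final line simply re-checks the 60% figure from the counts reported above.

```python
def count_below(p_values, cutoff):
    """Return (number of p values below the cutoff, percentage of all values)."""
    hits = sum(1 for p in p_values if p < cutoff)
    return hits, 100.0 * hits / len(p_values)

# Hypothetical p values for five genes
p_vals = [0.2, 0.03, 0.004, 0.0005, 0.00002]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    print(cutoff, count_below(p_vals, cutoff))

# Re-checking a reported figure: 4318 genes out of 7251
pct_reported = round(100 * 4318 / 7251)
print(pct_reported)
```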

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
**'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
**'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%
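
The combined fold change and p value filter can be sketched in Python as follows. The row layout (gene ID, log fold change, p value), the gene IDs, and the numbers are hypothetical; the last lines show why a log fold change of 0.25 corresponds to roughly a 20% change, assuming base-2 logs.

```python
def count_changed(rows, log_fc_cutoff=0.25, p_cutoff=0.05):
    """Count genes that are increased or decreased beyond the log fold
    change cut-off while also passing the p value cut-off.
    Each row is (gene_id, log_fold_change, p_value)."""
    up = sum(1 for _, fc, p in rows if fc > log_fc_cutoff and p < p_cutoff)
    down = sum(1 for _, fc, p in rows if fc < -log_fc_cutoff and p < p_cutoff)
    return up, down

# Hypothetical rows standing in for the forGenMAPP sheet
rows = [
    ("geneA", 0.80, 0.001),   # increased and significant
    ("geneB", 0.10, 0.020),   # significant but below the fold change cut-off
    ("geneC", -0.40, 0.030),  # decreased and significant
    ("geneD", -0.90, 0.300),  # large change but not significant
]
up_count, down_count = count_changed(rows)
print(up_count, down_count)

# Assumption: base-2 logs, so a log fold change of 0.25 is about a 19% increase
percent_change = (2 ** 0.25 - 1) * 100
print(round(percent_change))
```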

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols] used in the Vibrio run we did in week 8.

*NOTE: I ran the analysis and viewed the tree for the DECREASED criterion, while Kevin did so for the INCREASED criterion.

[[File:Screenshoterrorwindow12052015.png|800px]]

Caption: This screenshot shows the window of errors produced once the data was run through GenMAPP.

[[File:Expandedtree20151205.png|800px]]

Caption: This is the initial tree that comes up after putting in the criterion in GenMAPP.

[[File:Rankedgeneontologyresults20151205.png|800px]]

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

We took the data from Table 5 of the Van Acker et al. paper [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943].

We transferred the table onto Excel.

We inserted a column between the "Van Acker et al" and "qPCR" columns and called it "US".
NOTE: This spreadsheet is still being worked on thus the headers are not set to these names.

In the "US" column we inserted the "Biofilm_Tobramycin_ratio" values from the forGenMAPP worksheet for the corresponding gene.

We inserted values for genes picked at random, and the column is not yet complete. We stopped because Kevin and I noticed large discrepancies between the values we calculated and the ones reported. Although the directions of change mostly agree, the values themselves are very different.

At this point, we emailed Dr. Dahlquist and Dr. Dionisio for further guidance on our concern.

UPDATE: After we received a response to our conundrum, the sanity check is now complete thanks to the diligent work of [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Kwyllie Kevin Wyllie].

===Sanity Check: Compare individual genes with known data===
OLD VERSION: [[File:KWVP sanitycheck 20151206.xlsx]]

UPDATED VERSION: [[file:KWVP_Sanitycheck.png|center]]

In the spreadsheet, the values considered significant are marked in '''red'''.

==Links==

[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]

GENialOMICS

2015-12-08T06:25:36Z

Vpachec3: /* Veronica Pacheco */ spelling and syntax

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315) titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients] was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline was prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to any additional versions (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
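
The naming convention can be expressed as a small helper, sketched here in Python. The function names are hypothetical, and the sketch assumes "XXXX" is the original filename without its extension, which the rules above do not spell out.

```python
from datetime import date

def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, adding "(n)" for the
    n-th additional version of the same file on the same day."""
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name

def originals_archive_name(initials, week):
    """Name of the compressed zip that holds all original, unmodified files."""
    return f"ORIG_GEN_{initials}{week}"

# Example: three versions of one file from Brandon Litvak in Week 11
# (the date is chosen only for illustration)
names = [project_filename("XXXX", "BL", 11, date(2015, 11, 10), v) for v in range(3)]
print(names)
print(originals_archive_name("BL", 11))
```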
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species] was approved for analysis.

==Journal Club Presentations==
*[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]]
*[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing the prototype version of GenMAPP Builder is on the development machine.
** Set up Eclipse and Java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Perform testing report on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we will attack the writing and presentation portions of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria, which result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. As for statistical significance, we should perhaps consider raising our BH P-value threshold above 0.05, as currently only about 8% of the genes show a significant change. But maybe this is not too low a number.
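
To make the effect of the BH threshold concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment and of how the fraction of significant genes changes with the cutoff. It is an illustration only: the group's analysis was done in Excel, the function names are hypothetical, and the p-values in the example are made up rather than taken from the dataset.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_significant(p_values, threshold=0.05):
    """Fraction of genes whose BH-adjusted p-value is below the threshold."""
    adjusted = benjamini_hochberg(p_values)
    return sum(p < threshold for p in adjusted) / len(adjusted)


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.20]  # made-up unadjusted p-values
    print(benjamini_hochberg(example))
    print(fraction_significant(example, 0.05))
    print(fraction_significant(example, 0.06))
```

Raising the threshold can only keep or increase the number of genes called significant, which is the trade-off being weighed above.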

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work or didn't seem correct was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority, aligned with what was reported; however, the concrete values had large differences.
#What will I do next to fix what didn't work?
#*Initially, we went to make sure our methods were correct. We traced back our steps and made sure the calculations were done correctly. After we double-checked, we then sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if there is no source of error on our part, continue the project with our fold changes.

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but looking at the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GENialOMICS

2015-12-08T06:23:08Z

Vpachec3: /* Reflections */ added my section and answers

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline was prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
==Other Progress==
* A journal article that uses the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing the prototype version of GenMAPP Builder is on the development machine.
** Set up Eclipse and Java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Perform testing report on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF IDs.
* Finished fourth build.
|
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
|
*[[media:Raw_compiled_data_FINAL_KWVP20151507.xlsx|Final processed/formatted microarray data spreadsheet]]
*[[media:For_genMAPP_KWVP20151205.txt|txt form of the forGenMAPP sheet]]
*[[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

==Reflections==
===Anu Varshneya===
*What worked?
**In general, I think our group worked very well together! I think we are all motivated to get this project done well, and are communicating well with each other regarding our progress.
*What didn't work?
**I think for the most part we did a great job. The only idea I have moving forward is a little more planning for how we will attack the writing and presentation portions of the project. I am not concerned about us getting it done on time and with good quality, just that we create a plan of attack soon so that everyone is on the same page. :)
*What will I do next to fix what didn't work?
**Though nothing has really failed, I think we will just talk tomorrow about how we want to approach the writing and the presentation and set up some group work times.

===Kevin Wyllie===
# What worked?
#* Our initial GenMAPP import worked! 284 errors, which, out of 7251, doesn't sound so bad to me!
# What didn't work?
#* Maybe this isn't actually an example of something not working, but our calculated fold changes were quite different (much lower in magnitude) from those reported in Van Acker et al.'s paper. However, they had the same directions and generally showed the same relative trends (i.e., the relatively higher fold changes in the paper were among the higher ones in our data). Also, very few of the genes they considered significant (with their super-lenient criteria, which result in 30% of the genes showing significant changes) were significant by our criteria.
# What will I do next to fix what didn't work?
#* We just need to triple/quadruple check that our data processing protocol is legitimate. Other than that, there's not much we can do in terms of fold changes. As for statistical significance, we should perhaps consider raising our BH P-value threshold above 0.05, as currently only about 8% of the genes show a significant change. But maybe this is not too low a number.

===Veronica Pacheco===
#What worked?
#* Right off the bat, our first run through GenMAPP worked. As expected, there were exceptions and it generated an EX.txt file. There were 284 errors. We then handed over the file to Brandon so we shall see if the number of errors can decrease.
#What didn't work?
#*What we initially thought didn't work, or didn't seem correct, was the fact that our values for fold changes were much smaller than the values reported in Van Acker et al.'s paper. The direction, for the majority of genes, aligned with what was reported; however, the actual values differed greatly.
#What will I do next to fix what didn't work?
#*Initially, we went back to make sure our methods were correct. We retraced our steps and made sure the calculations were done correctly. After we double checked, we sought help from Dr. Dahlquist and Dr. Dionisio. From their response, it seems we should go over it one more time and, if nothing seems wrong, continue the project with our fold changes.

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients. Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it (commercial, for-profit publisher; scientific society; respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral]; or predatory open access organization)? See the list of [http://oaspa.org/membership/members/ Open Access Scholarly Publishers Association members here]. '''American Society for Microbiology, which is a scientific society'''
* Is this article available in print or online only? '''It is available both in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but access through PubMed, PubMed Central, and the publisher web site was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? Two technical replicates were made across 5 biological replicates for the control, and 2 technical replicates across 3 biological replicates for the treatment.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Vpachec3 Week 14

2015-12-08T00:26:16Z

Vpachec3: /* Links */ new links

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"
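The averaging step can also be sketched outside Excel. The following is a minimal Python sketch, assuming the four replicate values for one gene on one chip are already available as a list; the function name and the sample values are hypothetical. Like Excel's AVERAGE, it skips empty cells (represented here as None).

```python
from statistics import mean

def average_replicate_spots(spots):
    """Average the values of one gene's replicate spots on one chip.

    Empty cells (None) are skipped, as Excel's AVERAGE does.
    """
    values = [s for s in spots if s is not None]
    return mean(values)

# Hypothetical values for one gene spotted four times on one chip
# (the cells C2, M2, W2 and AG2 in the formula above).
spots = [0.50, 0.30, 0.40, 0.60]
consolidated = average_replicate_spots(spots)
print(round(consolidated, 2))
```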

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.
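A small worked example may make the log-space argument concrete. This is only a sketch: the ratios are made up, and base 2 is an assumption, since the note above only says the values are in log space.

```python
import math

# Hypothetical linear-scale ratios of each sample to the genomic DNA reference.
biofilm_vs_reference = 2.0
tobramycin_vs_reference = 8.0

# Base 2 is an assumption made for this illustration.
avg_biofilm = math.log2(biofilm_vs_reference)
avg_tobramycin = math.log2(tobramycin_vs_reference)

# Subtracting the logs (as in =L2-K2) ...
log_ratio = avg_tobramycin - avg_biofilm
# ... gives the log of the tobramycin-over-biofilm ratio; the reference cancels out.
direct = math.log2(tobramycin_vs_reference / biofilm_vs_reference)
print(log_ratio, direct)
```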

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples. The last two arguments select a two-tailed test (2) that assumes unequal variances (3).
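To show what an unequal-variance (Welch) t test computes, here is a minimal Python sketch using only the standard library. It returns the t statistic and the approximate degrees of freedom, not the p value itself; the p value would come from the t distribution with those degrees of freedom (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False` reports it). The function name and sample values are hypothetical.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of group a
    se2_b = variance(sample_b) / n_b  # squared standard error of group b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical values: 5 biofilm samples and 3 tobramycin samples for one gene.
biofilm = [1.0, 2.0, 3.0, 4.0, 5.0]
tobramycin = [6.0, 7.0, 8.0]
t, df = welch_t(biofilm, tobramycin)
print(round(t, 3), round(df, 3))
```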

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculate the p values with the Bonferroni adjustment.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous column (although it is named almost the same way) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1; values less than or equal to 1 keep their original value in the new column.
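The two Bonferroni columns can be combined into one step. The sketch below assumes the unadjusted p values are in a list; the multiplier 7252 is taken from the formula above, and the function name and sample p values are hypothetical.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p value by the number of tests and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical unadjusted p values; 7252 is the multiplier used in the worksheet.
adjusted = bonferroni([0.000001, 0.0001, 0.5], n_tests=7252)
print(adjusted)
```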

*Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns in this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)
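The recorded formula multiplies each p value by its rank. The Benjamini-Hochberg adjustment as it is commonly defined instead multiplies each p value by the number of tests and divides by the rank. The sketch below shows that common form for comparison only; it is not the calculation used in the worksheet, and the function name and sample p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted in ascending order (rank 1 = smallest).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(a, 4) for a in adj])
```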

*The last worksheet we added was named "forGenMAPP". It was given this name so that the program GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. Between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.
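As an illustration of the layout, the sketch below builds a tab-delimited table with the SystemCode column placed right after the ID column. It is a simplified sketch: only a few of the columns listed above are included, and the gene IDs and values are hypothetical.

```python
import csv
import io

def add_system_code(rows, code="N"):
    """Insert a SystemCode value right after the ID in each row."""
    return [[row[0], code] + list(row[1:]) for row in rows]

header = ["ID", "SystemCode", "Biofilm_Tobramycin_ratio", "Pvalue"]
# Hypothetical gene IDs and values, for illustration only.
rows = [["BCAL0001", 0.31, 0.002], ["BCAL0002", -0.12, 0.40]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(add_system_code(rows))
text = buffer.getvalue()
print(text)
```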

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
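The filtering and the percentages above can be double-checked with a short script. This is a sketch: the counts and the total of 7251 come from the answers above, while the function names and the small list of p values are hypothetical.

```python
GENE_COUNT = 7251  # total number of genes, as stated above

# Counts reported above for each unadjusted p value cut-off.
reported = {0.05: 4318, 0.01: 2971, 0.001: 1460, 0.0001: 645}

def percent_of_genes(count, total=GENE_COUNT):
    """Express a gene count as a percentage of all genes."""
    return 100.0 * count / total

def count_below(p_values, cutoff):
    """Count p values strictly below the cut-off, like the custom filter."""
    return sum(1 for p in p_values if p < cutoff)

percentages = {cutoff: round(percent_of_genes(n)) for cutoff, n in reported.items()}
print(percentages)

# Hypothetical p values, just to show the filter step itself.
example = [0.00005, 0.0004, 0.003, 0.02, 0.2, 0.6, 0.04, 0.9]
print(count_below(example, 0.05))
```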

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
**'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
**'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%
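The link between a log fold change of 0.25 and "about a 20% fold change" can be checked directly. This sketch assumes the log fold changes are base 2, which the text above does not state explicitly; the function name is hypothetical.

```python
def log2_to_fold_change(log2_ratio):
    """Convert a log2 ratio back to a linear-scale fold change."""
    return 2.0 ** log2_ratio

up = log2_to_fold_change(0.25)     # roughly a 19% increase
down = log2_to_fold_change(-0.25)  # roughly a 16% decrease
print(round(up, 3), round(down, 3))
```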

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols] used in the Vibrio run we did in week 8.

*NOTE: I ran the data looking at the tree through the DECREASED perspective while Kevin was through the INCREASED perspective.

[[File:Screenshoterrorwindow12052015.png|800px]]

Caption: This screenshot shows the window of errors produced once the data was run through GenMAPP.

[[File:Expandedtree20151205.png|800px]]

Caption: This is the initial tree that comes up after entering the criterion in GenMAPP.

[[File:Rankedgeneontologyresults20151205.png|800px]]

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

We took the data from the Van Acker et al. paper, which is presented in Table 5 [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943].

We transferred the table onto Excel.

We inserted a column between the "Van Acker et al" column and the "qPCR" column and called it "Us".
NOTE: This spreadsheet is still being worked on, so the headers are not yet set to these names.

In the "Us" column we inserted the "Biofilm_Tobramycin_ratio" values from the forGenMAPP worksheet for the corresponding gene.

We picked genes at random to fill in, and the column is not yet complete. We stopped because Kevin and I noticed that there are large discrepancies between the values we calculated and the ones reported. Although the directions mostly agree, the values themselves are very different.

At this point, we emailed Dr. Dahlquist and Dr. Dionisio for further instruction regarding our concern.

===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

In the spreadsheet, the values considered significant are marked in '''red'''.

==Links==

[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}
[[Category: Journal Entry]]

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-08T00:25:12Z

Vpachec3: /* Links */ added template Vpachec3 journal links


Vpachec3 Week 14

2015-12-08T00:23:49Z

Vpachec3: /* Sunday, December 6 */ added protocol for sunday

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA, so each value is a log ratio against that reference. To compare treatment with control we need the ratio of the tobramycin average to the biofilm (control) average. Because the numbers are in log space, this ratio is obtained by subtracting the biofilm average from the tobramycin average.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test assuming unequal variances (the arguments 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.
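The statistics columns for one gene can be sketched in Python as follows. This is only an illustration with invented values: it computes the log ratio (tobramycin average minus biofilm average) and the unequal-variance (Welch) t statistic with its degrees of freedom. It stops short of the p value, which needs a t-distribution function; a library call such as scipy.stats.ttest_ind with equal_var=False would be one way to get it. The function names are our own.

```python
from math import sqrt
from statistics import mean, variance

def log_ratio(control, treated):
    """Difference of averages; valid as a ratio because values are logs."""
    return mean(treated) - mean(control)

def welch_t(a, b):
    """Return the unequal-variance t statistic and its degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of sample a
    vb = variance(b) / len(b)   # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Invented log-space values for one gene: 5 biofilm and 3 tobramycin samples.
biofilm = [0.1, 0.2, 0.0, 0.3, 0.4]
tobramycin = [1.0, 1.2, 1.1]

ratio = log_ratio(biofilm, tobramycin)
t_stat, dof = welch_t(biofilm, tobramycin)
print(ratio, t_stat, dof)
```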

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the p values with the Bonferroni adjustment.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes each value from the previous boneferroni_p_value column and replaces any value greater than 1 with 1; values of 1 or less keep their original value.
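The two Bonferroni columns amount to "multiply by the number of tests, then cap at 1". A minimal Python sketch of that idea, with a function name and example p values that are our own:

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p value by the number of tests and cap the result at 1."""
    n = len(p_values) if n_tests is None else n_tests
    return [min(p * n, 1.0) for p in p_values]

# Invented p values; with three tests each one is multiplied by 3.
adjusted = bonferroni([0.01, 0.2, 0.5])
print(adjusted)
```

Passing n_tests explicitly corresponds to typing the multiplier into the spreadsheet formula instead of counting the rows.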

*We added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was extended down the rest of the column using the black square shortcut. This is the first new column in this worksheet; the remaining new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that the column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)
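For comparison, the Benjamini and Hochberg adjustment as it is usually described sorts the p values in ascending order, multiplies each one by the number of tests and divides by its rank, and then keeps the adjusted values from decreasing as the rank decreases. The Python sketch below follows that textbook description; it is not a transcription of the spreadsheet formulas above, and the function name and example values are our own.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    n = len(p_values)
    # Indices of the p values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0  # also caps every adjusted value at 1
    # Walk from the largest p value (rank n) down to the smallest (rank 1).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(bh)
```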

*The last worksheet we added was named "forGenMAPP". It was named this way so that the program GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

Only one new column was added. Between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was given the value N.
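The layout of this worksheet (ID, then SystemCode set to N, then the data columns) can be sketched as a tab-delimited export in Python. The gene ID, the choice of columns and the function name below are hypothetical and only illustrate the column order; this is not the file we actually imported.

```python
import csv
import io

def write_genmapp_rows(rows, value_columns, out):
    """Write ID, a constant SystemCode of N, then the chosen value columns."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "SystemCode"] + value_columns)
    for gene_id, values in rows:
        writer.writerow([gene_id, "N"] + [values[c] for c in value_columns])

# Hypothetical single-gene example with invented values.
columns = ["Biofilm_Tobramycin_ratio", "Pvalue"]
rows = [("GENE_0001", {"Biofilm_Tobramycin_ratio": 0.9, "Pvalue": 0.001})]
buffer = io.StringIO()
write_genmapp_rows(rows, columns, buffer)
text = buffer.getvalue()
print(text)
```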

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
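The arithmetic behind these answers is simple enough to check in a few lines of Python. The sketch below uses the gene total of 7251 quoted in the questions above and the count we recorded for p < 0.05; the function names are our own.

```python
def expected_by_chance(n_tests, alpha):
    """Number of tests expected to pass the cut-off if nothing had changed."""
    return n_tests * alpha

def percent_of_total(count, total):
    """Share of genes passing a filter, as a percentage."""
    return 100.0 * count / total

TOTAL_GENES = 7251  # total used for the percentages in this journal

chance = expected_by_chance(TOTAL_GENES, 0.05)
observed = percent_of_total(4318, TOTAL_GENES)
print(round(chance), round(observed))
```

Comparing the number expected by chance with the number actually observed is what tells us that at least some of the changes are real.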

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
***1519 genes which is 21%
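The combined filters can be sketched in Python as a count over (log ratio, p value) pairs. The rows below are invented, the function name is our own, and the conversion at the end assumes the log ratios are base 2, which this journal does not state explicitly.

```python
def count_genes(rows, p_cutoff, log_fc_cutoff, increased=True):
    """Count rows with p < p_cutoff and a log ratio beyond the cut-off."""
    count = 0
    for log_ratio, p_value in rows:
        if p_value >= p_cutoff:
            continue  # fails the p value filter
        if increased and log_ratio > log_fc_cutoff:
            count += 1
        elif not increased and log_ratio < -log_fc_cutoff:
            count += 1
    return count

# Invented (log ratio, p value) pairs standing in for the spreadsheet rows.
rows = [(0.40, 0.001), (0.10, 0.020), (-0.30, 0.010), (-0.05, 0.200), (0.90, 0.300)]

up = count_genes(rows, 0.05, 0.25, increased=True)
down = count_genes(rows, 0.05, 0.25, increased=False)

# Assuming base-2 logs, a log ratio of 0.25 is roughly a 19% increase.
fold = 2 ** 0.25
print(up, down, round(fold, 3))
```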

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols] used in the Vibrio run we did in week 8.

*NOTE: I ran the data and viewed the tree from the DECREASED perspective, while Kevin did so from the INCREASED perspective.

[[File:Screenshoterrorwindow12052015.png|800px]]

Caption: This screenshot shows the window of errors produced once the data was run through GenMAPP.

[[File:Expandedtree20151205.png|800px]]

Caption: This is the initial tree that comes up after entering the criterion in GenMAPP.

[[File:Rankedgeneontologyresults20151205.png|800px]]

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

We took the data from Table 5 of the Van Acker et al. paper [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943].

We transferred the table into Excel.

We inserted a column between the "Van Acker et al" column and the "qPCR" column and called it "Us".
NOTE: This spreadsheet is still being worked on, so the headers are not yet set to these names.

In the "Us" column we inserted the "Biofilm_Tobramycin_ratio" values from the forGenMAPP worksheet for the corresponding genes.

The values were inserted for genes picked at random, and the column is not yet complete. We stopped because Kevin and I noticed large discrepancies between the values we have transformed and the ones reported. Although the direction of change seems to be almost complementary, the values themselves are very different.

At this point, we emailed Dr. Dahlquist and Dr. Dionisio for further instruction regarding our concern.

===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

In the spreadsheet, the values considered significant are marked in '''red'''.

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-08T00:09:00Z

Vpachec3: /* Saturday, December 5 */ added description to photos

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test assuming unequal variances (the arguments 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the p values with the Bonferroni adjustment.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes each value from the previous boneferroni_p_value column and replaces any value greater than 1 with 1; values of 1 or less keep their original value.

*We added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was extended down the rest of the column using the black square shortcut. This is the first new column in this worksheet; the remaining new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that the column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)

*The last worksheet we added was named "forGenMAPP". It was named this way so that the program GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

Only one new column was added. Between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was given the value N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
***1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol [http://www.openwetware.org/wiki/BIOL367/F10:GenMAPP_and_MAPPFinder_Protocols] used in the Vibrio run we did in week 8.

*NOTE: I ran the data and viewed the tree from the DECREASED perspective, while Kevin did so from the INCREASED perspective.

[[File:Screenshoterrorwindow12052015.png|800px]]

Caption: This screenshot shows the window of errors produced once the data was run through GenMAPP.

[[File:Expandedtree20151205.png|800px]]

Caption: This is the initial tree that comes up after entering the criterion in GenMAPP.

[[File:Rankedgeneontologyresults20151205.png|800px]]

Caption: After collapsing the expanded tree, we pushed the "Show Ranked List" button to get the Gene Ontology Results as seen in this screenshot.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-08T00:02:15Z

Vpachec3: /* Thursday,December 3 */ format fixing

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test assuming unequal variances (the arguments 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the p values with the Bonferroni adjustment.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes each value from the previous boneferroni_p_value column and replaces any value greater than 1 with 1; values of 1 or less keep their original value.

*We added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was extended down the rest of the column using the black square shortcut. This is the first new column in this worksheet; the remaining new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that the column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)

*The last worksheet we added was named "forGenMAPP". It was named this way so that the program GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

Only one new column was added. Between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was given the value N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
***1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-08T00:01:32Z

Vpachec3: /* Thursday,December 3 */ format fixing

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This would all be on the first sheet named "quadruplicate_spots_separated"

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test assuming unequal variances (the arguments 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the p values with the Bonferroni adjustment.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes each value from the previous boneferroni_p_value column and replaces any value greater than 1 with 1; values of 1 or less keep their original value.

*We added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns on this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)
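
For comparison, here is a minimal sketch of the textbook Benjamini-Hochberg adjustment. It is not a transcription of the worksheet: the standard form scales each sorted p value by the number of tests divided by its rank and then keeps the adjusted values from decreasing as the rank falls, whereas the formula recorded above multiplies by the rank only. The function name and example values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest (rank 1 = smallest).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps every adjusted value at 1
    # Walk from the largest p value down so each adjusted value is no larger
    # than the adjusted value of the next-higher rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values, deliberately not sorted.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Sorting by index keeps each adjusted value attached to its own gene, so no separate re-sort is needed afterwards.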

#The last worksheet we added was named "forGenMAPP". It was given this name so that GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. In between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
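
The percentages above can be reproduced from the recorded counts. This is a small arithmetic sketch; the helper name is hypothetical, and the counts and the denominator of 7251 are the ones listed in the answers.

```python
TOTAL_GENES = 7251  # denominator used in the questions above

# Counts recorded above for each unadjusted p value cut-off.
counts_by_cutoff = {0.05: 4318, 0.01: 2971, 0.001: 1460, 0.0001: 645}

def percent_of_total(count, total=TOTAL_GENES):
    """Return count as a percentage of total, rounded to the nearest whole percent."""
    return round(100 * count / total)

percentages = {cutoff: percent_of_total(n) for cutoff, n in counts_by_cutoff.items()}
```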

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
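
The "5% of our T tests, or 261 times" reasoning is a one-line calculation. The sketch below uses the 5221 tests quoted in the text and the recorded count of genes with p < 0.05; the function name is hypothetical.

```python
def expected_by_chance(n_tests, cutoff):
    """Number of tests expected to pass the cut-off when no gene truly changes."""
    return n_tests * cutoff

expected = expected_by_chance(5221, 0.05)  # about 261, the figure quoted above
observed = 4318                            # genes recorded above with p < 0.05
excess = observed - expected               # how far the observed count exceeds chance
```

Because the observed count is far above the chance expectation, some genes are significantly changed, but this comparison alone does not say which ones.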

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%
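
The "about a 20% fold change" remark can be checked by converting the log cut-off back to a ratio. This sketch assumes the log ratios are base 2, which the journal does not state explicitly.

```python
LOG_CUTOFF = 0.25  # cut-off used in the filters above

# Assumption: the log ratios are base 2.
fold_increase = 2 ** LOG_CUTOFF    # ratio for a log change of +0.25
fold_decrease = 2 ** -LOG_CUTOFF   # ratio for a log change of -0.25

percent_increase = (fold_increase - 1) * 100  # roughly 19% higher than control
percent_decrease = (1 - fold_decrease) * 100  # roughly 16% lower than control
```

Under that assumption a cut-off of 0.25 corresponds to an increase of just under 20%, in line with the figure given in the question.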

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-08T00:00:45Z

Vpachec3: /* Thursday,December 3 */

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This was all done on the first sheet, named "quadruplicate_spots_separated".

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test with unequal variance (the last two arguments, 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the Bonferroni-adjusted p values.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1; values of 1 or less keep their original value.

#Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns on this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)

#The last worksheet we added was named "forGenMAPP". It was given this name so that GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. In between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T23:59:22Z

Vpachec3: /* Thursday,December 3 */ format fixing

== Thursday,December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This was all done on the first sheet, named "quadruplicate_spots_separated".

*We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

'''NOTE''': The function used is a two-tailed, two-sample t test with unequal variance (the last two arguments, 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

''FROM THIS POINT ON, ONLY SPECIAL PASTE. VALUES ONLY.''

* We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the Bonferroni-adjusted p values.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1; values of 1 or less keep their original value.

#Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns on this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)

#The last worksheet we added was named "forGenMAPP". It was given this name so that GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. In between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T23:57:59Z

Vpachec3: /* Thursday,December 3 */ added all protocol

== Thursday,December 3==
#Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 columns for biofilms.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 columns for tobramycin.

This was all done on the first sheet, named "quadruplicate_spots_separated".

#We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns

Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns

Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column

Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA which means that we need the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Subtracting the biofilm average from the tobramycin average works because the numbers are in log space.

Pvalue column

Formula:=TTEST(C2:G2,H2:J2,2,3)

NOTE: The function used is a two-tailed, two-sample t test with unequal variance (the last two arguments, 2 and 3), comparing the 5 biofilm samples to the 3 tobramycin samples.

FROM THIS POINT ON, ONLY SPECIAL PASTE. VALUES ONLY.
# We then started a new worksheet and called it "boneferroni_pval". This is where we calculated the Bonferroni-adjusted p values.

Columns copied and pasted from the previous sheet: ID, MasterIndex, Biofilm_Tobramycin_ratio, and p_value.

The following are the new columns added onto this sheet and their respective formulas:

boneferroni_p_value

Formula: =D2*7252

'''NOTE''': This formula was applied only to the first cell in the column and we used the small black square at the bottom of the cell to have it apply accordingly to the rest of the cells in the column.

bonferroni_p_value

Formula:=IF(E2>1,1,E2)

'''NOTE''': This column differs from the previous one (although it has nearly the same name) because the function takes the values from the previous boneferroni_p_value column and replaces every value greater than 1 with 1; values of 1 or less keep their original value.

#Added another worksheet and named it "BH_pval". This worksheet holds the Benjamini and Hochberg p value calculations.

Columns copied and pasted from the previous worksheet: ID, MasterIndex and p_value

Before adding any new columns, we organized the p_value column in ascending order.

We then added a "Rank" column and put a 1 in the first cell and a 2 in the second cell. The numbering was applied all the way down for the rest of the cells using the black box shortcut. This is the first of the new columns on this worksheet; the rest of the new ones are as follows:

BH_p_value

Formula:=C2*D2

'''NOTE''': This formula shows that this column is the p_value multiplied by the rank. Again, it is important to note that the p values are in ascending order, so the smallest p value is multiplied by the smallest rank.

BH_Pvalue

Formula:=IF(E2>1,1,E2)

#The last worksheet we added was named "forGenMAPP". It was given this name so that GenMAPP can recognize and pull the data from this specific worksheet.

Columns copied and pasted from previous worksheets: ID, all consolidated columns for the biofilms and tobramycin, Avg_Biofilm_scaled_centered, Avg_Tobramycin_scaled_centered, Biofilm_Tobramycin_ratio, Pvalue, Bonferroni_Pvalue and BH_Pvalue.

'''NOTE''': The bonferroni and BH columns copied and pasted were the plain formula columns, ''not'' the IF statement columns.

There was only one new column added. In between the ID column and the first consolidated biofilm column, we added a column called "SystemCode". This was for formatting purposes for GenMAPP. Every cell in that column was set to N.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T23:10:35Z

Vpachec3: /* Thursday,December 3 */ Added more protocol

==Thursday, December 3==
#Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates, and they should be averaged before doing any further analysis.
We therefore averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then dragged the fill handle (the small black square at the bottom-right corner of the cell) down the column so that the formula repeats and adjusts for the remaining rows.
We named each biofilm column "consolidated_Biofilm_#_scaled_centered_4". There are 5 biofilm columns.
We named each tobramycin column "consolidated-Tobramycin_#_scaled and centered_4". There are 3 tobramycin columns.

All of this is on the first sheet, named "quadruplicate_spots_separated".
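
The replicate averaging can also be sketched outside Excel. This is only an illustration of the same arithmetic as =AVERAGE(C2,M2,W2,AG2); the function name and the example values are hypothetical and are not taken from our spreadsheet.

```python
from statistics import mean

def consolidate_replicates(row, replicate_columns):
    """Average one gene's technical replicates, like =AVERAGE(C2,M2,W2,AG2)."""
    return mean(row[col] for col in replicate_columns)

# Hypothetical log ratios for one gene; the keys mimic the spreadsheet column letters.
example_row = {"C": 0.42, "M": 0.38, "W": 0.45, "AG": 0.35}

consolidated = consolidate_replicates(example_row, ["C", "M", "W", "AG"])
print(round(consolidated, 2))
```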

#We then started a new worksheet and called it "statistics"
On this page we have the following information from the previous worksheet: ID column, MasterIndex Column, the consolidated biofilms columns and the consolidated tobramycin columns.
The additional columns added to the statistics worksheet are the following:

Average of the Consolidated biofilm columns
Formula:=AVERAGE(C2:G2)

Average of the Consolidated tobramycin columns
Formula:=AVERAGE(H2:J2)

Biofilm_Tobramycin_ratio column
Formula: =L2-K2

'''NOTE''': Our reference sample is genomic DNA, so we need the ratio of the tobramycin average to the biofilm average (tobramycin over biofilm, with biofilm as the control). Because the values are in log space, that ratio is obtained by subtracting the biofilm average (column K) from the tobramycin average (column L).
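
A small sketch of why subtraction works in log space, assuming the values are logarithms of the measured ratios. The sample values and the function name are hypothetical; only the "tobramycin average minus biofilm average" step comes from the formula above.

```python
import math
from statistics import mean

def log_ratio(treatment_logs, control_logs):
    """Mean treatment log value minus mean control log value (like =L2-K2)."""
    return mean(treatment_logs) - mean(control_logs)

# The identity behind the subtraction: log(a / b) = log(a) - log(b).
identity_holds = math.isclose(math.log2(8 / 2), math.log2(8) - math.log2(2))

# Hypothetical consolidated log values for one gene.
biofilm = [1.0, 1.2, 0.8, 1.1, 0.9]   # 5 control samples
tobramycin = [1.5, 1.7, 1.6]          # 3 treated samples

ratio = log_ratio(tobramycin, biofilm)
print(identity_holds, round(ratio, 2))
```

A positive result means higher expression with tobramycin than in the biofilm control, and a negative result means lower expression.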

Pvalue column
Formula:=TTEST(C2:G2,H2:J2,2,3)

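'''NOTE''': TTEST here performs a two-sample t test comparing the 5 biofilm samples (C2:G2) to the 3 tobramycin samples (H2:J2). The third argument (2) requests a two-tailed test, and the fourth (3) selects the two-sample unequal-variance form. The function returns the p value directly.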
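
For reference, here is a rough sketch of the statistic behind an unequal-variance (Welch) two-sample t test, which is what type 3 of TTEST selects. It uses only the Python standard library, so it stops at the t statistic and the degrees of freedom and does not compute the p value. The sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical log values for one gene: 5 biofilm and 3 tobramycin samples.
biofilm = [1.0, 1.2, 0.8, 1.1, 0.9]
tobramycin = [1.5, 1.7, 1.6]

t_stat, dof = welch_t(biofilm, tobramycin)
print(t_stat, dof)
```

A two-tailed p value would then come from a t distribution with that many degrees of freedom, which is the step Excel performs internally.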

We then computed the Bonferroni and Benjamini and Hochberg corrected p values as we did in the Vibrio exercise.
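
As a cross-check on the spreadsheet, the two corrections can be sketched in a few lines. This uses the standard textbook definitions (Bonferroni: multiply by the number of tests; Benjamini-Hochberg: multiply by the number of tests divided by the rank, then enforce monotonicity). The spreadsheet steps from the Vibrio exercise may differ in detail, and the p values below are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

def benjamini_hochberg(p_values):
    """Step-up adjusted p values: p * m / rank, made monotone from the largest p down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf)
print(bh)
```

Every Benjamini-Hochberg value is less than or equal to the matching Bonferroni value, which is why fewer genes pass the Bonferroni cut-off.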

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
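
The percentages above are each count divided by the 7251 genes in the dataset. A short sketch that recomputes them from the counts reported in this entry (the function name is just for illustration):

```python
TOTAL_GENES = 7251  # denominator used throughout this sanity check

# Counts reported above for each unadjusted p value cut-off.
counts = {
    "p < 0.05": 4318,
    "p < 0.01": 2971,
    "p < 0.001": 1460,
    "p < 0.0001": 645,
}

def percent_of_total(count, total=TOTAL_GENES):
    """Percentage of all genes represented by count."""
    return 100 * count / total

for label, count in counts.items():
    print(f"{label}: {count} genes, {percent_of_total(count):.1f}%")
```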

* When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change in expression, we would see a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that, by chance alone, we would expect about 5% of our t tests, or about 363 genes, to show a change of this magnitude. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
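
The "expected by chance" reasoning is simple arithmetic: the number of tests times the cut-off. The sketch below assumes one t test per gene, using the 7251 genes that serve as the denominator for the percentages in this entry. That count is an assumption about our dataset, not a value read from the spreadsheet.

```python
def expected_by_chance(n_tests, alpha=0.05):
    """Number of tests expected to pass the cut-off if no gene truly changed."""
    return n_tests * alpha

N_TESTS = 7251          # assumption: one t test per gene in the dataset
OBSERVED_P_05 = 4318    # genes with unadjusted p < 0.05, reported above

expected = expected_by_chance(N_TESTS)
print(f"expected by chance: about {expected:.0f}, observed: {OBSERVED_P_05}")
```

Because the observed count is far above the expected count, at least some of the changes are real, but this comparison alone does not say which ones.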

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%
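
The combined filters can be sketched as a count over rows that pass both the p value cut-off and a fold change condition. The rows below are hypothetical stand-ins for the "forGenMAPP" tab; only the column names "Pvalue" and "biofilm_tobramycin_ratio" come from our spreadsheet.

```python
def count_genes(rows, p_cutoff=0.05, above=None, below=None):
    """Count rows with Pvalue < p_cutoff whose ratio is > above and/or < below."""
    n = 0
    for row in rows:
        if row["Pvalue"] >= p_cutoff:
            continue  # fails the p value filter
        ratio = row["biofilm_tobramycin_ratio"]
        if above is not None and not ratio > above:
            continue
        if below is not None and not ratio < below:
            continue
        n += 1
    return n

# Hypothetical rows; real data would come from the exported spreadsheet.
rows = [
    {"ID": "gene1", "Pvalue": 0.001, "biofilm_tobramycin_ratio": 0.80},
    {"ID": "gene2", "Pvalue": 0.030, "biofilm_tobramycin_ratio": 0.10},
    {"ID": "gene3", "Pvalue": 0.200, "biofilm_tobramycin_ratio": 1.50},
    {"ID": "gene4", "Pvalue": 0.004, "biofilm_tobramycin_ratio": -0.40},
    {"ID": "gene5", "Pvalue": 0.049, "biofilm_tobramycin_ratio": -0.05},
]

increased = count_genes(rows, above=0.0)
decreased = count_genes(rows, below=0.0)
print(increased, decreased)
```

Since both counts keep the p < 0.05 filter on, the increased and decreased counts together cannot exceed the number of genes with p < 0.05.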

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%
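
To see where "about 20%" comes from, a log fold change can be converted back to a linear change. This sketch assumes the log values are base 2, which is common for microarray data but is not stated in this entry.

```python
def percent_change(log_fold_change, base=2.0):
    """Convert a log fold change to a percent change on the linear scale."""
    return (base ** log_fold_change - 1) * 100

up = percent_change(0.25)     # increase for a log fold change of +0.25
down = percent_change(-0.25)  # decrease for a log fold change of -0.25
print(f"+0.25 -> {up:.1f}%, -0.25 -> {down:.1f}%")
```

Under that assumption, +0.25 is roughly a 19% increase and -0.25 is roughly a 16% decrease, so "about 20%" is an approximation.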

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T06:43:57Z

Vpachec3: /* Saturday, December 5 */ added screenshots

==Thursday, December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates, and they should be averaged before doing any further analysis.
*We therefore averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change in expression, we would see a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that, by chance alone, we would expect about 5% of our t tests, or about 363 genes, to show a change of this magnitude. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

[[File:Screenshoterrorwindow12052015.png|800px]]

[[File:Expandedtree20151205.png|800px]]

[[File:Rankedgeneontologyresults20151205.png|800px]]

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

File:Screenshoterrorwindow12052015.png

2015-12-07T06:42:37Z

Vpachec3:

File:Expandedtree20151205.png

2015-12-07T06:41:35Z

Vpachec3:

File:Rankedgeneontologyresults20151205.png

2015-12-07T06:40:07Z

Vpachec3:

Vpachec3 Week 14

2015-12-07T05:45:01Z

Vpachec3: Format correction

==Thursday, December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates, and they should be averaged before doing any further analysis.
*We therefore averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change in expression, we would see a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that, by chance alone, we would expect about 5% of our t tests, or about 363 genes, to show a change of this magnitude. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.
===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T05:39:05Z

Vpachec3: /* Sanity Check: Compare individual genes with known data */ added spreadsheet

==Thursday, December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates, and they should be averaged before doing any further analysis.
*We therefore averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change in expression, we would see a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance. Another way to state what we are seeing with p < 0.05 is that, by chance alone, we would expect about 5% of our t tests, or about 363 genes, to show a change of this magnitude. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since many more than 363 genes pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent; the Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%

===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T05:37:47Z

Vpachec3: /* Sanity Check: Compare individual genes with known data */

==Thursday, December 3==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates, and they should be averaged before doing any further analysis.
*We therefore averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

==Sunday, December 6==
Kevin and I met up to do our second sanity check, comparing the numbers from our analysis to the data reported in the paper.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
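The relationship between the two corrections can be illustrated with a short sketch. This is not the spreadsheet protocol we followed; it is the standard textbook form of each correction, written under the assumption that the unadjusted p values are available as a Python list. The function names and the toy p values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Standard Benjamini-Hochberg adjusted p values, in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value to the smallest so that each adjusted
    # value never exceeds the one ranked above it.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p values for illustration only (not from our dataset).
toy_p = [0.001, 0.008, 0.012, 0.02, 0.6]
bonf = bonferroni(toy_p)
bh = benjamini_hochberg(toy_p)
print(sum(p < 0.05 for p in bonf), "genes pass Bonferroni")
print(sum(p < 0.05 for p in bh), "genes pass Benjamini-Hochberg")
```

Because the Benjamini-Hochberg adjustment divides by the rank of each p value, its adjusted values are never larger than the Bonferroni ones, which is why more genes pass it.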

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%
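
To see why a log fold change of 0.25 corresponds to roughly a 20% change, the sketch below converts a log ratio back to a fold change and applies the combined filter. It assumes the log ratios are base 2, which is not stated explicitly above; the gene IDs, values, and function names are hypothetical.

```python
def fold_change(log2_ratio):
    """Convert a log2 ratio back to a linear fold change."""
    return 2.0 ** log2_ratio


def filter_increased(rows, p_cutoff, min_log_ratio):
    """Keep gene IDs with p below the cutoff and log ratio above the threshold."""
    return [gene for gene, ratio, p in rows if p < p_cutoff and ratio > min_log_ratio]


def filter_decreased(rows, p_cutoff, max_log_ratio):
    """Keep gene IDs with p below the cutoff and log ratio below the threshold."""
    return [gene for gene, ratio, p in rows if p < p_cutoff and ratio < max_log_ratio]


# Toy rows of (gene ID, log ratio, unadjusted p value); illustration only.
rows = [
    ("geneA", 0.40, 0.01),
    ("geneB", 0.10, 0.001),
    ("geneC", -0.50, 0.02),
    ("geneD", 0.30, 0.20),
]

print(f"log ratio +0.25 is a fold change of {fold_change(0.25):.3f}")
print(f"log ratio -0.25 is a fold change of {fold_change(-0.25):.3f}")
print("increased:", filter_increased(rows, 0.05, 0.25))
print("decreased:", filter_decreased(rows, 0.05, -0.25))
```

Under the base 2 assumption, a log ratio of 0.25 is a fold change of about 1.19, which is where the "about 20%" figure comes from.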

===Sanity Check: Compare individual genes with known data===
[[File:KWVP sanitycheck 20151206.xlsx|200px|thumb|left|Table of our data compared to the data reported in the Van Acker et al. paper. The significant values are colored red.]]

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

File:KWVP sanitycheck 20151206.xlsx

2015-12-07T05:34:21Z

Vpachec3:

Vpachec3 Week 14

2015-12-07T05:32:07Z

Vpachec3: /* Sanity Check: Number of genes significantly changed */ changing format of bullet points

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on what was said, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
*Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.
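
The averaging and subtraction steps above can be sketched for a single gene as follows. This is only an illustration under stated assumptions: each sample is represented as a list of its four technical replicate spots (already in log space), the numbers are made up, and the function names are hypothetical. The t test itself is left to Excel's TTEST function as described above.

```python
from statistics import mean


def sample_averages(samples):
    """Average the technical replicate spots within each sample."""
    return [mean(spots) for spots in samples]


def log_ratio(biofilm_samples, tobramycin_samples):
    """Tobramycin over biofilm; a ratio becomes a subtraction in log space."""
    biofilm_avg = mean(sample_averages(biofilm_samples))
    tobramycin_avg = mean(sample_averages(tobramycin_samples))
    return tobramycin_avg - biofilm_avg


# Toy log-space values for one gene: 5 biofilm and 3 tobramycin samples,
# each with 4 technical replicate spots (illustration only).
biofilm = [
    [0.5, 0.5, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.5, 0.5, 1.0],
    [1.5, 0.5, 1.0, 1.0],
    [0.5, 1.0, 0.5, 1.0],
]
tobramycin = [
    [2.0, 2.0, 1.5, 2.5],
    [1.5, 1.5, 2.0, 2.0],
    [2.0, 2.5, 2.5, 2.0],
]

print("biofilm sample averages:", sample_averages(biofilm))
print("tobramycin sample averages:", sample_averages(tobramycin))
print("log ratio (tobramycin - biofilm):", round(log_ratio(biofilm, tobramycin), 3))
```

Keeping the per-sample averages separate from the per-treatment averages mirrors the spreadsheet, where the per-sample averages are also the inputs to the t test.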

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance, one per gene. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

*In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

*The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
**Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
*3279 genes which is 45%

Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
*3127 genes which is 43%

'''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
*1613 genes which is 22%

'''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
*1519 genes which is 21%

===Sanity Check: Compare individual genes with known data===

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-07T05:28:39Z

Vpachec3: added more added sections and deleted old info in second sanity check section

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on what was said, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist made the point that our chip has each gene spotted in quadruplicate. These are considered technical replicates and they should be averaged before doing any further analysis.
*Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like

=AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

==Saturday, December 5==
Kevin and I met up to run our data through GenMAPP.
We took the .gdb file uploaded by Brandon, our QA, and followed the protocol used in the Vibrio run we did in week 8.

==Sunday, December 6==
Kevin and I met up to do our second sanity check. This is where we would compare the numbers we got from our data to the data reported in the paper.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that we would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 t tests for significance, one per gene. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

* The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero.
'''''How many are there? (and %)'''''
***3279 genes which is 45%

** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero.
'''''How many are there? (and %)'''''
***3127 genes which is 43%

** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%

** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
***1519 genes which is 21%

===Sanity Check: Compare individual genes with known data===

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

GENialOMICS

2015-12-07T05:19:20Z

Vpachec3: /* Individual Goals and Progress */ added our progress

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline was prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished the PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11).
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
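
The naming convention above can be expressed as a small helper. This is only a sketch of the rule as written, not a tool the team used; the function names are hypothetical, and file extensions are left out because the convention does not say where they go.

```python
from datetime import date


def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, adding (n) for extra versions."""
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        # Additional same-day versions get (1), (2), ... appended.
        name += f"({version})"
    return name


def originals_archive_name(initials, week):
    """Name of the compressed zip holding all original, unmodified files."""
    return f"ORIG_GEN_{initials}{week}"


# Three versions of the same file from the same (example) day in Week 11.
day = date(2015, 11, 12)
for version in range(3):
    print(project_filename("XXXX", "BL", 11, day, version))
print(originals_archive_name("BL", 11))
```

The first version of the day carries no suffix, so only the second and later versions receive a parenthesized number.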
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized genMAPP builder
* Determine, with Anu, what modifications must be done to GenMAPP builder/Tallyengine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Perform testing report on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized Tally Engine to collect counts for ORF ID's.
* Finished fourth build.
|
|
''Work with Kevin Wyllie''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|
''Work with Veronica Pacheco''
*Performed the statistical analysis in Excel.
*Formatted the gene expression data for import into GenMAPP.
*Imported data into GenMAPP, created ColorSets, and ran MAPPFinder.
*Documented and took screenshots on test runs with GenMAPP.
*Sent the EX.txt file to Brandon and also uploaded on wiki
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
| see Kevin's section -->
| [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization, see the list of) [http://oaspa.org/membership/members/ (Open Access Scholarly Publishers Association Members) here.] '''American Society for Microbiology which is a scientific society'''
* Is this article available in print or online only? '''It is both available in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first looked at this article through Web of Science, which LMU does pay for, but accessing the article through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? Yes.
* Do the authors own the copyright? No.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

GENialOMICS

2015-12-07T04:40:42Z

Vpachec3: /* Deliverables */ Added list of deliverables we need

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline were prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made chart for microarray experiment
* Finished PowerPoint on microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, then a positive integer in parentheses (starting from 1) will be added to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2), for three different versions of the same file uploaded by Brandon Litvak during Week 11)
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
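
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the convention as written; the function names <code>gen_filename</code> and <code>orig_zip_name</code> are hypothetical and were not part of the group's workflow.

```python
from datetime import date

def gen_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd for a renamed project file.

    version=0 is the first upload of the day; later versions get (1), (2), ...
    """
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name

def orig_zip_name(initials, week):
    """Name of the compressed archive holding all original, unmodified files."""
    return f"ORIG_GEN_{initials}{week}"

# Hypothetical example: Brandon Litvak (BL), Week 11, uploaded on 12 November 2015.
print(gen_filename("XXXX", "BL", 11, date(2015, 11, 12)))
print(gen_filename("XXXX", "BL", 11, date(2015, 11, 12), version=2))
print(orig_zip_name("BL", 11))
```

The file extension is left off here because the convention above does not say where it goes.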
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]] 
[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing the prototype version of GenMAPP Builder is on the development machine.
** Set up Eclipse and Java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
* Perform database export on the second build and other builds of the customized GenMAPP Builder
* Determine, with Anu, what modifications must be made to GenMAPP Builder/TallyEngine
* Find out why most of the data/gene-names were not captured in the "OrderedLocusNames" table of the PSQL database for Export 1
* Perform testing report on any builds created for the week.
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
* Analyzed results from previous build with Brandon Litvak and determined modifications that need to be made to code.
* Began modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Finished third build.
* Customized TallyEngine to collect counts for ORF IDs.
* Finished fourth build.
|
|
|
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST) 
 
Build 3 - picking up gene names from ORF instead of ordered locus
* [[Media:Gmbuilder-genialomics-12032015-build-3.zip | gmbuilder-genialomics-12032015-build-3.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 14:58, 3 December 2015 (PST) 
 
Build 4 - customized TallyEngine
* [[Media:Gmbuilder-genialomics-12032015-build-4.zip|gmbuilder-genialomics-12032015-build-4.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:18, 3 December 2015 (PST)
|
*[[Media:Bc-Std_GEN_20151204.zip|Compressed gdb from Build 4]]
|
| [[media:For_genMAPP_KWVP20151205.EX.txt|Exceptions file from first export.]]
|}

==Other Progress==

=Deliverables=
=== Group Deliverables ===

* GenMAPP Gene Database for assigned species (''.gdb'')
* ReadMe file to accompany the Gene Database (''.pdf'')
** Sample ReadMe in Word format: [[Media:ReadMe_Vc-Std_External_20131122.zip | ReadMe_Vc-Std_External_20131122.zip]]
** [https://github.com/lmu-bioinformatics/xmlpipedb/blob/readme/GenMAPP%20Gene%20Databases/V.%20cholerae/V.%20cholerae%2020101022/ReadMe.md Sample ReadMe in markdown (a work in progress)]
** Include Gene Database Schema diagram in ReadMe
*** Sample schema in Adobe Illustrator format: [[Media:Vibrio_schema_20101022.zip | Vibrio_schema_20101022.zip]]
* Gene Database Testing Report for final submitted Gene Database (print from wiki to ''.pdf'' file)
* Processed and analyzed DNA microarray dataset (''.xls'' or ''.xlsx'')
* Data file used for import into GenMAPP (''.txt'' or ''.csv'')
* GenMAPP Expression Dataset file (''.gex'')
* Exceptions file of data imported into GenMAPP (''.EX.txt'')
* Raw MAPPFinder results files (''-GO.txt'')
* ''.gmf'' file
* Filtered MAPPFinder Results (''.xls'' or ''.xlsx'')
* Sample MAPP file of a relevant biological pathway for your species (''.mapp'')
* [[Gene Database Project Deliverables#Group Report | Group Report]] describing the creation of the Gene Database and the biological analysis of the data (''.doc'', ''.docx'', or ''.pdf'')
* PowerPoint presentation (''.ppt'', ''.pptx'', or ''.pdf'', given on Tuesday, December 15)

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients . Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it (commercial, for-profit publisher; scientific society; respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral]; or predatory open access organization)? See the list of [http://oaspa.org/membership/members/ Open Access Scholarly Publishers Association Members]. '''American Society for Microbiology, which is a scientific society.'''
* Is this article available in print or online only? '''It is available both in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first accessed this article through Web of Science, which LMU pays for, but access through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? Yes.
* Do the authors own the copyright? No.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where does MicroArray Data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment hoped to test whether persister cells are present in Burkholderia cepacia complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. Burkholderia cenocepacia biofilms were treated with 1024 µg/ml of tobramycin in the treatment group. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Vpachec3 Week 14

2015-12-05T20:08:13Z

Vpachec3: /* Sanity Check: Number of genes significantly changed */ deleted last part of second sentence

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on what was said, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These are considered technical replicates, and they should be averaged before doing any further analysis.
*Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like:

 =AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.
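
The averaging and log-ratio steps above can be sketched outside of Excel as follows. This is a minimal illustration under stated assumptions: the numbers are made up for a single gene (they are not from our dataset), the function names are hypothetical, and the t test itself is left to Excel's TTEST as described above.

```python
from statistics import mean

def average_technical_replicates(spots):
    """Average the replicate spots for one gene on one chip, like =AVERAGE(C2,M2,W2,AG2)."""
    return mean(spots)

def log_ratio(biofilm, tobramycin):
    """Tobramycin over biofilm in log space: subtract the biofilm mean from the tobramycin mean."""
    return mean(tobramycin) - mean(biofilm)

# Hypothetical log-scale values for one gene (not taken from the dataset).
spots_on_one_chip = [1.0, 1.5, 0.5, 1.0]           # four replicate spots on one chip
chip_value = average_technical_replicates(spots_on_one_chip)

biofilm_chips = [chip_value, 1.5, 0.5, 1.0, 1.0]    # 5 biofilm (control) samples
tobramycin_chips = [1.5, 1.25, 1.75]                # 3 tobramycin (treatment) samples

ratio = log_ratio(biofilm_chips, tobramycin_chips)
print(f"log ratio (tobramycin - biofilm) = {ratio:.2f}")
```

The two groups are averaged separately and only then subtracted, which matches the note that biofilm and tobramycin samples should never be averaged together.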

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%
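
As a cross-check on the percentages above, the same counting can be sketched in Python. The counts and the total of 7251 are the ones reported in this section; the helper names are hypothetical, and the short p value list at the end is invented just to show how the filter would count.

```python
TOTAL_GENES = 7251  # total number of genes used for the percentages in this section

# Gene counts reported above for each unadjusted p value cut-off.
reported_counts = {0.05: 4318, 0.01: 2971, 0.001: 1460, 0.0001: 645}

def percent_of_total(count, total=TOTAL_GENES):
    """Express a gene count as a percentage of all genes."""
    return 100 * count / total

def count_below(pvalues, cutoff):
    """Count p values strictly below the cut-off, like the Excel custom filter."""
    return sum(1 for p in pvalues if p < cutoff)

for cutoff, count in reported_counts.items():
    print(f"p < {cutoff}: {count} genes ({percent_of_total(count):.0f}%)")

# Made-up p values, only to show the filter in action.
example_pvalues = [0.2, 0.04, 0.009, 0.05]
print(count_below(example_pvalues, 0.05))
```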

* When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, we would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.4%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
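
To show why the Bonferroni correction is more stringent than the Benjamini and Hochberg correction, here is a minimal sketch of both in Python. It uses the standard textbook forms (Bonferroni multiplies each p value by the number of tests; Benjamini-Hochberg scales by number of tests over rank, then enforces monotonicity), which may differ in small details from the spreadsheet formulas we used. The example p values are made up.

```python
def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capping at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Standard step-up Benjamini-Hochberg adjusted p values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values never increase with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up unadjusted p values for four genes.
example = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(example))
print(benjamini_hochberg(example))
```

For every gene the Benjamini-Hochberg value is less than or equal to the Bonferroni value, which is why more genes pass the same 0.05 cut-off after the Benjamini-Hochberg correction.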

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

* The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%

** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%

** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%

** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic values for the fold change cut-offs because they represent about a 20% fold change, which is about the level of detection of this technology.)
***1519 genes which is 21%
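
The "about 20%" figure for a log fold change of 0.25 can be checked with a short calculation. This sketch assumes the log values are base 2, which is not stated explicitly in this section; the function names are hypothetical.

```python
def percent_change(log2_fold_change):
    """Convert a log2 fold change into a percent change relative to the control."""
    return (2 ** log2_fold_change - 1) * 100

def passes_cutoffs(log2_fold_change, pvalue, fc_cutoff=0.25, p_cutoff=0.05):
    """True when a gene is beyond the fold change cut-off in either direction and p < p_cutoff."""
    return abs(log2_fold_change) > fc_cutoff and pvalue < p_cutoff

print(f"+0.25 -> {percent_change(0.25):.1f}% change")
print(f"-0.25 -> {percent_change(-0.25):.1f}% change")
```

Under the base 2 assumption, an increase of 0.25 is a little under a 20% increase, while a decrease of 0.25 is a somewhat smaller percent decrease, since log ratios are symmetric but percent changes are not.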

===Sanity Check: Compare individual genes with known data===

* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''

'''VC0028'''

Fold Change:1.65, 1.27

P-Value: 0.0474, 0.0692

Significance: statistically significant, not statistically significant

'''VC0941'''

Fold Change:0.09, -0.28

P-Value: 0.6759, 0.1636

Significance:not statistically significant, not statistically significant

'''VC0869'''

Fold Change :1.59, 1.95, 2.20, 1.50, 2.12

P-Value:0.0463,0.0227,0.0020,0.0174,0.0200

Significance:significant,significant,significant,significant,significant

'''VC0051'''

Fold Change:1.92, 1.89

P-Value:0.0139,0.0160

Significance:statistically significant,statistically significant

'''VC0468'''

Fold Change: -0.17

P-Value: 0.3350

Significance: not statistically significant

'''VC2350'''

Fold Change: -2.40

P-Value: 0.0130

Significance: statistically significant

'''VCA0583'''

Fold Change: 1.06

P-Value: 0.1011

Significance: not statistically significant

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-05T20:06:27Z

Vpachec3: /* Sanity Check: Number of genes significantly changed */ Answered the remain questions in this section

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on what was said, we used the file Raw_compiled_data_KD20151124.xls to continue the process. This was because Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These are considered technical replicates, and they should be averaged before doing any further analysis.
*Thus, we took the average of the 4 technical replicates with the AVERAGE function in Excel. We selected the gene for each replicate individually. The function looked like:

 =AVERAGE(C2,M2,W2,AG2)

We then clicked on the small black square at the bottom of the cell to have the function repeat and adjust for the remaining cells.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results of Merrell et al. (2002).

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that, by chance alone, we would have seen a gene expression change that deviates this far from zero less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
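
To make the difference between the two corrections concrete, here is a minimal sketch of both, assuming the textbook definitions (Bonferroni multiplies each p value by the number of tests; Benjamini-Hochberg multiplies by the number of tests divided by the rank, then enforces monotonicity). The spreadsheet formulas we used may differ in detail, and all function names and example values below are hypothetical.

```python
def expected_by_chance(num_tests, cutoff):
    """Number of tests expected to pass the cutoff by chance alone."""
    return num_tests * cutoff


def bonferroni(p_values):
    """Bonferroni-adjusted p values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (standard step-up form)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger raw p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values (an assumption, not our data).
example_p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(example_p))
print(benjamini_hochberg(example_p))
# Using the 7251 genes named in the questions above:
print(expected_by_chance(7251, 0.05))
```

For any input, each Benjamini-Hochberg value is no larger than the matching Bonferroni value, which is why fewer genes pass the Bonferroni filter.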

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

* The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
***3279 genes which is 45%
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
***3127 genes which is 43%
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
***1613 genes which is 22%
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic fold change cut-offs because they represent about a 20% change, which is about the level of detection of this technology.)
***1519 genes which is 21%
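
The "about 20%" figure for a cut-off of 0.25 can be checked with a short calculation. This is a sketch that assumes the log fold changes are base-2 logarithms (the base is not stated on this page); the function name is hypothetical.

```python
def log2_to_fold(log2_ratio):
    """Convert a log2 ratio into a linear fold change (assumes base 2)."""
    return 2.0 ** log2_ratio


# A log2 ratio of +0.25 is roughly a 19% increase ...
print(round(log2_to_fold(0.25), 3))
# ... and -0.25 is roughly a 16% decrease.
print(round(log2_to_fold(-0.25), 3))
```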

===Sanity Check: Compare individual genes with known data===

* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''

'''VC0028'''

Fold Change: 1.65, 1.27

P-Value: 0.0474, 0.0692

Significance: statistically significant, not statistically significant

'''VC0941'''

Fold Change: 0.09, -0.28

P-Value: 0.6759, 0.1636

Significance: not statistically significant, not statistically significant

'''VC0869'''

Fold Change: 1.59, 1.95, 2.20, 1.50, 2.12

P-Value: 0.0463, 0.0227, 0.0020, 0.0174, 0.0200

Significance: statistically significant, statistically significant, statistically significant, statistically significant, statistically significant

'''VC0051'''

Fold Change: 1.92, 1.89

P-Value: 0.0139, 0.0160

Significance: statistically significant, statistically significant

'''VC0468'''

Fold Change: -0.17

P-Value: 0.3350

Significance: not statistically significant

'''VC2350'''

Fold Change: -2.40

P-Value: 0.0130

Significance: statistically significant

'''VCA0583'''

Fold Change: 1.06

P-Value: 0.1011

Significance: not statistically significant

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-03T23:50:25Z

Vpachec3: /* Sanity Check: Number of genes significantly changed */ syntax

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
*Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like this:

=AVERAGE(C2,M2,W2,AG2)

We then dragged the fill handle (the small black square at the bottom-right corner of the cell) down the column so that the formula repeated and adjusted for the remaining rows.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.
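
The averaging and log-ratio steps above can be sketched in a few lines. This is only an illustration under stated assumptions: each sample is represented as a list of its four technical-replicate log values, the numbers are made up, and the function names are hypothetical. The p value itself is left to Excel's TTEST, as described above.

```python
from statistics import mean


def sample_average(replicate_spots):
    """Average the technical replicate spots for one gene in one sample."""
    return mean(replicate_spots)


def log_ratio(biofilm_samples, tobramycin_samples):
    """Tobramycin-over-biofilm ratio in log space (a difference of means).

    Each argument is a list of samples; each sample is a list of the
    technical replicate log values for a single gene.
    """
    biofilm_avg = mean(sample_average(s) for s in biofilm_samples)
    tobramycin_avg = mean(sample_average(s) for s in tobramycin_samples)
    # Subtracting logs is the same as dividing the underlying values.
    return tobramycin_avg - biofilm_avg


# Made-up values for one gene: 5 biofilm samples, 3 tobramycin samples.
biofilm = [[1.0] * 4, [2.0] * 4, [3.0] * 4, [2.0] * 4, [2.0] * 4]
tobramycin = [[3.0] * 4, [2.5, 2.5, 3.5, 3.5], [3.0] * 4]
print(log_ratio(biofilm, tobramycin))
```

Biofilm and tobramycin are averaged separately because they are separate treatments; only the final subtraction combines them.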

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results of Merrell et al. (2002).

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

* The "biofilm_tobramycin_ratio" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic fold change cut-offs because they represent about a 20% change, which is about the level of detection of this technology.)

===Sanity Check: Compare individual genes with known data===

* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''

'''VC0028'''

Fold Change: 1.65, 1.27

P-Value: 0.0474, 0.0692

Significance: statistically significant, not statistically significant

'''VC0941'''

Fold Change: 0.09, -0.28

P-Value: 0.6759, 0.1636

Significance: not statistically significant, not statistically significant

'''VC0869'''

Fold Change: 1.59, 1.95, 2.20, 1.50, 2.12

P-Value: 0.0463, 0.0227, 0.0020, 0.0174, 0.0200

Significance: statistically significant, statistically significant, statistically significant, statistically significant, statistically significant

'''VC0051'''

Fold Change: 1.92, 1.89

P-Value: 0.0139, 0.0160

Significance: statistically significant, statistically significant

'''VC0468'''

Fold Change: -0.17

P-Value: 0.3350

Significance: not statistically significant

'''VC2350'''

Fold Change: -2.40

P-Value: 0.0130

Significance: statistically significant

'''VCA0583'''

Fold Change: 1.06

P-Value: 0.1011

Significance: not statistically significant

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-03T23:48:14Z

Vpachec3: /* Sanity Check: Number of genes significantly changed */ started adding values

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
*Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like this:

=AVERAGE(C2,M2,W2,AG2)

We then dragged the fill handle (the small black square at the bottom-right corner of the cell) down the column so that the formula repeated and adjusted for the remaining rows.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results of Merrell et al. (2002).

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 7251)?'''''
***4318 genes which is 60%
** '''''What about p < 0.01? and what is the percentage (out of 7251)?'''''
***2971 genes which is 41%
** '''''What about p < 0.001? and what is the percentage (out of 7251)?'''''
***1460 genes which is 20%
** '''''What about p < 0.0001? and what is the percentage (out of 7251)?'''''
***645 genes which is 9%

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 7251 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our T tests, or about 363 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 363 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections on these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 7251)?'''''
***179 genes which is 2.5%
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 7251)?'''''
***605 genes which is 8.3%
0 (0%) genes have a Bonferroni-corrected p value < 0.05

0 (0%) genes have a Benjamini and Hochberg-corrected p value < 0.05

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.

* The "Avg_LogFC_all" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''

** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "biofilm_tobramycin_ratio" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''

** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''

** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic fold change cut-offs because they represent about a 20% change, which is about the level of detection of this technology.)

===Sanity Check: Compare individual genes with known data===

* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''

'''VC0028'''

Fold Change: 1.65, 1.27

P-Value: 0.0474, 0.0692

Significance: statistically significant, not statistically significant

'''VC0941'''

Fold Change: 0.09, -0.28

P-Value: 0.6759, 0.1636

Significance: not statistically significant, not statistically significant

'''VC0869'''

Fold Change: 1.59, 1.95, 2.20, 1.50, 2.12

P-Value: 0.0463, 0.0227, 0.0020, 0.0174, 0.0200

Significance: statistically significant, statistically significant, statistically significant, statistically significant, statistically significant

'''VC0051'''

Fold Change: 1.92, 1.89

P-Value: 0.0139, 0.0160

Significance: statistically significant, statistically significant

'''VC0468'''

Fold Change: -0.17

P-Value: 0.3350

Significance: not statistically significant

'''VC2350'''

Fold Change: -2.40

P-Value: 0.0130

Significance: statistically significant

'''VCA0583'''

Fold Change: 1.06

P-Value: 0.1011

Significance: not statistically significant

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-03T23:30:05Z

Vpachec3: /* Thursday,December 3 */ added snaity checks

== Thursday, December 3 ==
*Kevin and I looked at Dr. Dahlquist's feedback on our spreadsheet so far.
*Based on that feedback, we used the file Raw_compiled_data_KD20151124.xls to continue the process. Dr. Dahlquist pointed out that our chip has each gene spotted in quadruplicate. These spots are technical replicates and should be averaged before any further analysis.
*Thus, we averaged the 4 technical replicates with the AVERAGE function in Excel, selecting the cell for each replicate individually. The function looked like this:

=AVERAGE(C2,M2,W2,AG2)

We then dragged the fill handle (the small black square at the bottom-right corner of the cell) down the column so that the formula repeated and adjusted for the remaining rows.

**Then you need to average the averages for biofilm and for tobramycin. (It doesn't make sense to average biofilm and tobramycin together since they are separate treatments).
**Because your reference sample is genomic DNA and not RNA, you need to then take the ratio of the averages for the biofilm and tobramycin samples to get the ratio of tobramycin to control (tobramycin over biofilm). Because the numbers are in log space, you will subtract the biofilm average from the tobramycin average to get this number.
**You will conduct a two-sample t test comparing the 5 biofilm samples to the 3 tobramycin samples using the TTEST function in Excel, not the equation we did for Vibrio. It will directly compute the p value.
**Then you can compute the Bonferroni and Benjamini and Hochberg corrected p values like you did in the Vibrio exercise.

===Sanity Check: Number of genes significantly changed===

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results of Merrell et al. (2002).

* Open your spreadsheet and go to the "forGenMAPP" tab.
* Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
* Click on the drop-down arrow on your "Pvalue" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
** '''''How many genes have p value < 0.05? and what is the percentage (out of 5221)?'''''
** '''''What about p < 0.01? and what is the percentage (out of 5221)?'''''
** '''''What about p < 0.001? and what is the percentage (out of 5221)?'''''
** '''''What about p < 0.0001? and what is the percentage (out of 5221)?'''''

948 (18.2%) genes have a p value < 0.05

235 (4.5%) genes have a p value < 0.01

24 (0.5%) genes have a p value < 0.001

0 (0%) genes have a p value < 0.0001

* When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
* We have just performed 5221 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 261 times. (Test your understanding: [http://xkcd.com/882/ http://xkcd.com/882/].) Since we have more than 261 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know ''which'' ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
** '''''How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 5221)?'''''
** '''''How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 5221)?'''''

0 (0%) genes have a Bonferroni-corrected p value < 0.05

0 (0%) genes have a Benjamini and Hochberg-corrected p value < 0.05

* In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
* The "Avg_LogFC_all" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. '''''How many are there? (and %)'''''
** Keeping the (unadjusted) "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. '''''How many are there? (and %)'''''
** '''''What about an average log fold change of > 0.25 and p < 0.05? (and %)'''''
** '''''Or an average log fold change of < -0.25 and p < 0.05? (and %)''''' (These are more realistic fold change cut-offs because they represent about a 20% change, which is about the level of detection of this technology.)

352 (6.7%) genes have a positive average log fold change

596 (11.4%) genes have a negative average log fold change

339 (6.4%) genes have a significant positive average log fold change

578 (11.1%) genes have a significant negative average log fold change

===Sanity Check: Compare individual genes with known data===

* Merrell et al. (2002) report that genes with IDs: VC0028, VC0941, VC0869, VC0051, VC0647, VC0468, VC2350, and VCA0583 were all significantly changed in their data. Look these genes up in your spreadsheet. '''''What are their fold changes and p values? Are they significantly changed in our analysis?'''''

'''VC0028'''

Fold Change: 1.65, 1.27

P-Value: 0.0474, 0.0692

Significance: statistically significant, not statistically significant

'''VC0941'''

Fold Change: 0.09, -0.28

P-Value: 0.6759, 0.1636

Significance: not statistically significant, not statistically significant

'''VC0869'''

Fold Change: 1.59, 1.95, 2.20, 1.50, 2.12

P-Value: 0.0463, 0.0227, 0.0020, 0.0174, 0.0200

Significance: statistically significant, statistically significant, statistically significant, statistically significant, statistically significant

'''VC0051'''

Fold Change: 1.92, 1.89

P-Value: 0.0139, 0.0160

Significance: statistically significant, statistically significant

'''VC0468'''

Fold Change: -0.17

P-Value: 0.3350

Significance: not statistically significant

'''VC2350'''

Fold Change: -2.40

P-Value: 0.0130

Significance: statistically significant

'''VCA0583'''

Fold Change: 1.06

P-Value: 0.1011

Significance: not statistically significant

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 14

2015-12-03T23:26:09Z

Vpachec3: working on electronic notebook

GENialOMICS

2015-12-03T22:37:05Z

Vpachec3: /* Individual Goals and Progress */ updated goals for this week for genmapp users

[[Image: Genialomics-banner.jpg | center | 1055px]]
 
{{Template:GÉNialOMICS}}
 
=Week 10=
*On November 3, 2015, GÉNialOMICS chose to create a database for ''Burkholderia cenocepacia''.
*On November 5, 2015, a journal article detailing the complete genome of ''Burkholderia cenocepacia'' (strain J2315), titled [http://jb.asm.org/content/191/1/261.short The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients], was found.

 
=Week 11=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete journal club individual assignment
* Create and practice journal club presentation with Brandon
* Find a MOD
* Create project timeline with soft deadlines for each person/milestone
* Complete milestone 0: Working Environment Setup
* Complete milestone 1: Version Control Setup
* Begin milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Reformat Home Page with Dr. Dahlquist's recommendations
|
*Perform an initial import/export cycle (with Anu)
*Figure out a file management system (with Anu)
*Characterize regular expression patterns for ID detection
*Further explore the found MOD and review it
*Complete Journal Club presentation on the genome paper
|
''Work with Kevin Wyllie''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|
''Work with Veronica Pacheco''
* Understand experimental design
* Understand sample-data relationship
** raw.zip and .sdrf
** Construct sample-data diagram
* Develop compiled raw data file
|-
!scope="row"|'''Progress'''
|
*Found a possible [http://www.burkholderia.com/ model organism database] (with Brandon Litvak)
*The journal club presentation and outline was prepared; MOD was examined and reviewed
* Completed journal club individual assignment
* Created and practiced journal club presentation with Brandon
* Found a MOD
* Created project timeline with soft deadlines for each person/milestone
* Completed milestone 0: Working Environment Setup
* Reformatted Home Page with Dr. Dahlquist's recommendations
* Journal Club Presentation File: [[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf|Genome Presentation]]
|
* Found a possible [http://www.burkholderia.com/ model organism database].
**The more recent [http://beta.burkholderia.com/ version] of the MOD was found
*The regular expression patterns for J2315 were determined
*Preparation was done for the Genome Paper Presentation
**Completed outline of the genome paper
*File management system was determined
|
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished PowerPoint on the microarray paper
*[[Vpachec3 Week 11]]
| Created methods diagram ([[media:KWVPMethoddiagram.jpg]]).
*[[File:B. Cenopacia.pptx]]
* Understood the experimental design
* Made a chart for the microarray experiment
* Finished PowerPoint on the microarray paper

|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 11]]
|
*[[Blitvak Week 11]]
|
*[[Vpachec3 Week 11]]
|
*[[Kevin Wyllie Week 11]]
|}

==File Management System==
*Files utilized in weekly projects will be renamed as follows: XXXX_GEN_(Initials)(Week#)_yyyymmdd, where "XXXX" is the original filename. If multiple versions of the same file (with identical filenames) are used on the same day, a positive integer in parentheses, starting from 1, will be appended to each additional version (e.g. XXXX_GEN_BL11_yyyymmdd, XXXX_GEN_BL11_yyyymmdd(1), XXXX_GEN_BL11_yyyymmdd(2) for three different versions of the same file uploaded by Brandon Litvak during Week 11).
*Files will be uploaded to the weekly progress table under a file row with a clear label (under the respective group member that created them/used them)
*All original unmodified files will be saved and will also be uploaded, together, as a compressed zip with the filename: ORIG_GEN_(Initials)(Week#); the compressed zip containing all original files will be the last entry in the row designated for the files, with the label "ORIGINAL FILES"
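
The naming rules above can be expressed as a small helper. This is a minimal sketch, not a tool the team used: the function names are hypothetical, and how a file extension should be handled is an assumption left to the caller (here "XXXX" is passed without an extension).

```python
from datetime import date


def project_filename(original, initials, week, day, version=0):
    """Build XXXX_GEN_(Initials)(Week#)_yyyymmdd, with an optional (n) suffix.

    version=0 is the first file of the day; later versions get (1), (2), ...
    """
    name = f"{original}_GEN_{initials}{week}_{day:%Y%m%d}"
    if version > 0:
        name += f"({version})"
    return name


def original_archive_name(initials, week):
    """Name of the compressed zip holding the unmodified original files."""
    return f"ORIG_GEN_{initials}{week}"


# Hypothetical example: Brandon Litvak, Week 11, two versions on one day.
day = date(2015, 11, 12)
print(project_filename("XXXX", "BL", 11, day))
print(project_filename("XXXX", "BL", 11, day, version=1))
print(original_archive_name("BL", 11))
```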
==Other Progress==
* A journal article using the ''B. cenocepacia'' genome in a microarray experiment, titled [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/ Biofilm-Grown ''Burkholderia cepacia'' Complex Cells Survive Antibiotic Treatment by Avoiding Production of Reactive Oxygen Species], was approved for analysis.

==Journal Club Presentations==
*[[Media:Genome_Presentation_-_Anu_and_Brandon_-_Genialomics(FIXED).pdf| Genome Paper Presentation Week 11]]
*[[Media:B._Cenopacia.pptx| Microarray Paper Presentation Week 12]]
 

=Week 12=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Complete milestone 0: Working Environment Setup
** Set up the development machine (my laptop) with all required software for coding.
* Complete milestone 1: Version Control Setup
** Set up a branch specific to our project and clone necessary code from GitHub onto the development machine.
* Complete milestone 2: “Developer Rig” Setup and Initial As-Is Build
** Confirm that all core software for developing, building, and testing prototype version of GenMAPP Builder are on the development machine.
** Set up Eclipse and java project workspace.
** Run initial build.
|
*Complete Milestone 1: Initial Database Export
*Create a Gene Database testing report for the initial export
*Further explore the various ID systems; verify previous findings
*Create expressions for Match/PGSQL that will assist in evaluating the quality of any exported databases
|
''Work with Kevin Wyllie''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|
''Work with Veronica Pacheco''
*Read the microarray paper to understand the experiment.
*Create a table or list that shows the correspondence between the samples in the experiment and the files you have downloaded.
*Determine how many biological or technical replicates, and which samples were labeled with Cy3 or Cy5.
*Create a Master Raw Data file that contains the IDs and columns of data required for further analysis.
*Consult with Dr. Dahlquist on how to process the data (normalization, statistics).
|-
!scope="row"|'''Progress'''
|
* Completed milestone 0: Working Environment Setup
* Completed milestone 1: Version Control Setup
* Completed milestone 2: “Developer Rig” Setup and Initial As-Is Build
* Began milestone 3: Species Profile Creation
** Need to consult with Brandon and/or Dr. Dionisio before continuing.
|
*Completed Milestone 1: Initial Database Export
*Completed Milestone 2: ID Pattern Definition and Verification (will be revisited for future work with modified forms of GenMAPP Builder)
**Should talk to Anu about further steps involving TallyEngine/GenMAPP Builder (organize more exports)
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|
* Compiled data into one Excel spreadsheet.
* Centered data and began statistical analysis (stopped at T statistic).
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 12]]
|
*[[Blitvak Week 12]]
|
*[[Vpachec3 Week 12]]
|
*[[Kevin Wyllie Week 12]]
|-
!scope="row"|'''Files Used/Created'''
|
|
*[[Media:WEEK12FILES GEN BL.zip|Week 12 Files]]
*[[Media:ORIG GEN BL12.zip|Starting Files]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|
[[Media:Raw_compiled_data_KW20151119.xlsx]]
|}

==Other Progress==

=Week 14=
==Individual Goals and Progress==
{| class="wikitable" style="margin: auto;"
!colspan="9"|Weekly Goals and Progress
|-
!
![[User:Anuvarsh|'''Anu Varshneya''']]
![[User:Blitvak|'''Brandon Litvak''']]
![[User:Vpachec3|'''Veronica Pacheco''']]
![[User:Kwyllie|'''Kevin Wyllie''']]
|-
!scope="row"|'''Goals'''
|
* Finish second build.
* Analyze results from previous build with Brandon Litvak and determine modifications that need to be made to code.
* Begin modifying code to collect gene names from "ORF" instead of "OrderedLocusTags"
* Start writing README and scientific paper (parts of deliverables).
|
|
''Work with Kevin Wyllie''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|
''Work with Veronica Pacheco''
*Perform the statistical analysis in Excel.
*Format the gene expression data for import into GenMAPP.
*Import data into GenMAPP, create ColorSets, and run MAPPFinder.
*Document and take notes on test runs with GenMAPP.
*Use the EX.txt file to help the Coder/Quality Assurance team members to validate the .gdb.
*Do a journal club outline of the paper so that you can use it in the Discussion section of your group report and your final presentation.
*Create a .mapp file showing one pathway that is changed in your data.
|-
!scope="row"|'''Progress'''
|
* Finished second build.
|
|
|
|-
!scope="row"|'''Individual Journal Pages'''
|
*[[Anuvarsh Week 14]]
|
*[[Blitvak Week 14]]
|
*[[Vpachec3 Week 14]]
|
*[[Kevin Wyllie Week 14]]
|-
!scope="row"|'''Files Used/Created'''
|
Build 2 - with customized species profile:
* [[Media: Gmbuilder-genialomics-12012015-build-2.zip| gmbuilder-genialomics-12012015-build-2.zip]]
[[User:Anuvarsh|Anuvarsh]] ([[User talk:Anuvarsh|talk]]) 15:22, 1 December 2015 (PST)
|
|
|
|}

==Other Progress==

=''Burkholderia cenocepacia'' Genome Paper=
'''Holden, M. T. G., Seth-Smith, H. M. B., Crossman, L. C., Sebaihia, M., Bentley, S. D., Cerdeño-Tárraga, A. M., … Parkhill, J. (2009). The Genome of Burkholderia cenocepacia J2315, an Epidemic Pathogen of Cystic Fibrosis Patients. Journal of Bacteriology, 191(1), 261–277. http://doi.org/10.1128/JB.01230-08'''
* The link to the abstract from [http://www.ncbi.nlm.nih.gov/pubmed PubMed]. [http://www.ncbi.nlm.nih.gov/pubmed/18931103]
* The link to the full text of the article in [http://www.ncbi.nlm.nih.gov/pmc/ PubMedCentral]. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2612433/]
* The link to the full text of the article (HTML format) from the publisher web site. [http://jb.asm.org/content/191/1/261.long]
* The link to the full PDF version of the article from the publisher web site. [http://jb.asm.org/content/191/1/261.full.pdf+html]
* Who owns the rights to the article? '''American Society for Microbiology'''
** Does the journal own the copyright? '''Yes'''
** Do the authors own the copyright? '''No'''
** Do the authors own the rights under a [http://creativecommons.org/ Creative Commons] license? '''No'''
** Is the article available “Open Access”? '''Yes'''
* What organization is the publisher of the article? What type of organization is it? (Commercial, for-profit publisher, scientific society, respected open access organization like [http://www.plos.org/ Public Library of Science] or [http://www.plos.org/ BioMedCentral], or predatory open access organization; see the list of [http://oaspa.org/membership/members/ Open Access Scholarly Publishers Association Members].) '''American Society for Microbiology, which is a scientific society.'''
* Is this article available in print or online only? '''It is available both in print and online.'''
* Has LMU paid a subscription or other fee for your access to this article? '''I first accessed this article through Web of Science, which LMU pays for, but access through PubMed, PubMed Central, and the publisher website was free.'''
* How many articles does this article cite? '''It has 150 cited references.'''
* How many articles cite this article? '''It is cited 128 times.'''
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? '''A lot of the papers revolved around antibiotic resistance and therapeutic strategies.'''

 
=Microarray Paper=
'''Van Acker, H., Sass, A., Bazzini, S., De Roy, K., Udine, C., Messiaen, T., ... & Coenye, T. (2013). Biofilm-grown Burkholderia cepacia complex cells survive antibiotic treatment by avoiding production of reactive oxygen species. ''PLoS One'', 8(3), e58943.'''
* This article is suitable for your project. ''— [[User:Kdahlquist|Kdahlquist]] ([[User talk:Kdahlquist|talk]]) 10:17, 10 November 2015 (PST)''
* The link to the abstract from PubMed: http://www.ncbi.nlm.nih.gov/pubmed/?term=Biofilm-Grown+Burkholderia+cepacia+Complex+Cells+Survive+Antibiotic+Treatment+by+Avoiding+Production+of+Reactive+Oxygen+Species
* The link to the full text of the article in PubMedCentral: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3596321/
* The link to the full text of the article (HTML format) from the publisher web site: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058943
**Cannot find HTML format on publisher web site.
* The link to the full PDF version of the article from the publisher web site: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0058943&representation=PDF
* Who owns the rights to the article? Authors of the article: Heleen Van Acker, Andrea Sass, Silvia Bazzini, Karen De Roy, Claudia Udine, Thomas Messiaen, Giovanna Riccardi, Nico Boon, Hans J. Nelis, Eshwar Mahenthiralingam, Tom Coenye
* Does the journal own the copyright? No.
* Do the authors own the copyright? Yes.
* Do the authors own the rights under a Creative Commons license? Yes.
* Is the article available “Open Access”? Yes.
* What organization is the publisher of the article? What type of organization is it? Public Library of Science, Professional OA Publisher, Member of Open Access Scholarly Publishers Association
* Is this article available in print or online only? Available in print and online.
* Has LMU paid a subscription or other fee for your access to this article? No.
* Where do the microarray data reside? https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-3532/?keywords=&organism=Burkholderia+cenocepacia&exptype%5B%5D=%22rna+assay%22&exptype%5B%5D=%22array+assay%22&array=
* What experiment was performed? What was the "treatment" and what was the "control" in the experiment? The experiment tested whether persister cells are present in ''Burkholderia cepacia'' complex (Bcc) biofilms, what the molecular basis of antimicrobial tolerance in Bcc persisters is, and how persisters can be eradicated from Bcc biofilms. In the treatment group, ''Burkholderia cenocepacia'' biofilms were treated with 1024 µg/ml of tobramycin. The control group did not receive any tobramycin.
* Were replicate experiments of the "treatment" and "control" conditions conducted? Were these biological or technical replicates? How many of each? 2 technical replicates were made across 5 biological replicates for the control, and 2 technical replicates of 3 biological replicates of the treatments.
* How many articles does this article cite? This article has 34 cited references.
* How many articles cite this article? This article is cited 17 times in All Databases, and 17 times in Web of Science Core Collection.
* Based on the titles and abstracts of the papers, what type of research directions have been taken now that the genome for that organism has been sequenced? Most of the articles are related to antimicrobial therapy, tolerance, and resistance.

Vpachec3 Week 14

2015-12-01T23:10:48Z

Vpachec3:

*In Excel, use the TTEST function with type 3 (two-sample, unequal variance).
*Write down the number of exceptions in the .gdb.
==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]

[[Category: Journal Entry]]

Vpachec3 Week 12

2015-11-24T05:42:11Z

Vpachec3: /* Links */ added GÉNialOMICS links section

==Thursday, November 19==
*Met with Dr. Dahlquist before class to talk about the paper
**Took note of the microarray data versus the qPCR results. Some results did not agree, which we will keep in mind during our analysis.
*GenMAPP Users met with Dr. Dahlquist during class to go over progress
**Rough summary of experiment design:
**5 untreated biofilms vs. genomic DNA
**3 tobramycin-treated biofilms vs. genomic DNA
*Noticed that our SDRF file was entered incorrectly
*Need to ask authors what they did with the dye swaps
*Started computing master raw data
**For each of the 8 text files, we are copying columns M (gene name) and R (log ratio) into one data sheet
***Decided to place the gene name columns side by side for our own organization
***Made three sheets in the file: allgenename_logratio, genename_logratio, clean_genename_logratio

==Monday, November 23==
*Kevin and I met in the computer lab at 6:00pm.
*We worked on our presentation for tomorrow and we also worked on the master data sheet.
**We had some decent headway on Thursday so we picked it up from there.
*Credit is due to [https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/Kevin_Wyllie_Week_12 Kevin] for helping work through the process and for writing up the procedure.
==== Compiling Data ====

#Columns M ("GeneName") and R ("LogRatio") were copied and pasted into a new Excel file as stated in the Thursday section.
#*GeneName columns were pasted next to one another (columns A-H).
#*All LogRatio columns were pasted after the GeneName columns (columns I-P).
#*A header with the corresponding file name (125_2, for example) was added at the top of each column for our own organization.
# The GeneName columns were scanned for any discrepancies in the number of rows or the ordering of gene IDs. The LogRatio columns were scanned for discrepancies in the number of rows.
# This sheet was named "allgenename_logratio".
#New sheet called "genename_logratio"
#*Only one GeneName column was used for this sheet (column A), as all of the columns had been confirmed to be identical between files.
#*LogRatio data for each file was pasted into columns B-I, maintaining the previously mentioned file name headers. The data was pasted in the order of the file name numbers, from smallest to largest.
#* ''Note: Rows that do not contain genes from the ''B. cenocepacia'' genome must be removed. In theory, this is easy to do. Search results on burkholderia.com (set to ''Burkholderia cenocepacia'' J2315) for examples of the several gene ID formats suggest that the formats used are those that start with "BCAS," "BCAM," "BCAL," and "pBCA." However, we were not sure how to apply a filter in Excel that selects these gene IDs, primarily because two "Begins with" filters cannot be applied simultaneously (required to include the "pBCA" genes). Filtering for both groups separately yielded the same combined number of rows as filtering for "Contains: BC" (29004 rows; there appear to be quadruplets for every gene), so we will use this filter for now, although it may not be optimal.''
# New sheet called "compiled_raw_data".
#* All content from the "genename_logratio" sheet was pasted into the "compiled_raw_data" sheet.
#* Finally, a new row was inserted under the header row. This row was titled "ExpName". The purpose of this row is to indicate what kind of cells were used for the corresponding experiment. Cells B2-F2 contain "Biofilm" as these columns correspond to experiments using biofilm cells that were ''not'' treated with tobramycin (125_1, 125_2, 125_3, 125_4, 126_1). Cells G2-I2 contain "Tobramycin" as these columns correspond to experiments using biofilm cells that were treated with tobramycin (126_2, 126_3, 126_4).
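
The gene ID filtering described in the note above could also be done outside Excel. The following is a minimal Python sketch, not part of our actual procedure: it assumes the four prefixes listed in the note are the complete set, and the rows and gene IDs shown are made-up placeholders rather than values from our data files. The function names are hypothetical.

```python
# Prefixes taken from the note above; assumed (not verified) to be complete.
GENE_ID_PREFIXES = ("BCAL", "BCAM", "BCAS", "pBCA")


def is_b_cenocepacia_gene(gene_id: str) -> bool:
    """Return True if the ID begins with one of the expected prefixes."""
    # str.startswith accepts a tuple, so all prefixes are tested at once.
    return gene_id.startswith(GENE_ID_PREFIXES)


def filter_rows(rows):
    """Keep only (gene_id, log_ratio) rows whose gene ID passes the prefix test."""
    return [row for row in rows if is_b_cenocepacia_gene(row[0])]


# Placeholder rows standing in for the GeneName and LogRatio columns.
rows = [
    ("BCAL0001", 0.12),
    ("pBCA001", -0.40),
    ("ControlSpot", 0.00),
    ("BCAM1234", 1.05),
]
kept = filter_rows(rows)
```

Unlike a "Contains: BC" filter, a prefix test would not keep an ID that merely has "BC" somewhere in the middle, so it may be worth comparing the row counts from both approaches.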

==== Normalization ====

# New sheet called "scaled_centered". All data from "compiled_raw_data" was pasted into this new sheet.
# Above the gene names, two new rows were inserted. Cell A4 read "Average" and A5 read "StdDev". In cell B4, the following command was entered: <code>=AVERAGE(B6:B29009)</code>. In cell B5, the command was <code>=STDEV(B6:B29009)</code>. These formulas were then pasted into columns C-I using the drag feature (which adjusts for the column change), so that the averages and standard deviations of log ratios for each sample were calculated.
# All headers were then repeated, in the same order, to the right of the existing headers. Each ExpName was edited, adding "_scaled_centered" to the end of each heading.
# The scaled/centered log ratios were then computed.
#* The following command was entered into cell J6: <code>=(B6-B$4)/B$5</code>. This takes the first log ratio for sample Biofilm_1, subtracts the average log ratio, and divides by the standard deviation of the log ratios. This function was then pasted for the remaining data, using the drag feature. The placement of the <code>$</code>'s ensures that the cell references for the averages and standard deviations ''will'' change horizontally (to account for different averages/stdev's between samples), but ''will not'' change vertically.
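
The scaling and centering step above can be sketched in Python for a single sample column. This is an illustration only: the function name and the example values are hypothetical, and it assumes the sample standard deviation (n - 1 denominator), which is what Excel's <code>STDEV</code> computes.

```python
from statistics import mean, stdev


def scale_and_center(log_ratios):
    """Subtract the column average and divide by the column standard deviation."""
    avg = mean(log_ratios)
    sd = stdev(log_ratios)  # sample standard deviation (n - 1), like Excel STDEV
    return [(x - avg) / sd for x in log_ratios]


# Placeholder log ratios for one sample column (not real data).
sample = [0.5, -1.0, 2.0, 0.1]
scaled = scale_and_center(sample)
```

After this transformation each column has an average of 0 and a standard deviation of 1, which is a quick way to check that the spreadsheet formulas were dragged correctly.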

==== Statistical Analysis ====

# A new sheet was created, named "statistics."
# The gene ID column (column A) along with all previously created "_scaled_centered" data were pasted into the "statistics" sheet.
# The "Average" and "StdDev" rows were deleted.
# Also, the GeneName row (not column) and FileName row (not column) were deleted. These rows are no longer necessary. "GeneName" was entered into cell A1.
# In cell J1, "Avg_LogFC_Biofilm" was entered, and in cell K1, "Avg_LogFC_Tobramycin" was entered.
# In cell J2, the following command was entered: <code>=AVERAGE(B2:F2)</code>. This takes the average of the centered log ratios for the gene corresponding to that row, across all five biofilm samples. This formula was pasted down the entire column.
# Similarly, in cell K2, <code>=AVERAGE(G2:I2)</code>. This finds the same average, but across all three tobramycin treated biofilm samples. This formula was pasted down the entire column.
# The header "Avg_LogFC_All" was added to cell L1. In cell L2, the average of the two previously computed averages was found using <code>=AVERAGE(J2:K2)</code>. Again, this was pasted down the column.
# The next column (M) was named "Tstat".
# ''Note: We are not sure how to move on: computing the T statistic calls for the number of replicates, but the number of replicates differs between treated and untreated samples (3 and 5, respectively). We will stop here and consult the professors about how to continue with the statistical analysis.''
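
One common way to compare two groups with different numbers of replicates is Welch's unequal-variance t-test, which is what Excel's <code>TTEST</code> computes when its type argument is 3. The sketch below is an assumption about how the analysis could proceed, not the method we were told to use. The function name, the direction of the difference (biofilm minus tobramycin), and the example values are all hypothetical. Only the t statistic and the approximate degrees of freedom are computed; no p value is calculated here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (n - 1 denominator), each divided by its own group size.
    var_a = variance(group_a) / n_a
    var_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, dof


# Placeholder values for one gene: 5 untreated and 3 treated replicates.
biofilm = [1.0, 2.0, 3.0, 4.0, 5.0]
tobramycin = [2.0, 4.0, 6.0]
t_stat, dof = welch_t(biofilm, tobramycin)
```

Because each group's variance is divided by its own replicate count, the 5 versus 3 imbalance does not need any special handling.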

==Links==
[https://xmlpipedb.cs.lmu.edu/biodb/fall2015/index.php/User:Vpachec3 Vpachec3 User Page]
{{Template:Vpachec3 journal links}}

==GÉNialOMICS Links==
{{Template:GÉNialOMICS}}
[[Category: Journal Entry]]