Class Journal Week 8
Contents
Zachary Van Ysseldyk's Responses
- The main issue with the data and analysis identified was that the results were not reproducible; in fact, they were very far from reproducible. The researchers reused the same genes in statistical analyses, their indexing was off by one, and some of their verification tables (namely the 59-gene ovarian cancer model) matched none of what the cell lines were supposed to be. Based on his overall observations, Dr. Baggerly says the most common mistakes are simple ones, and that their simplicity is often hidden. Specifically, he found the most common mistakes to be mixing up sample labels, mixing up gene labels, mixing up group labels, and incomplete documentation. He notes that the most common mistake of all is complete confounding in the experimental design. Many of the points Dr. Baggerly raises have also come up when looking at DataONE: not all of the labels are clear, and the workflow is not easily reproducible.
- Baggerly first suggests that the data should be labeled so that one can clearly tell which data is which. The biggest thing he stresses, of course, is the reproducibility of the workflow; all of his suggestions basically point toward making the data reproducible. DataONE likewise stresses and strongly advocates proper data documentation as an essential practice.
- The main best practice we performed was making the data reproducible, as outlined on the individual work page. We were able to tailor the instructions to our specific gene so that the analyses could be easily reproduced, and we made sure that all of the genes were labeled. Including summaries and electronic workbooks gives the user an overview of the project, which lets them go into it with a clear objective.
- It seemed like the press and organizations didn't take him that seriously at first. Baggerly didn't seem too upset about it during his lecture, but I would be a little angry after going through that much work just to be brushed off. Although I am not looking to go into a biology-related workplace, his enthusiasm for the subject was inspiring. I also liked his quirkiness.
Zvanysse (talk) 19:40, 23 October 2017 (PDT)
BIOL/CMSI 367-01: Biological Databases Fall 2017
QLanners Responses
- There were a number of issues with the data and analysis identified by Baggerly and Coombs. Some of the main issues included a universal off-by-one indexing error brought about by poor attention to the software being used, an inaccurate use of secondary source data (their data labels seemed to be flipped from the published data labels), the use of duplicate data, and very poor documentation in general. The review panel even said that they could not figure out from the published data how to reproduce the work without some sort of outside help. A number of best practices enumerated by DataONE were broken, including a failure to maintain dataset provenance, a lack of documentation of assumptions, the absence of any reproducible workflow documentation, and very poor labeling. Dr. Baggerly claimed that several of these were common mistakes, most prominently the off-by-one indexing error and mixed-up labels. He also pointed out that it is poor documentation that often lets these easy mistakes go undetected.
- Dr. Baggerly recommends more thorough documentation of the data, namely through labels for all published data. He also recommends stricter requirements for data provenance and for code to be published along with the data. Overall, he stresses the need for research to be reproducible. This corresponds very closely with what DataONE recommends, as several of the best practices outlined above are essential for properly documenting data and data analysis and for ensuring that someone else can perform the exact same steps on the data in the future using just the documentation.
- In this week's assignment we performed a number of best practices. We made sure to provide distinct labels for all of our data points, gave our data files descriptive names, handled missing data appropriately, and kept documentation of how we performed our data analysis so that it could be reproduced in the future by somebody else.
- I was very surprised at all of the pushback that Dr. Baggerly received from the scientific journals when he shared the errors in the data. I would have thought that scientific journals would have been much more committed to ensuring that the papers that they had published were accurate and would have been more helpful to Dr. Baggerly and tough on the Duke research team. I think going forward a higher sense of accountability needs to be adopted in the scientific field to avoid scenarios like this.
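The off-by-one indexing error described above can be illustrated with a minimal sketch. The gene names and values here are hypothetical, not the actual Duke data; the point is only how treating one-based row numbers as zero-based shifts every label onto its neighbor's value:

```python
# Minimal sketch of an off-by-one indexing error misaligning gene
# labels with their expression values (hypothetical data).

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
values = [2.1, 0.4, 3.7, 1.2]  # values[i] belongs to genes[i]

# Correct pairing: zero-based indices line up.
correct = dict(zip(genes, values))

# Buggy pairing: treating the software's one-based row numbers as
# zero-based shifts every label by one position, so each gene
# silently carries its neighbor's value.
shifted = {genes[i]: values[i - 1] for i in range(1, len(genes))}

print(correct["GENE_B"])  # 0.4
print(shifted["GENE_B"])  # 2.1 -- GENE_A's value
```

Every downstream statistic still computes without error, which is why this kind of mistake stays hidden unless the labeling is documented well enough for someone else to check.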
Qlanners (talk) 17:28, 22 October 2017 (PDT)
Mary Balducci's Responses
- The main issue with the data and analysis was that the results were not reproducible. There were errors from mislabeled data as well as indexing errors. The data did not make sense, and the methods were not clear without outside help. Simple errors were also easy to miss due to poor documentation and recording of data. Dr. Baggerly claims that the most common mistakes are mixing up sample labels, mixing up gene labels, mixing up group labels, and incomplete documentation.
- Dr. Baggerly recommends labeling table columns, providing the code, and describing the steps taken, especially steps that are planned in advance. These are very similar to DataONE's recommendations of clear labeling and good documentation.
- Best practices that I performed for this week were labeling my data points with headers that explain exactly what each data point is. I also kept a detailed outline of every step I took to get my results.
- My reaction to this case after viewing the video is that I'm still shocked that this went on for so long. It seems like it was very obvious that the data was not reliable and yet it was allowed to go as far as clinical trials.
Mbalducc (talk) 20:44, 22 October 2017 (PDT)
Eddie Azinge's Responses
- The most prevalent issue was that Baggerly and Coombs weren't able to reproduce the results themselves, given a plethora of problems in the original data and analysis. Simple errors such as off-by-one indexing, the use of duplicate data, poor documentation, and mixed-up sample and gene labels all aggregated to create a dataset that was irreproducible by reasonable means. This was compounded by the fact that the lab was not always following best practices, specifically those set forth by DataONE.
- Dr. Baggerly recommends more rigorous and strict adherence to proper protocol, such as following conventions for labeling data, heavily documenting processes, and, most importantly, having a reproducible workflow. DataONE echoes most of these points, specifically emphasizing reproducibility of experiments and proper documentation.
- This week, we adhered to best practices by documenting our process as we followed the assignment, practicing proper labeling conventions, consistently dealing with missing data, and ensuring that our results were reproducible by other students.
- Learning more about this case makes me understand just how vast this field of biology is. This whole fiasco at Duke, if not properly taken care of, potentially stood to earn people a vast amount of money off of illegitimate practices and false hopes. It really emphasizes how important sticking to best practices is in order to prevent our analyses from causing harm to the society at large.
Cazinge (talk) 20:13, 23 October 2017 (PDT)
Katie Wright's Response
- Baggerly and Coombs were not able to reproduce the data they analyzed, even after poring over it and using every available means of "forensic bioinformatics." This was not because of one specific error, but because of a multitude of errors. Oftentimes the data was not labeled correctly, or software was used incorrectly in ways that led to mislabeled data (the +1 problem). The best practices violated were consistency in data labeling and documentation. Mislabeling and misdocumentation are among the most common errors in data analysis, and they wouldn't be such an enormous problem if labeling and documentation were just done properly in the first place.
- Dr. Baggerly and DataONE both recommend creating a "reproducible workflow." Your process and reasoning should be transparent and understandable so they can be evaluated and critiqued by others. Processes should also be automated wherever possible.
- For this week we:
- documented the entire procedure in minute detail (thanks to the procedure provided by the professors in the Week 8 assignment)
- formatted the Excel spreadsheet with no spaces between rows or columns, and created new worksheets often to provide a step-by-step look at how the dataset was analyzed and manipulated.
- I think this talk just made me more angry about the whole fiasco. There were multiple "disturbing" errors (as Dr. Baggerly called them) that were pointed out to journals time after time. It took so long for the scientific community to listen to the biostatisticians and take an in-depth look at the data. I think that Dr. Baggerly makes a very good suggestion when he says that every institution should have their own biostatisticians independently review and reproduce the data analysis for every experiment before it is published.
Kwrigh35 (talk) 14:28, 23 October 2017 (PDT)
Corinne Wong's Response
- The data was inconsistent, and the research was not reproducible. Baggerly and Coombs frequently found errors, discrepancies, or missing information in the data that were never fully corrected. The DataONE best practices violated concerned data consistency and the handling of missing data. Some of the common issues were standard input errors: mixing up sample labels, gene labels, and group labels.
- Dr. Baggerly recommends providing the data and code and having clear labels, which relates to DataONE's advice to keep data accessible and organized. His recommendations of clearly documenting corrections, assumptions, and errors also correspond to DataONE's recommendations.
- The best practices that we performed for this week’s assignment were consistent and organized data entry, and accessible and reproducible research. We had clear labels for our datasets, and they are on accessible Excel spreadsheets with clear and detailed steps.
- I still can’t believe how long it took for them to finally pull their research after all of the red flags that Baggerly and Coombs found. After finding the report of so many errors, you would think the scientific community would look into them, especially when the responses from Potti and Nevins were not clear and did not provide documentation.
Cwong34 (talk) 20:35, 23 October 2017 (PDT)
Emma Tyrnauer's Responses
- The main issue with the data and analysis identified by Baggerly and Coombs was that the research was not reproducible. In fact, the review panel admitted that they "were unable to identify a place where the statistical methods were described in sufficient detail to independently replicate the findings of the papers." Furthermore, statistical mistakes were made and propagated through mislabeled data (accidental switches and offsets). The DataONE best practices violated were keeping records of the experimental design to allow for reproducible research and easy identification of errors. The most common errors identified centered on accidental mislabeling of data.
- Dr. Baggerly recommends labeling columns and samples and providing code. These correspond to what DataONE recommends because they allow for the research to be easily reproduced by anyone.
- This week I made sure to correctly label all columns, and I kept an electronic notebook recording the process I used to analyze my dataset.
- This week's assignment, as well as learning about the Duke deception, made me realize how easy it is to propagate incorrect data and how important reproducibility is. If researchers took a little more time to follow Dr. Baggerly's recommendations, errors like this could be avoided or identified earlier.
Emmatyrnauer (talk) 20:48, 23 October 2017 (PDT)
Blair Hamilton's Responses
- What were the main issues with the data and analysis identified by Baggerly and Coombs? What best practices enumerated by DataONE were violated? Which of these did Dr. Baggerly claim were common issues?
- The biggest issue with the data was that they were unable to reproduce it. Much like our Week 8 assignment, if someone follows the exact same steps and does not get the same answers, clearly something is off. Baggerly and Coombs found errors, discrepancies, inconsistent formatting, and offset data. Among the best practices violated were improper labeling of data and an inadequately written procedure section. Dr. Baggerly claimed that the formatting documentation was incomplete, with genes being mislabeled and general data-input errors.
- What recommendations does Dr. Baggerly recommend for reproducible research? How do these correspond to what DataONE recommends?
- Dr. Baggerly recommended reusing templates to create consistency, labeling and summarizing data more thoroughly, describing data, literate programming, a clear reporting structure, and adding appendices. This is similar to DataONE, which says that consistency among data, procedure walkthroughs, documentation of assumptions, and compatibility are good practices for data management.
- What best practices did you perform for this week's assignment?
- This week we practiced keeping information in consistent locations. For example, when following the steps we made sure that each entry lined up and was easily read and followed. This is especially important for our yeast data because of the size of the dataset (6,189 genes); if the data isn't positioned correctly, it is harder to find, read, and sift through. We also practiced good labeling for columns: each header explains the type of data below as well as which data is being referenced. For example, when taking the average of t15, the average is labeled with t15 so a viewer can see it does not refer to any other time period.
- Do you have any further reaction to this case after viewing Dr. Baggerly's talk?
- I am still amazed at the amount of ignorance Duke showed in dealing with this case. Given that Baggerly and Coombs found so many discrepancies, it is dumbfounding that it took someone finding fault in the doctor's resume for the matter to be explored further. I am also fascinated by how Baggerly and Coombs were able to describe these incidents with the data and turn them into a teachable moment for future researchers. Although what happened at Duke is truly strange, it is a great example of how data should be organized so this never happens again.
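The column-labeling practice described above, where a derived average column carries the name of the time point it summarizes, can be sketched as follows. The gene names and replicate values are hypothetical stand-ins, not the actual yeast dataset:

```python
# Sketch of labeling a derived column after the time point it
# summarizes, so a reader cannot mistake it for another time period.
# Gene names and values are hypothetical.
import statistics

# Replicate measurements at t15 for each gene.
t15_replicates = {
    "YAL001C": [1.2, 1.4, 1.0],
    "YAL002W": [0.8, 0.9, 1.1],
}

# The derived column embeds the time point in its name ("t15_avg"),
# mirroring the header convention used in the spreadsheet.
table = {
    gene: {"t15_avg": statistics.mean(vals)}
    for gene, vals in t15_replicates.items()
}

print(table["YAL001C"]["t15_avg"])  # approximately 1.2
```

With 6,189 genes, a self-describing header like `t15_avg` is what lets a later reader audit the computation without consulting whoever built the sheet.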
Bhamilton18 (talk) 22:23, 23 October 2017 (PDT)
Aporras1 Response
- What were the main issues with the data and analysis identified by Baggerly and Coombs? What best practices enumerated by DataONE were violated? Which of these did Dr. Baggerly claim were common issues?
- What recommendations does Dr. Baggerly recommend for reproducible research? How do these correspond to what DataONE recommends?
- What best practices did you perform for this week's assignment?
- Do you have any further reaction to this case after viewing Dr. Baggerly's talk?
Simon Wroblewski's Reflections
- What were the main issues with the data and analysis identified by Baggerly and Coombs? What best practices enumerated by DataONE were violated? Which of these did Dr. Baggerly claim were common issues?
- The main issue identified by Baggerly and Coombs was that the data was not reproducible and therefore incapable of being verified. Many DataONE practices were violated, including mislabeling of data, indexing errors, and data manipulation.
- What recommendations does Dr. Baggerly recommend for reproducible research? How do these correspond to what DataONE recommends?
- Dr. Baggerly recommends a stricter, more regimented protocol, including (1) following conventions for labeling data, (2) heavy documentation of all processes, and (3) most importantly, having a reproducible workflow. DataONE reinforces these practices, especially with reference to reproducibility of all experiments coupled with proper documentation.
- What best practices did you perform for this week's assignment?
- Do you have any further reaction to this case after viewing Dr. Baggerly's talk?