Class Journal Week 8
Zachary Van Ysseldyk's Responses
- The main issue identified with the data and analysis was that the results were not reproducible; in fact, they were very far from reproducible. The authors reused the same genes in their statistical analysis, their indexing was off by one, and some of their verification tables (namely the 59-gene ovarian cancer model) matched 0% of what they were supposed to match. Based on his overall observations, Dr. Baggerly says that the most common mistakes are simple ones, and that this simplicity is often hidden. Specifically, the most common mistakes he found were mixing up sample labels, mixing up gene labels, mixing up group labels, and incomplete documentation. He notes that the most common mistake of all is complete confounding in the experimental design. Many of the points Dr. Baggerly raises also came up when we looked at DataONE. For one, not all of the labels are clear. Furthermore, the workflow is not easily reproducible.
- Baggerly first suggests that the data should be labeled so that it is clear which data is which. The biggest thing that he stresses, of course, is the reproducibility of the workflow. All of Baggerly's suggestions basically point toward making the data reproducible. DataONE also stresses and strongly advocates proper data documentation as an essential practice.
- The main best practice that we followed was making our analysis reproducible, as outlined on the individual work page. We were able to tailor the instructions to our specific gene so that the analyses could be easily reproduced. We also made sure that all of the genes were labeled. Including summaries and electronic workbooks gives the user an overview of the project, so they have a clear objective going into it.
- It seemed like the press and organizations didn't take him that seriously at first. Baggerly didn't seem too upset about it during his lecture, but I would be a little angry having gone through that much work just to be brushed off. Although I am not looking to go into a biology-related workplace, his enthusiasm about the subject was inspiring. I also liked his quirkiness.
Zvanysse (talk) 19:40, 23 October 2017 (PDT)
BIOL/CMSI 367-01: Biological Databases Fall 2017
QLanners Responses
- There were a number of issues with the data and analysis identified by Baggerly and Coombes. Some of the main issues included a universal off-by-one indexing error brought about by poor attention to the software being used, an inaccurate use of secondary source data (their data labels seemed to be flipped from the published data labels), the use of duplicate data, and very poor documentation in general. The review panel even said that they could not figure out from the published data how to reproduce the work without some sort of outside help. A number of best practices enumerated by DataONE were broken, including a failure to maintain dataset provenance, a lack of documentation of all assumptions, a lack of any form of reproducible workflow documentation, and very poor labeling techniques. Dr. Baggerly claimed that several of these were common mistakes, most prominently the off-by-one indexing error and the mixing-up of labels. Dr. Baggerly also pointed out that poor documentation is often what lets these easy mistakes go undetected.
- Dr. Baggerly recommends more thorough documentation of the data, namely through labels for all published data. He also recommends a stricter requirement for data provenance and for code to be published along with the data. Overall, Dr. Baggerly stresses the need for research to be reproducible. This corresponds very closely with what DataONE recommends, as several of the best practices (as outlined above) are essential for properly documenting data and data analysis and for ensuring that someone else can perform the exact same steps on the data in the future using just the documentation.
- In this week's assignment, we performed a number of best practices. We made sure to provide distinct labels for all of our data points, we gave our data files descriptive names, we appropriately handled missing data, and we kept documentation on how we performed our data analysis so that it could be reproduced in the future by somebody else.
- I was very surprised at all of the pushback that Dr. Baggerly received from the scientific journals when he shared the errors in the data. I would have thought that scientific journals would have been much more committed to ensuring that the papers that they had published were accurate and would have been more helpful to Dr. Baggerly and tough on the Duke research team. I think going forward a higher sense of accountability needs to be adopted in the scientific field to avoid scenarios like this.
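To make the off-by-one indexing error mentioned above more concrete, here is a minimal sketch in Python. It assumes one plausible cause (a header row that the software does not expect); the gene names and the `lookup_genes` helper are hypothetical and are not taken from the Duke data or from Dr. Baggerly's talk.

```python
def lookup_genes(rows, indices, has_header):
    """Return the entries at the given 1-based positions.

    If the file has a header row, skip it before counting;
    forgetting to do so shifts every lookup by one.
    """
    offset = 1 if has_header else 0
    return [rows[i - 1 + offset] for i in indices]


# Hypothetical gene list with a header row on top.
rows = ["ProbeID", "gene_A", "gene_B", "gene_C", "gene_D"]
reported = [1, 3]  # positions reported as significant

# Correct: the header row is accounted for.
correct = lookup_genes(rows, reported, has_header=True)

# Off by one: the header is counted as data, so every gene is shifted.
shifted = lookup_genes(rows, reported, has_header=False)

print(correct)  # ['gene_A', 'gene_C']
print(shifted)  # ['ProbeID', 'gene_B']
```

Without labels on the published gene list, nothing in the shifted output flags the mistake, which is why the missing documentation made this kind of error so hard to detect.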
Qlanners (talk) 17:28, 22 October 2017 (PDT)
QLanners Links
Group Project Page: JASPAR the Friendly Ghost
Mary Balducci's Responses
- The main issue with the data and analysis was that the results were not reproducible. There were errors with mislabeling data, as well as indexing errors. The data did not make sense, and the methods were not clear without outside help. Simple mistakes were also easy to miss due to poor documentation and recording of data. Dr. Baggerly claims that the most common mistakes are mixing up sample labels, mixing up gene labels, mixing up group labels, and incomplete documentation.
- Dr. Baggerly recommends labelling table columns, providing the code, and describing steps, especially steps which are planned in advance. These are very similar to DataONE's recommendations of labelling and having good documentation.
- The best practices that I performed this week included labelling my data with headers that explain exactly what each data point is. I also kept a detailed outline of every step I took to get my results.
- My reaction to this case after viewing the video is that I'm still shocked that this went on for so long. It seems like it was very obvious that the data was not reliable and yet it was allowed to go as far as clinical trials.
Mbalducc (talk) 20:44, 22 October 2017 (PDT)
Eddie Azinge's Responses
Cazinge (talk) 20:13, 23 October 2017 (PDT)
Katie Wright's Response
- Baggerly and Coombes were not able to reproduce the data they analyzed after poring over the data and using every available means of "Forensic Bioinformatics." This was not because of one specific error, but because of a multitude of errors. Often, the data was not labeled correctly, or software was used incorrectly in a way that led to the mislabeling of data (the +1 problem). The best practices violated were consistency in data labeling and documentation. Mislabeling and poor documentation are some of the most common errors in data analysis, and they wouldn't be such an enormous problem if labeling and documentation were just done properly in the first place.
- Dr. Baggerly and DataONE both recommend creating a "reproducible workflow." Your process and reasoning should be transparent and understandable so they can be evaluated and critiqued by others. Processes should also be automated wherever possible.
- For this week, we:
- Documented the entire procedure in minute detail (thanks to the procedure provided by the professors in the Week 8 assignment).
- Formatted the Excel spreadsheet with no spaces between rows or columns, and created new worksheets often to provide a step-by-step look at how the dataset was analyzed and manipulated.
- I think this talk just made me more angry about the whole fiasco. There were multiple "disturbing" errors (as Dr. Baggerly called them) that were pointed out to journals time after time. It took so long for the scientific community to listen to the biostatisticians and take an in-depth look at the data. I think that Dr. Baggerly makes a very good suggestion when he says that every institution should have their own biostatisticians independently review and reproduce the data analysis for every experiment before it is published.
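As one way to picture the "automate wherever possible" advice above, the following minimal Python sketch checks the sample labels used in an analysis against the published labels and flags duplicated samples. The sample IDs, group names, and both helper functions are hypothetical illustrations, not part of Dr. Baggerly's or DataONE's materials.

```python
from collections import Counter


def check_labels(published, used):
    """Compare the sample -> group labels used in an analysis
    with the labels from the published source."""
    mismatched = sorted(
        s for s in used if s in published and used[s] != published[s]
    )
    missing = sorted(s for s in used if s not in published)
    return {"mismatched": mismatched, "missing": missing}


def find_duplicates(sample_ids):
    """Return sample IDs that appear more than once."""
    return sorted(s for s, n in Counter(sample_ids).items() if n > 1)


# Hypothetical labels: two samples have their groups flipped.
published = {"S1": "sensitive", "S2": "resistant", "S3": "sensitive"}
used = {"S1": "resistant", "S2": "sensitive", "S3": "sensitive"}

report = check_labels(published, used)
dups = find_duplicates(["S1", "S2", "S2", "S3"])

print(report)  # {'mismatched': ['S1', 'S2'], 'missing': []}
print(dups)    # ['S2']
```

A check like this only works if the data is labeled in the first place, which is why labeling and documentation come before everything else in both sets of recommendations.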