In data science, replicability and reproducibility are key to data integrity. Three main topics can be derived from the concept: data replicability, data reproducibility and research reproducibility. These may sound similar, but they are quite different. We have covered the three topics and their differences over the course of three articles. We started with data replicability and data reproducibility, and now we move on to reproducible research.
To begin, how do you define reproducible research? It is the ability to repeat someone else's data analysis and arrive at the same results. Whereas data replication and data reproduction involve generating data again, research reproduction involves only repeating the analysis.
Evidence of correctness
An obvious reason for reproducing research and repeating analyses is to confirm that the original results are indeed correct. When another scientist reproduces your research and comes to the same conclusion, it is more likely that the conclusion is correct. The same goes for negative results, where reproducing the analysis can confirm whether they are erroneous or statistically insignificant.
There can be different ways to analyze the same results, so different analyses may lead to different conclusions. These different conclusions are interesting in themselves: existing findings and claims can be built upon, and new observations can be made. Shedding new light on results and considering them in alternative ways might lead to new discoveries or give reason to take the research in a new direction.
Increasing complexity of data analyses
The complexity of data analysis has increased remarkably. Data sets are larger and computations are more sophisticated. This makes reproducibility necessary in order to reduce the error and bias that humans introduce into the process of data analysis. Such errors are inevitable: scientists are human too, and we are not immune to mistakes. Reproducible research requires that the raw data be made available, which lets others carry out full analyses that are not biased by previous ones.
The focus remains on the content of the data analysis
Reproducibility ensures that the important part of a data analysis is not missed. Data analyses often get lost in superficial summaries written to convince the reader of their significance. When research is made reproducible, the step-by-step process must be included, so others can understand how and why conclusions were drawn. This gives more context to the analysis and enables others to follow the process themselves and come to their own conclusions.
A step you can take towards reproducible research while you are still carrying out your work is version control. This means continuously making records of data and files as you work on them. Doing so enables you and, more importantly, others to refer back to specific points in your research. This provides context to your research process should others be trying to understand it. These versions should naturally be made available to others.
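As a minimal sketch of what making records as you work could look like, the Python snippet below writes a timestamped snapshot of every file in a data folder, together with a checksum of each file, to a log. It is an illustration only and not a replacement for a dedicated version control system. The function names, the folder layout and the log file are assumptions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(data_dir: Path, log_file: Path, note: str) -> dict:
    """Append a timestamped record of every file in data_dir to log_file.

    Keep log_file outside data_dir so the log does not record itself.
    """
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "note": note,
        "files": {
            p.relative_to(data_dir).as_posix(): file_checksum(p)
            for p in sorted(data_dir.rglob("*"))
            if p.is_file()
        },
    }
    # One JSON object per line, so earlier snapshots are never overwritten.
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def changed_files(old: dict, new: dict) -> list:
    """List files that were added, removed or modified between two snapshots."""
    names = set(old["files"]) | set(new["files"])
    return sorted(n for n in names if old["files"].get(n) != new["files"].get(n))
```

Calling snapshot() after each working session and comparing two records with changed_files() shows which files differ between two points in the research. The checksums only show that a file changed, not what changed, so copies of the files themselves still need to be kept.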
Report research and data analysis methods
Keeping versions and making them available is useful, but only when they are accompanied by thorough reports. These reports should cover the methods used, the process as a whole and the data analysis. It is advisable to write at least brief notes on the methods while you are carrying out your work, because it can be tricky to remember later how many times you washed a membrane or how long you shook those plates for. The raw data should, of course, also be made available, so that others are able to carry out the entire analysis independently should they wish.
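For those who prefer to keep such notes in a structured form, one possible approach is sketched below in Python: each step is appended to a log file with a timestamp and its parameters. The function names, the file format and the example values (number of washes, buffer, duration) are hypothetical and only serve to illustrate the idea.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_step(log_file: Path, step: str, **parameters) -> dict:
    """Append one method step, with its parameters and a timestamp, to log_file."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "parameters": parameters,
    }
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_file: Path) -> list:
    """Return all logged steps in the order they were recorded."""
    lines = log_file.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    notes = Path("methods_log.jsonl")  # hypothetical file name
    # Hypothetical values, recorded at the bench rather than from memory.
    log_step(notes, "wash membrane", repeats=3, buffer="TBST", minutes_each=5)
    log_step(notes, "shake plates", minutes=30, rpm=200)
    for entry in read_log(notes):
        print(entry["time"], entry["step"], entry["parameters"])
```

A log like this can later be turned into the methods section of a report, so details do not have to be reconstructed from memory.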
Clearly link claims to the underlying data
You should make sure that it is clear to readers how you reached your conclusions. Just because something makes sense to you does not mean it will make sense to others. When making claims based upon your analysis, you should link them to the data they are based on, so that others reading your work can easily follow the steps you took from data to conclusions. Having the full context makes it easier for them to understand your work process and relate to it when reproducing the analysis for themselves.
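One way to make such a link explicit is to store each claim together with a reference to the exact data it rests on and the value that was reported, so that a reader can recompute it. The Python sketch below assumes the data is a small CSV table and the claim is about the mean of one column. The class, the function names and the column name are assumptions made for this example.

```python
import csv
import hashlib
import io
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    statement: str        # the claim as written in the report
    data_sha256: str      # checksum of the exact data the claim rests on
    column: str           # column the claim was computed from
    reported_mean: float  # value stated in the report


def sha256_of(text: str) -> str:
    """Return the SHA-256 hex digest of a text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def column_mean(csv_text: str, column: str) -> float:
    """Compute the mean of one numeric column of a CSV table."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return statistics.mean(float(row[column]) for row in rows)


def verify(claim: Claim, csv_text: str, tolerance: float = 1e-9) -> bool:
    """Check that the data is unchanged and that it reproduces the reported value."""
    if sha256_of(csv_text) != claim.data_sha256:
        return False  # not the data the claim was based on
    return abs(column_mean(csv_text, claim.column) - claim.reported_mean) <= tolerance


if __name__ == "__main__":
    # Hypothetical raw data and claim.
    raw = "sample,absorbance\na,1.0\nb,2.0\nc,3.0\n"
    claim = Claim("Mean absorbance was 2.0", sha256_of(raw), "absorbance", 2.0)
    print(verify(claim, raw))
```

Anyone holding the raw data can run verify() to confirm both that they have the same data and that the stated value follows from it.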
Digital lab notebook
Using a digital lab notebook, also known as an electronic lab notebook (ELN), can offer several advantages for research reproducibility. The first relates to version control. With labfolder, you are able to export your notebook’s entire contents as an XHTML file to store as an offline archive. labfolder also automatically documents a full audit trail for all of your entries. This means that from the time and date an entry is created, all edits to it are recorded. In your ELN you are able to go through the edit histories and view each version of the entry.
Having an ELN also makes it easy to make notes about your research and methods while you are experimenting, which helps ensure full reporting. Another benefit is that all of the raw data remains available for you and, if you wish, for others to access and refer to. This aids transparency and again makes it easier for others to understand your workflow.