The significance of reproducible data

In data science, replicability and reproducibility are among the keys to data integrity. Three main topics can be derived from the concept: data replicability, data reproducibility, and research reproducibility. These may sound similar, but they are quite different. We cover these three topics and their differences over the course of three articles. Having started with data replicability, we now move on to data reproducibility.

So, how do we define data reproducibility? In one sense, it is a less strict counterpart to replicability: an experiment that is reproducible is not necessarily replicable. This is because you can reproduce an experiment even when different methods are used, so long as you achieve the same results. Below we look at why data reproducibility is necessary and how you can ensure it.


Why is data reproducibility important?
How can you ensure data reproducibility?

Why is data reproducibility important?

It demands condition changes

The first reason data reproducibility is significant is that it creates more opportunities for new insights. To reproduce data, you need to make changes to the experiment while still aiming to achieve the same results. When you change conditions, you not only see different ways of getting the same results, but you also shed light on possibilities that may not have been considered before. This may lead to the disproving of a hypothesis or the conception of a new one.

Reduce error risks

It is always advisable to build some form of repetition into experiments. This lets you double-check that things were done correctly and increases reliability. Through data reproduction, you can also reduce the chance of flukes and mistakes. In the same experimental settings, you might miss mistakes, or even fall into the habit of making them when repeating steps over and over. New conditions and different techniques should pull you out of any bad habits. They also make it easier to tell whether the previous technique's results were fortuitous.

Validate results

We need data replication to confirm our results, and data reproduction for more thorough research. One reason is the chance to gain new insights and reduce errors. More importantly, reproduction by its nature strengthens the data, the results and the analysis. It is now widely agreed that data reproducibility is a key part of the scientific process. This means that you should make it a regular practice to make data reproducible and, where feasible, to reproduce it or have others do so.

It is the only thing you can guarantee in a study

In research, studies and experiments, there are many variables, unknowns and things that you cannot guarantee. But the one thing you can ensure in your work is its reproducibility. Due to the nature of science, you cannot be sure that the results are correct or will remain correct. When you ensure reproducibility, you make your experiment transparent and allow others to understand what was done, whether or not they go on to reproduce the data.

How can you ensure data reproducibility?

Make the raw data available

To reproduce data, or to let others do so, you should ensure that the raw data sets are available. They serve as the reference, since the aim of reproducing data is to achieve the same results. This data should truly be raw: unmodified and exactly as you collected it, before any analysis. Providing the root of the data allows proper comparison once it has been reproduced, so you can identify any differences and similarities between the reproduced data and the original.

A key medium for enabling this is Figshare, a digital data repository. With Figshare you can upload your raw data and then choose to share it with others if you publish using that data. labfolder integrates with Figshare, so you can easily export your notebook contents.
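One way to show that shared raw data really is unmodified is to publish a checksum manifest alongside it. The following is a minimal sketch, not part of Figshare or labfolder: the function names (`sha256_of_file`, `build_manifest`, `changed_files`) and the idea of a single raw-data directory are assumptions made for illustration, and only Python's standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map every file under raw_dir (by relative path) to its checksum."""
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of_file(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or have different contents."""
    current = build_manifest(raw_dir)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

Recording the manifest at collection time and re-running `changed_files` before sharing gives anyone reproducing the data a simple way to confirm they are starting from the same files.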

Transparent reporting

Just as if you were preparing your data to be replicable, you should be totally transparent with all aspects of your data to enable reproducibility. This is not only because it is good practice, but because it allows others to fully understand the steps you took to achieve the results you did. This applies to reporting on experiment performance, techniques and tools used, data collection methods and analysis.

Another crucial part of transparency is being open about negative and statistically insignificant results. These are often ignored, but full reproducibility requires full transparency. This applies whether you are the first to carry out an experiment or you are reproducing data.

If you are reproducing data, you should be equally transparent and report on all aspects of the research. You will need to specify which conditions you altered in the experiment, covering all the aspects listed above.
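Reporting the altered conditions is easier when each run's conditions are recorded in a structured form. As a rough sketch, assuming conditions are kept as simple key-value pairs, the helper below (the name `altered_conditions` and all example values are hypothetical) lists every condition that differs between the original run and the reproduction.

```python
def altered_conditions(original: dict, reproduction: dict) -> dict:
    """Return {condition: (original_value, reproduction_value)} for each difference.

    A condition present in only one run is reported with None for the other.
    """
    keys = sorted(set(original) | set(reproduction))
    return {
        key: (original.get(key), reproduction.get(key))
        for key in keys
        if original.get(key) != reproduction.get(key)
    }


# Hypothetical example values, for illustration only.
original_run = {
    "instrument": "plate reader A",
    "temperature_c": 25,
    "analysis": "t-test",
}
reproduction_run = {
    "instrument": "plate reader B",
    "temperature_c": 25,
    "analysis": "t-test",
}

changes = altered_conditions(original_run, reproduction_run)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```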

A Nature article reported that it is common to fail to reproduce data, even your own. This indicates that more effort than ever is needed to enable reproducibility.

Checklist

To make life easier for yourself, you can create a checklist of reporting criteria. This serves both as your own reference when carrying out experiments and as a guide for others to follow when they reproduce your data. Having established criteria not only ensures thorough reporting but also makes it easier to compare results and confirm that the data was properly reproduced.

The Nature article further reported that just over a third of the scientists surveyed do not have any procedures in place.
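A reporting checklist can also be kept in machine-readable form so that gaps are flagged automatically. The sketch below is one possible approach, not a prescribed standard: the criteria names are assumptions drawn from the reporting aspects mentioned in this article, and `missing_criteria` is a hypothetical helper.

```python
# Assumed criteria, based on the reporting aspects discussed in this article.
REPORTING_CRITERIA = [
    "raw_data_location",
    "data_collection_method",
    "techniques_and_tools",
    "analysis_method",
    "negative_results",
]


def missing_criteria(report: dict[str, str],
                     criteria: list[str] = REPORTING_CRITERIA) -> list[str]:
    """Return the criteria that are absent from the report or left blank."""
    return [name for name in criteria if not report.get(name, "").strip()]


# Hypothetical, partly filled-in report.
draft_report = {
    "raw_data_location": "shared repository, dataset 1",
    "data_collection_method": "manual pipetting, triplicates",
    "techniques_and_tools": "",
    "analysis_method": "two-sample t-test",
}

for name in missing_criteria(draft_report):
    print(f"Still to report: {name}")
```

Using the same criteria list for the original experiment and for any reproduction keeps the two reports directly comparable.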

Digital lab notebook

Adopting a digital lab notebook can aid your efforts, since you can make to-do lists that act as checklists within your notebook. You can also create protocols and templates to share with others who are reproducing the data. With your electronic lab notebook (ELN) you can record and make notes as you experiment, so you capture each step correctly. You can also enter the raw data directly into your ELN, where you can view it, analyze it and easily share it with others when you need to.

Research Data Management (RDM) is an overarching process that guides researchers through the many stages of the data lifecycle. In doing so, it enables scientists and stakeholders alike to make the most of the research data they generate. Electronic lab notebooks simplify the creation of effective RDM plans and enable researchers to put them into action easily, for better, more reproducible, transparent and open science.

To discover how to optimize RDM strategies, check out our guide on effective Research Data Management.
