
Reproducibility in scientific programming

Along with producing incorrect results, one of the worst fears in scientific programming is not being able to reproduce the results you've generated. What best practices help ensure your analysis is reproducible?

Asked by Andrew Grimm, Apr 29 '10 01:04



1 Answer

  • Publish the original raw data online and make it freely available for download.
  • Make the code base open source and available online for download.
  • If randomization is used in optimization, either repeat the optimization several times and keep the best value that results, or use a fixed random seed so that the same results are repeated.
  • Before performing your analysis, split the data into a "training/analysis" dataset and a "testing/validation" dataset. Perform your analysis on the "training" dataset, then check that the results still hold on the "validation" dataset, to ensure that your analysis actually generalizes and isn't simply memorizing peculiarities of the dataset in question.
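The last point, splitting the data, can be sketched roughly as follows. This is only a minimal illustration using Python's standard library; the function name `split_dataset`, the 25% validation fraction, and the seed value are assumptions made for the example, not anything prescribed above.

```python
import random


def split_dataset(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (training, validation).

    `split_dataset`, the default fraction, and the default seed are
    hypothetical choices for this sketch.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    # A local generator keeps the split independent of the global RNG state.
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


rows = list(range(20))  # stand-in for real records
training, validation = split_dataset(rows)
print(len(training), len(validation))
```

Recording the seed alongside the results lets someone else recreate the same split, although the exact shuffle order may still differ between Python versions.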

The first two points are incredibly important. Making the dataset available allows others to perform their own analyses on the same data, which increases confidence in the validity of your own analyses. Additionally, making the dataset available online, especially if you use linked data formats, makes it possible for crawlers to aggregate your dataset with other datasets, enabling analyses on larger data sets. In many types of research the sample size is too small to be really confident about the results, but sharing datasets makes it possible to construct very large ones. Or someone could use your dataset to validate an analysis that they performed on some other dataset.

Additionally, making your code open source makes it possible for your peers to review the code and procedure. Such reviews often lead to the discovery of flaws, or of opportunities for additional optimization and improvement. Most importantly, it allows other researchers to improve on your methods without having to reimplement from scratch everything that you have already done. It greatly accelerates the pace of research when researchers can focus on improvements rather than on reinventing the wheel.

As for randomization: many algorithms rely on randomization to achieve their results. Stochastic and Monte Carlo methods are quite common, and while they have been proven to converge in certain cases, two runs can still produce different results. One way to get consistent results is to have a loop in your code that invokes the computation some fixed number of times and chooses the best result. If you use enough repetitions, you can expect to find global or near-global optima instead of getting stuck in local optima. Another possibility is to use a predetermined seed, although that is not, IMHO, as good an approach on its own, since you could pick a seed that causes you to get stuck in local optima. In addition, there is no guarantee that random number generators on different platforms will generate the same sequence for a given seed value.
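The two approaches can also be combined: run a fixed number of restarts, give each restart its own recorded seed, and keep the best result. The sketch below is only an illustration under assumptions. The `objective` function, the naive `random_local_search`, and the seed value 12345 are all made up for the example and stand in for whatever stochastic method you actually use.

```python
import math
import random


def objective(x):
    """Hypothetical bumpy 1-D function with several local minima."""
    return x * x + 10.0 * math.sin(3.0 * x)


def random_local_search(rng, steps=2000, step_size=0.1):
    """Naive stochastic descent; it can get stuck in a local minimum."""
    x = rng.uniform(-5.0, 5.0)
    best = objective(x)
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        value = objective(candidate)
        if value < best:  # accept only improvements
            x, best = candidate, value
    return x, best


def best_of_restarts(base_seed, restarts=20):
    """Run a fixed number of restarts and keep the lowest objective value."""
    results = []
    for i in range(restarts):
        seed = base_seed + i       # one recorded seed per restart
        rng = random.Random(seed)  # local generator, no global state
        x, value = random_local_search(rng)
        results.append((value, x, seed))
    return min(results)  # tuples compare by objective value first


value, x, seed = best_of_restarts(base_seed=12345)
print(f"best value {value:.4f} at x = {x:.4f} (seed {seed})")
```

Running this twice in the same environment gives the same answer, and the restarts reduce the chance that one unlucky seed leaves you in a local optimum. As noted above, a different platform or library version may still produce a different random sequence, so it helps to report the environment along with the seed.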

Answered by Michael Aaron Safyan, Oct 03 '22 03:10