Along with producing incorrect results, one of the worst fears in scientific programming is being unable to reproduce the results you have already generated. What best practices help ensure your analysis is reproducible?
The reproducibility standard is based on the fact that every computational experiment has, in theory, a detailed log of every action taken by the computer.
“Reproducibility” refers to instances in which the original researcher's data and computer code are used to regenerate the results, while “replicability” refers to instances in which a researcher collects new data to arrive at the same scientific findings as a previous study.
Reproducibility is, and has long been, a fundamental principle of the scientific method. The term refers to the ability to take the original researcher's data and analysis and generate the same results. It is often used interchangeably with replicability, another fundamental principle, in which a researcher collects new data to arrive at the same findings as a previous study.
Reproducibility is essential to science because it allows for more thorough scrutiny of a piece of research, while replicability confirms its results. Studies and experiments involve many variables, unknowns, and factors outside your control that you cannot guarantee, which is exactly why being able to rerun the original analysis matters.
The first two points are incredibly important, because making the dataset available allows others to perform their own analyses on the same data, which increases confidence in the validity of your own analyses. Additionally, publishing the dataset online, especially in linked-data formats, makes it possible for crawlers to aggregate it with other datasets and thereby enables analyses over larger collections. In many types of research the sample size is too small to be really confident about the results, but sharing your dataset makes it possible to construct much larger ones. Someone could also use your dataset to validate an analysis they performed on other data.
Additionally, making your code open source allows the code and procedure to be reviewed by your peers. Such reviews often uncover flaws or opportunities for optimization and improvement. Most importantly, it allows other researchers to build on your methods without having to reimplement from scratch everything you have already done. Research moves much faster when researchers can focus on improvements rather than on reinventing the wheel.
As for randomization: many algorithms rely on randomness to achieve their results. Stochastic and Monte Carlo methods are quite common, and while they have been proven to converge in certain cases, different runs can still produce different results. One way to make the outcome dependable is to have a loop in your code that invokes the computation a fixed number of times and keeps the best result. With enough repetitions you can expect to find global or near-global optima instead of getting stuck in local ones. Another possibility is to use a predetermined seed, although that is not, IMHO, as good an approach, since you could pick a seed that leads the search into a local optimum, and there is no guarantee that random number generators on different platforms will produce the same sequence for that seed value.
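As a rough sketch of both ideas in Python (using NumPy and SciPy, with a made-up objective function purely for illustration), a fixed seed plus multiple random restarts looks something like this:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical objective with several local minima (illustrative only).
def objective(x):
    return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

# Fixed seed: the same starting points, and therefore the same reported
# result, on every run -- at least on the same platform and library versions.
rng = np.random.default_rng(seed=42)

# Multiple random restarts: run the local optimizer from several starting
# points and keep the best result, which reduces the chance of reporting
# a poor local optimum.
n_restarts = 20
best = None
for _ in range(n_restarts):
    x0 = rng.uniform(-5.0, 5.0, size=1)
    result = minimize(objective, x0, method="Nelder-Mead")
    if best is None or result.fun < best.fun:
        best = result

print(f"best x = {best.x}, f(x) = {best.fun:.4f}")
```

Recording the seed, the number of restarts, and the library versions alongside the results is what makes a run like this repeatable by someone else.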