Reproducible builds in Python

I need to ship a compiled version of a python script and be able to prove (using a hash) that the compiled file is indeed the same as the original one.

What we use so far is a simple:

find . -name "*.py" -print0 | xargs -0 python2 -m py_compile

The issue is that this is not reproducible: two executions will not give us the same .pyc for the same Python file, and I am not sure which factors fluctuate. This forces us to always ship the same compiled version instead of simply giving the build script to anyone so they can produce a new compiled version.

Is there a way to achieve that?

Thanks

Martin Trigaux Avatar asked Sep 13 '16 13:09


2 Answers

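Compiled Python files start with a four-byte magic number followed by a four-byte timestamp (the modification time of the source file, as recorded when it was compiled). This probably accounts for the discrepancies you are seeing.

If you omit bytes 5-8 from the checksumming process, you should see constant checksums for a given version of Python. This applies to the python2 layout used in the question; Python 3 .pyc files have a longer header, so the offsets differ there.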
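As a rough illustration of that idea, here is a minimal sketch that hashes a .pyc while skipping the timestamp field. The function name `pyc_checksum` and the choice of SHA-256 are my own assumptions, not something from the question, and the offsets assume the eight-byte python2 header described above.

```python
import hashlib

# Assumption: python2 header layout, i.e. bytes 1-4 hold the magic number
# and bytes 5-8 hold the timestamp. Adjust the offsets for other layouts.
TIMESTAMP_START = 4  # zero-based offset of byte 5
TIMESTAMP_END = 8    # zero-based offset just past byte 8


def pyc_checksum(path):
    """Return a SHA-256 hex digest of a .pyc file, ignoring the timestamp.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    digest = hashlib.sha256()
    digest.update(data[:TIMESTAMP_START])  # magic number
    digest.update(data[TIMESTAMP_END:])    # everything after the timestamp
    return digest.hexdigest()
```

Two .pyc files that differ only in bytes 5-8 should then produce the same digest, while any change to the magic number or the bytecode changes it.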

The format of the .pyc file is given in this blog post by Ned Batchelder.

holdenweb Avatar answered Nov 15 '22 04:11



2019 / Python 3.7+ update: since PEP 552, a .pyc can be validated against a hash of its source instead of a timestamp:

python -m compileall -f --invalidation-mode=checked-hash [file|dir]
# or
export SOURCE_DATE_EPOCH=1  # makes py_compile default to py_compile.PycInvalidationMode.CHECKED_HASH
python -m py_compile [file ...]

Either command will create .pyc files which will not change until their source code changes.
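The same thing can be done from Python with `py_compile.compile`. Below is a minimal sketch, assuming Python 3.7 or later; the helper names `compile_hash_based` and `builds_match` are hypothetical. It passes a fixed `dfile` because the source path is otherwise embedded in the bytecode, so building from a different directory would change the output.

```python
import os
import py_compile
import tempfile


def compile_hash_based(source_path, pyc_path, display_name):
    """Compile source_path into a hash-based .pyc and return its bytes.

    Hypothetical helper for illustration; requires Python 3.7+.
    """
    py_compile.compile(
        source_path,
        cfile=pyc_path,
        dfile=display_name,  # name recorded in the bytecode instead of the real path
        doraise=True,        # raise PyCompileError instead of printing it
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )
    with open(pyc_path, "rb") as handle:
        return handle.read()


def builds_match(source_text):
    """Compile the same source twice, changing its mtime in between."""
    with tempfile.TemporaryDirectory() as workdir:
        source_path = os.path.join(workdir, "example.py")
        with open(source_path, "w") as handle:
            handle.write(source_text)
        first = compile_hash_based(
            source_path, os.path.join(workdir, "first.pyc"), "example.py"
        )
        # Change the modification time; a timestamp-based .pyc would differ here.
        os.utime(source_path, (1, 1))
        second = compile_hash_based(
            source_path, os.path.join(workdir, "second.pyc"), "example.py"
        )
        return first == second
```

Calling `builds_match("ANSWER = 6 * 7\n")` compares the two outputs byte for byte. This only checks two builds with the same interpreter; a different Python version will generally produce different bytecode.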

Steven Kalt Avatar answered Nov 15 '22 02:11
