I am coming from Java background and completely new at Python.
Now I have got a Python project that consists of a few Python scripts and pickle
files stored in Git. The pickle files are serialized sklearn models.
I wonder how to organize this project. I think we should not store the pickle files in Git. We should probably store them as binary dependencies somewhere.
Does it make sense ? What is a common way to store binary dependencies of Python projects
To save a pickle, use pickle. dump . A convention is to name pickle files *. pickle , but you can name it whatever you want.
The pickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
Pickle's ProsPickle constructs arbitrary Python objects by invoking arbitrary functions, that's why it is not secure. However, this enables it to serialise almost any Python object that JSON and other serialising methods will not do.
Git is just fine with binary data. For example, many projects store e.g. images in git repos.
I guess, the rule of thumb is to decide whenever your binary files are source material, an external dependency, or an intermediate build step. Of course, there are no strict rules, so just decide how you feel about them. Here are my suggestions:
If they're (reproducibly) generated from something, .gitignore
the binaries and have scripts that build the necessary data. It could be in the same, or in a separate repo - depending on where it feels best.
Same logic applies if they're obtained from some external source, e.g. an external download. Usually, we don't store dependencies in the repository - we only keep references to them. E.g. we don't keep virtualenvs but only have requirements.txt file - the Java world analogy is (a rough approximation) like not having .jars but only pom.xml or a dependencies section in build.gradle.
If they can be considered to be a source material, e.g. if you manipulate them with Python as an editor - don't worry about the files' binary nature and just have them in your repository.
If they aren't really a source material, but their generation process is really complicated or takes very long, and the files aren't meant to be updated on a regular basis - I think it won't be terribly wrong to have them in the repo. Leaving a note (README.txt or something) about how the files were produced would be a good idea, of course.
Oh, and if the files are large (like, hundreds of megabytes or more), consider taking a look at git-lfs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With