I come from a Java background and am completely new to Python. I now have a Python project that consists of a few Python scripts and <code>pickle</code> files stored in Git. The pickle files are serialized sklearn models. I wonder how to organize this project. I think we should not store the pickle files in Git; we should probably store them as binary dependencies somewhere. Does that make sense? What is a common way to store binary dependencies of Python projects?


How to organize a Python project with pickle files?

1 Answer

Git handles binary data just fine. For example, many projects store images in Git repos.

I guess the rule of thumb is to decide whether your binary files are source material, an external dependency, or an intermediate build artifact. Of course, there are no strict rules, so go with whatever fits your project best. Here are my suggestions:

1. If they're (reproducibly) generated from something, .gitignore the binaries and have scripts that build the necessary data. The scripts could live in the same repo or in a separate one - depending on where it feels best.
2. The same logic applies if they're obtained from some external source, e.g. an external download. Usually, we don't store dependencies in the repository - we only keep references to them. E.g. we don't keep virtualenvs but only have a requirements.txt file - the Java world analogy (a rough approximation) is not committing .jars but only having a pom.xml or a dependencies section in build.gradle.
3. If they can be considered source material, e.g. if you use Python as an editor to manipulate them - don't worry about the files' binary nature and just keep them in your repository.
4. If they aren't really source material, but generating them is really complicated or takes very long, and the files aren't meant to be updated on a regular basis - I think it wouldn't be terribly wrong to keep them in the repo. Leaving a note (README.txt or something) about how the files were produced would be a good idea, of course.
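To make the first suggestion more concrete, here is a minimal sketch of a build script that regenerates a pickle and writes a small text manifest with its checksum. Everything named here is an assumption for illustration: `train_model` is a hypothetical stand-in for the real training code (for example, fitting an sklearn estimator), and the `models/model.pkl` and `model.manifest.json` paths are made up. The idea is that the pickle itself would be listed in .gitignore, while the script and the manifest are small enough to commit.

```python
import hashlib
import json
import pickle
from pathlib import Path


def train_model():
    # Hypothetical stand-in for the real training code, e.g. fitting an
    # sklearn estimator. A plain dict keeps this sketch stdlib-only.
    return {"weights": [0.1, 0.2, 0.3], "bias": 0.5}


def build_model(out_dir="models"):
    """Regenerate the pickle and record its checksum in a text manifest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Assumed layout: this file is the one listed in .gitignore.
    model_path = out / "model.pkl"
    data = pickle.dumps(train_model())
    model_path.write_bytes(data)

    # The manifest is plain text, so it diffs well and can be committed.
    manifest = {
        "file": model_path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    (out / "model.manifest.json").write_text(json.dumps(manifest, indent=2))
    return model_path


if __name__ == "__main__":
    print(build_model())
```

The checksum also doubles as the "note about how the files were produced" from the last suggestion: anyone who rebuilds the model can check whether they got the same bytes, assuming the training itself is deterministic.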

Oh, and if the files are large (say, hundreds of megabytes or more), consider taking a look at git-lfs (Git Large File Storage).
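If you are not sure whether any of your pickles are in that size range, a short script can list them. This is only a sketch: the `*.pkl` pattern and the 100 MiB threshold are assumptions chosen for illustration, not limits defined by Git or git-lfs.

```python
from pathlib import Path

# Assumed threshold; pick whatever "large" means for your repository.
LFS_THRESHOLD_BYTES = 100 * 1024 * 1024


def large_pickles(root=".", pattern="*.pkl", threshold=LFS_THRESHOLD_BYTES):
    """Return (path, size) pairs for matching files at or above the threshold."""
    hits = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold:
                hits.append((path, size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in large_pickles():
        print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```

Files that show up here are the candidates for git-lfs or for storage outside the repository.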

answered Sep 21 '22 01:09

drdaeman


Tags:

git

python

pickle

Michael
