I'd like to create some ridiculously easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.) What I'd like to achieve is this: <ul> <li>User runs <code>pip install dataset</code> </li> <li>pip downloads the dataset, say via <code>wget http://mydata.com/data.tar.gz</code>. Note that the data does not reside in the Python package itself, but is downloaded from somewhere else.</li> <li>pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)</li> <li>Later, when the user imports my module, the module automatically loads the data from that location.</li> </ul> This question is about bullets 2 and 3. Is there a way to do this with setuptools?

Using setuptools, how can I download external data upon installation?

2 Answers

As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.

Instead, to avoid burdening the user, consider downloading the data lazily, the first time it is loaded. Example:

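import os
import tarfile
import urllib.request

DATA_DIR = os.path.expanduser('~/.mydataset')  # assumed cache location

def download_data(url='http://...'):
    # Download the archive and extract it into DATA_DIR.
    # urllib raises an exception if the link is bad, or we can't connect, etc.
    archive, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive) as tar:
        tar.extractall(DATA_DIR)

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    # read_data_from_disk is a placeholder for your own parsing code.
    return read_data_from_disk(DATA_DIR)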

We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.
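One weakness of checking only whether the data directory exists is that an interrupted download can leave a half-filled directory that later looks like a valid cache. The following is a minimal sketch of one way to guard against that, not a definitive implementation: `ensure_data` and its parameters are hypothetical names, and it assumes the dataset is a tar archive from a source you trust.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request


def ensure_data(url, data_dir):
    """Return data_dir, downloading and unpacking the archive on first use."""
    if os.path.isdir(data_dir):
        return data_dir  # cached by an earlier call; no network access needed

    parent = os.path.dirname(os.path.abspath(data_dir))
    os.makedirs(parent, exist_ok=True)
    # Work in a scratch directory next to the target so that a failed
    # download never leaves a half-populated data_dir behind.
    scratch = tempfile.mkdtemp(dir=parent)
    try:
        archive = os.path.join(scratch, "archive.tar.gz")
        # Raises an exception on a bad link or a failed connection.
        urllib.request.urlretrieve(url, archive)
        unpacked = os.path.join(scratch, "unpacked")
        os.makedirs(unpacked)
        with tarfile.open(archive) as tar:
            tar.extractall(unpacked)  # only use with archives you trust
        # Move the finished directory into place in a single step.
        os.replace(unpacked, data_dir)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
    return data_dir
```

A `load_data` function could call `ensure_data(url, data_dir)` first and then read from the returned path. Only the first call touches the network.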

answered Sep 24 '22 06:09

rd11

Python's packaging guidelines state that installing a package should never require executing Python code. This means that you may not be able to download anything during the installation process.

If you want to download additional data, do it after the package is installed. For example, your package could download the data the first time it is imported and cache it somewhere, so that it is not downloaded again on every import.
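As for where "somewhere" might be: the install directory is often not writable by the user, so a per-user directory is a common choice. The sketch below is one possible approach under that assumption. The function name `cache_dir`, the `~/.cache/<package>` default, and the optional environment-variable override are all illustrative choices, not part of any library.

```python
import os
from pathlib import Path


def cache_dir(package_name, env_var=None):
    """Return a per-user directory for cached downloads, creating it if needed."""
    # Let users relocate the cache through an environment variable, if one is named.
    override = os.environ.get(env_var) if env_var else None
    if override:
        path = Path(override)
    else:
        # Assumed default: ~/.cache/<package_name>
        path = Path.home() / ".cache" / package_name
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `cache_dir("mydataset", env_var="MYDATASET_HOME")` would return the directory named by `MYDATASET_HOME` when that variable is set, and `~/.cache/mydataset` otherwise.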

answered Sep 25 '22 06:09

sorin

Tags:

python

pip

setuptools
