
Is it possible to only load part of a TensorFlow dataset?

I have a notebook in Google Colab with the following code:

import tensorflow_datasets as tfds

batch_size = 64
dataset_name = 'coco/2017_panoptic'

tfds_dataset, tfds_info = tfds.load(
    dataset_name,
    split='train',
    with_info=True)

I would like to know if it is possible to download only part of the dataset (say 5%, or X images) with the tfds.load function. As far as I can see in the documentation, there are no arguments to do so. Of course it would be possible to slice the dataset after downloading, but this particular dataset (coco/2017_panoptic) is 19.57 GiB, which takes quite a while to download.

Sytze asked Oct 23 '25 04:10



1 Answer

The original question asks how to download only a subset of the dataset, not how to load one.

So the answer recommending an argument like split='train[:5%]' as a way of downloading only 5% of the training data is mistaken. It appears that this still downloads the entire dataset and then loads only 5% of it.

You can check this for yourself by running mnist_ds_5p = tfds.load("mnist", split="train[:5%]") first, followed by mnist_ds = tfds.load("mnist", split="train").

No downloading takes place after running the second command. This is because the entire dataset has already been downloaded and cached after running the first command!
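The check above can be sketched as a small script. This is only a rough illustration: percent_split, timed_load and compare_loads are hypothetical helper names (not part of tensorflow_datasets), and it assumes a fresh runtime in which mnist has not been cached yet. Only tfds.load with its split argument is taken from the library.

```python
import time


def percent_split(split_name: str, percent: int) -> str:
    """Build a TFDS slicing string such as 'train[:5%]' (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"{split_name}[:{percent}%]"


def timed_load(name: str, split: str):
    """Call tfds.load and return the dataset plus the elapsed seconds."""
    # Imported lazily so the helpers above work without the package installed.
    import tensorflow_datasets as tfds

    start = time.perf_counter()
    ds = tfds.load(name, split=split)
    return ds, time.perf_counter() - start


def compare_loads(name: str = "mnist", percent: int = 5) -> None:
    """Load a slice first, then the full split, and print both timings."""
    # First call: on a fresh runtime this triggers the download and preparation.
    _, t_slice = timed_load(name, percent_split("train", percent))
    # Second call: if the whole dataset is already cached, no download happens.
    _, t_full = timed_load(name, "train")
    print(f"{percent}% slice: {t_slice:.1f}s, full split: {t_full:.1f}s")
```

Calling compare_loads() in a fresh Colab runtime should show the download log during the first load only, which is the behaviour described above. Timings will vary, so treat them as a hint rather than proof.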

Since many of the datasets are fetched as compressed archives, I'm afraid I doubt there is a simple way to avoid downloading the entire dataset.

Chris Gumb answered Oct 25 '25 18:10
