I have a notebook in Google Colab with the following code:
import tensorflow_datasets as tfds

batch_size = 64
dataset_name = 'coco/2017_panoptic'
tfds_dataset, tfds_info = tfds.load(
    dataset_name,
    split='train',
    with_info=True)
I would like to know if it is possible to download only part of the dataset (say 5%, or X images) with the tfds.load function. As far as I can see in the documentation, there are no arguments to do so. Of course it would be possible to slice the dataset after downloading, but this particular dataset (coco/2017_panoptic) is 19.57 GiB, which obviously takes quite a while to download.
The original question was about how to download only a subset of the dataset, so the answer recommending an argument like split='train[:5%]' as a way of downloading only 5% of the training data is mistaken. It seems that this still downloads the entire dataset and then loads only 5% of it.
You can check this for yourself by running
mnist_ds_5p = tfds.load("mnist", split="train[:5%]")
followed by
mnist_ds = tfds.load("mnist", split="train")
No downloading takes place after running the second command. This is because the entire dataset has already been downloaded and cached after running the first command!
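Another way to confirm this, without relying on the download progress output, is to measure how much disk space the TFDS cache occupies after each call. The following is only a minimal sketch using the standard library: dir_size_bytes is a hypothetical helper (not part of TFDS), and the path ~/tensorflow_datasets is assumed to be the data directory in use; adjust it if you pass data_dir to tfds.load.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`.

    Hypothetical helper for illustration; a missing directory counts as 0.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked data is not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Assumed TFDS data directory; change this if you pass data_dir to tfds.load.
data_dir = os.path.expanduser("~/tensorflow_datasets")
size_mib = dir_size_bytes(data_dir) / 2**20
print(f"{size_mib:.1f} MiB in {data_dir}")
```

If you run this once after the train[:5%] call and again after the full train call, the reported size should be essentially unchanged if the whole dataset was already fetched by the first call.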
Since many of the datasets are fetched in compressed form, I doubt there is a simple way to avoid downloading the entire dataset, I'm afraid.