I understand that the Dataset API is a kind of iterator that does not load the entire dataset into memory, which is why it cannot report the size of the dataset. I am talking in the context of a large corpus of data stored in text files or TFRecord files. Such files are generally read using tf.data.TextLineDataset or something similar. By contrast, it is trivial to find the size of a dataset loaded with tf.data.Dataset.from_tensor_slices.
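To make the contrast concrete, here is a minimal sketch using recent TF 2.x APIs (the file name is made up; the point is only that the file-backed dataset reports an unknown cardinality):

import tensorflow as tf

# In-memory dataset: the size is known up front.
in_memory = tf.data.Dataset.from_tensor_slices(list(range(1000)))
print(len(in_memory))  # 1000

# File-backed dataset: the size is not known without reading the files.
with open("corpus.txt", "w") as f:  # hypothetical file, just to make this runnable
    f.write("a\nb\nc\n")
file_backed = tf.data.TextLineDataset(["corpus.txt"])
print(file_backed.cardinality())  # -2, i.e. tf.data.UNKNOWN_CARDINALITY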
The reason I am asking for the size of the dataset is the following: let's say my dataset has 1000 elements and my batch size is 50 elements. Then the number of training steps/batches (assuming 1 epoch) is 20. Over these 20 steps, I would like to exponentially decay my learning rate from 0.1 to 0.01, as in
tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=20,
    decay_rate=0.1,
    staircase=False,
    name=None
)
In the above code, I would like to set decay_steps = number of steps/batches per epoch = num_elements / batch_size, which can be calculated only if the number of elements in the dataset is known in advance.
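For concreteness, this is how the pieces are meant to fit together (a sketch in the same TF 1.x style as above; num_elements and global_step are placeholders, and obtaining num_elements is exactly the open problem):

num_elements = 1000  # placeholder: this is the value I do not know how to obtain
batch_size = 50
decay_steps = num_elements // batch_size  # 20 steps per epoch

learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,  # assumed pre-existing global step variable
    decay_steps=decay_steps,
    decay_rate=0.1,
)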
Another reason to know the size in advance is to split the data into train and test sets using the tf.data.Dataset.take() and tf.data.Dataset.skip() methods.
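For example, a minimal sketch of an 80/20 split, again assuming num_elements is already known (which is exactly what is missing here):

train_size = int(0.8 * num_elements)  # requires knowing the dataset size
train_dataset = dataset.take(train_size)
test_dataset = dataset.skip(train_size)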
PS: I am not looking for brute-force approaches like iterating through the whole dataset while updating a counter to count the number of elements, or using a very large batch size and then finding the size of the resulting batch, etc.
You can easily get the number of data samples using:
dataset.__len__()
You can get each element like this:
for step, element in enumerate(dataset.as_numpy_iterator()):
    print(step, element)
You can also inspect the type and shape of one element:
dataset.element_spec
If you want to take specific elements, you can use the shard method as well, for example:
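Here is a small sketch of shard, which splits a dataset into interleaved pieces by element index (the variable names are illustrative):

first_half = dataset.shard(num_shards=2, index=0)   # elements 0, 2, 4, ...
second_half = dataset.shard(num_shards=2, index=1)  # elements 1, 3, 5, ...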
I realize this question is two years old, but perhaps this answer will be useful.
If you are reading your data with tf.data.TextLineDataset, then a way to get the number of samples could be to count the number of lines in all of the text files you are using.
Consider the following example:
import random
import string
import tensorflow as tf
filenames = ["data0.txt", "data1.txt", "data2.txt"]
# Generate synthetic data.
for filename in filenames:
    with open(filename, "w") as f:
        lines = [random.choice(string.ascii_letters) for _ in range(random.randint(10, 100))]
        print("\n".join(lines), file=f)
dataset = tf.data.TextLineDataset(filenames)
Trying to get the length with len raises a TypeError:
len(dataset)
But one can calculate the number of lines in a file relatively quickly.
# https://stackoverflow.com/q/845058/5666087
def get_n_lines(filepath):
    i = -1
    with open(filepath) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
n_lines = sum(get_n_lines(f) for f in filenames)
In the above, n_lines is equal to the number of elements found when iterating over the dataset with
for i, _ in enumerate(dataset):
    pass
n_lines == i + 1