Get length of a dataset in Tensorflow

Tags: python-3.x, tensorflow, dataset, tensorflow2.0, tensorflow2.x

source_dataset = tf.data.TextLineDataset('primary.csv')
target_dataset = tf.data.TextLineDataset('secondary.csv')
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
dataset = dataset.shard(10000, 0)
dataset = dataset.map(lambda source, target: (tf.string_to_number(tf.string_split([source], delimiter=',').values, tf.int32),
                                              tf.string_to_number(tf.string_split([target], delimiter=',').values, tf.int32)))
dataset = dataset.map(lambda source, target: (source, tf.concat(([start_token], target), axis=0), tf.concat((target, [end_token]), axis=0)))
dataset = dataset.map(lambda source, target_in, target_out: (source, tf.size(source), target_in, target_out, tf.size(target_in)))

dataset = dataset.shuffle(NUM_SAMPLES)  #This is the important line of code

I would like to shuffle my entire dataset, but shuffle() requires a buffer size (the number of samples to draw from), and tf.size() does not work on a tf.data.Dataset.

How can I shuffle properly?


asked Dec 10 '17 04:12

Evan Weissburg

2 Answers

I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem. In my case, I was trying to only take a certain percentage of the raw data. Since I knew all the records have a fixed length, a workaround for me was:

import os
import tensorflow as tf
# Assumes filepath, filenames, percentage and bytesPerRecord are already defined.
totalBytes = sum(os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath))
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, bytesPerRecord).take(numRecordsToTake)

In your case, my suggestion would be to count the records in 'primary.csv' and 'secondary.csv' directly in Python. Alternatively, I think that for your purpose, setting the buffer_size argument doesn't really require an exact count. According to the accepted answer about the meaning of buffer_size, a value greater than or equal to the number of elements in the dataset ensures a uniform shuffle across the whole dataset. So just passing a really big number (one that you expect to exceed the dataset size) should work. Keep in mind that the shuffle buffer then holds every element of the dataset in memory.
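
Counting the records in plain Python could look like the sketch below. It assumes one record per line; the helper name count_lines and the demo file are made up for illustration and are not part of the original question.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a file by streaming it, without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Small demonstration on a throwaway CSV file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("1,2,3\n4,5\n6\n")
    num_samples = count_lines(demo_path)
```

The resulting count can then be passed as the buffer size, as in dataset.shuffle(num_samples). Because the question's pipeline zips two files and shards by 10000, the actual number of elements is smaller than the line count, so the line count of either file is already a safe upper bound.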

answered Sep 28 '22 00:09

Ringo

As of TensorFlow 2, the length of the dataset can be easily retrieved by means of the cardinality() function.

import tensorflow as tf

dataset = tf.data.Dataset.range(42)
# Both evaluate to 42
dataset_length_v1 = tf.data.experimental.cardinality(dataset).numpy()
dataset_length_v2 = dataset.cardinality().numpy()

NOTE: If you apply a predicate such as filter to the dataset, cardinality() may return -2 (tf.data.UNKNOWN_CARDINALITY), meaning the length is unknown. In that case, make sure you have calculated the length of your dataset in another manner (for example, the length of the pandas DataFrame before applying .from_tensor_slices() to it).
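
When no external count is available, one fallback is to iterate over the dataset once and count. The following is a minimal sketch, assuming TensorFlow 2.3 or later; count_elements is a hypothetical helper name, not a TensorFlow API, and the full pass can be slow on large datasets.

```python
import tensorflow as tf


def count_elements(dataset):
    """Return the number of elements, iterating only when the cardinality is unknown."""
    n = int(dataset.cardinality().numpy())
    if n >= 0:
        return n  # statically known length
    if n == tf.data.INFINITE_CARDINALITY:
        raise ValueError("Cannot count an infinite dataset.")
    # Unknown cardinality (-2): make one full pass, adding 1 per element.
    total = dataset.reduce(tf.constant(0, dtype=tf.int64), lambda count, _: count + 1)
    return int(total.numpy())


# filter() makes the cardinality unknown, so this takes the counting path.
filtered = tf.data.Dataset.range(42).filter(lambda x: x % 2 == 0)
num_even = count_elements(filtered)
```

The count is only valid for the pipeline as it stands at that point; adding further filter or take steps afterwards changes the length again.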

answered Sep 28 '22 00:09

Timbus Calin
