TensorFlow tf.data.Dataset and bucketing

For an LSTM network, I've seen great improvements with bucketing.

I've come across the bucketing section in the TensorFlow docs (under tf.contrib).

However, my network uses the tf.data.Dataset API, specifically with TFRecords, so my input pipeline looks something like this:

dataset = tf.data.TFRecordDataset(TFRECORDS_PATH)
dataset = dataset.map(_parse_function)
dataset = dataset.map(_scale_function)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={.....})

How can I incorporate the bucketing method into the tf.data.Dataset pipeline?

If it matters, in every record in the TFRecords file I have the sequence length saved as an integer.

asked May 30 '18 13:05

bluesummers

1 Answer

Various bucketing use cases using the Dataset API are explained well here.

bucket_by_sequence_length() example:

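import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib and tf.Session)

def elements_gen():
    # Yield (token_ids, label) pairs with varying sequence lengths.
    text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
    label = [1, 2, 1, 2]
    for x, y in zip(text, label):
        yield (x, y)

def element_length_fn(x, y):
    # The length of an element is the length of its sequence tensor.
    return tf.shape(x)[0]

dataset = tf.data.Dataset.from_generator(generator=elements_gen,
                                         output_shapes=([None], []),
                                         output_types=(tf.int32, tf.int32))

# Boundaries [0, 8] define three buckets (length < 0, 0 <= length < 8,
# length >= 8), so three batch sizes are required.
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(
    element_length_func=element_length_fn,
    bucket_batch_sizes=[2, 2, 2],
    bucket_boundaries=[0, 8]))

batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    # Each run pulls one batch, padded to the longest sequence in that batch.
    for _ in range(2):
        print('Get_next:')
        print(sess.run(batch))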

Output:

Get_next:
(array([[1, 2, 3, 0, 0],
       [3, 4, 5, 6, 7]], dtype=int32), array([1, 2], dtype=int32))
Get_next:
(array([[1, 2, 0, 0],
       [8, 9, 0, 2]], dtype=int32), array([1, 2], dtype=int32))
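
To see why the output above looks the way it does, here is a rough pure-Python sketch of the bucketing idea. It is not the TensorFlow implementation: bucket_batches and pad_value are hypothetical names made up for illustration, and the sketch assumes a sequence lands in bucket i when boundaries[i-1] <= length < boundaries[i], with one more bucket than there are boundaries.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_batches(examples, bucket_boundaries, bucket_batch_sizes, pad_value=0):
    """Yield (padded_sequences, labels) batches grouped by length bucket.

    Hypothetical helper for illustration only, not a TensorFlow API.
    """
    if len(bucket_batch_sizes) != len(bucket_boundaries) + 1:
        raise ValueError("need len(bucket_boundaries) + 1 batch sizes")

    pending = defaultdict(list)  # bucket id -> examples waiting for a full batch

    def pad(batch):
        # Pad only to the longest sequence in this batch, not a global maximum.
        width = max(len(seq) for seq, _ in batch)
        seqs = [list(seq) + [pad_value] * (width - len(seq)) for seq, _ in batch]
        return seqs, [label for _, label in batch]

    for seq, label in examples:
        # Index of the bucket whose half-open length range contains len(seq).
        bucket_id = bisect_right(bucket_boundaries, len(seq))
        pending[bucket_id].append((seq, label))
        if len(pending[bucket_id]) == bucket_batch_sizes[bucket_id]:
            yield pad(pending.pop(bucket_id))

    # Emit whatever is left in partially filled buckets at the end of the data.
    for bucket_id in sorted(pending):
        yield pad(pending[bucket_id])


text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
label = [1, 2, 1, 2]

# Same settings as the example above: every sequence falls into the middle bucket.
for seqs, labels in bucket_batches(zip(text, label), [0, 8], [2, 2, 2]):
    print(seqs, labels)
# [[1, 2, 3, 0, 0], [3, 4, 5, 6, 7]] [1, 2]
# [[1, 2, 0, 0], [8, 9, 0, 2]] [1, 2]
```

With boundaries [0, 8] all four sequences share one bucket, so batches simply follow input order. A boundary inside the length range, for example [4], would group the short sequences (lengths 3 and 2) separately from the long ones (lengths 5 and 4), which is where bucketing actually reduces padding.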

answered Oct 04 '22 00:10

vijay m

Tags: python, tensorflow, tensorflow-datasets
