
How do I use the "group_by_window" function in TensorFlow

In TensorFlow's new set of input pipeline functions, there is an ability to group sets of records together using the "group_by_window" function. It is described in the documentation here:

https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window

I don't fully understand the explanation here used to describe the function, and I tend to learn best by example. I can't find any example code anywhere on the internet for this function. Could someone please whip up a barebones and runnable example of this function to show how it works, and what to give this function?

asked Jul 25 '17 by John Scolaro




1 Answer

For TensorFlow version 1.9.0, here is a quick example I could come up with:

import tensorflow as tf
import numpy as np

components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)

# Group elements by parity, then batch each group's window into batches of 10.
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.batch(10),
    window_size=100))

iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()

sess = tf.Session()
sess.run(features)  # array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)
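
Calling sess.run(features) again returns the next batch of ten. As a quick sketch using only the objects defined above (standard TF 1.x iteration), you can drain every remaining batch like this; note that the order in which the even and odd groups are flushed at end-of-input is an implementation detail:

try:
    while True:
        print(sess.run(features))
except tf.errors.OutOfRangeError:
    pass  # every window has been flushed and consumed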

The first argument, key_func, maps every element in the dataset to a key.

window_size defines the number of elements collected per key before they are handed to reduce_func.

In reduce_func you receive a block of up to window_size elements, packaged as a Dataset. You can shuffle, batch, or pad them however you want; see the sketch below for a shuffling variant.
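
For instance, a hypothetical variant of the example above that shuffles each parity group before batching it (only the reduce_func changes; shuffle and batch are standard Dataset methods):

# Hypothetical variant: shuffle each parity group before batching it.
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.shuffle(buffer_size=100).batch(10),
    window_size=100))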

EDIT: here is more on dynamic padding and bucketing using the group_by_window function:

If you have a tf.data.Dataset which holds (sequence, sequence_length, label) tuples, where sequence is a tensor of dtype tf.int64:

def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length, returns a bucket id."""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)

def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements and shuffles them."""
    return elements.shuffle(window_size, seed=0)

# Create bucket boundaries from 0 to 500 with an increment of 15 -> [0, 15, 30, ..., 495]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000

# Bucketing: group sequences whose lengths fall into the same bucket.
dataset = dataset.apply(tf.contrib.data.group_by_window(
        lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
        lambda key, elements: reduc_fn(key, elements, window_size), window_size))

# You could pad in reduc_fn, but I'll do it here for clarity.
# The first element of each tuple is the dynamic-length sequence; giving it
# tf.TensorShape([None]) pads it (with 0) to the longest sequence in the batch.
# batch_size and num_epochs are assumed to be defined elsewhere.
dataset = dataset.padded_batch(batch_size, padded_shapes=(
        tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
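
To see the whole pipeline run end to end, here is a minimal driver sketch reusing bucketing_fn and reduc_fn from above; the toy generator and the values for batch_size and num_epochs are made up for illustration and are not part of the original answer:

import numpy as np
import tensorflow as tf

def gen():
    # Toy (sequence, sequence_length, label) triples of varying length.
    for n in (5, 12, 7, 30, 18):
        yield np.arange(n, dtype=np.int64), np.int64(n), np.int64(n % 2)

dataset = tf.data.Dataset.from_generator(
    gen, output_types=(tf.int64, tf.int64, tf.int64),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([]),
                   tf.TensorShape([])))

buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000
batch_size = 2    # made-up value for illustration
num_epochs = 1    # made-up value for illustration

dataset = dataset.apply(tf.contrib.data.group_by_window(
    lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
    lambda key, elements: reduc_fn(key, elements, window_size), window_size))
dataset = dataset.padded_batch(batch_size, padded_shapes=(
    tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
dataset = dataset.repeat(num_epochs)

iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            seqs, lengths, labels = sess.run(features)
            print(lengths, labels)  # sequences of similar length share a batch
    except tf.errors.OutOfRangeError:
        pass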
answered Sep 19 '22 by Maxime De Bruyn