In TensorFlow's new set of input pipeline functions, there is an ability to group sets of records together using the "group_by_window" function. It is described in the documentation here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window
I don't fully understand the explanation used there to describe the function, and I tend to learn best by example, but I can't find any example code for this function anywhere on the internet. Could someone please whip up a barebones, runnable example to show how it works and what to pass to it?
The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
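As a rough illustration of that idea, here is a toy pipeline sketched for this write-up (not code from the question or the documentation) that composes such pieces with the 1.x-style API used below:

import tensorflow as tf
import numpy as np

# Toy pipeline built from small reusable pieces: slice, transform, batch.
data = np.arange(10).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.map(lambda x: x * 2)   # per-element transformation
dataset = dataset.batch(4)               # merge elements into batches
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_batch))  # [0 2 4 6]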
For TensorFlow version 1.9.0, here is a quick example I could come up with:
import tensorflow as tf
import numpy as np
components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.batch(10),
    window_size=100))
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
sess = tf.Session()
sess.run(features) # array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int64)
The first argument, key_func, maps every element in the dataset to a key. The window_size defines the bucket size that is given to the reduce_func. In the reduce_func you receive a block of window_size elements, which you can shuffle, batch or pad however you want.
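For example, a reduce_func that shuffles each window before batching it might look like this (a variation on the snippet above, not part of the original answer; the buffer size is arbitrary):

# Sketch: shuffle the elements of each even/odd window before batching them.
def shuffled_reduce_func(key, window):
    return window.shuffle(buffer_size=100).batch(10)

dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=shuffled_reduce_func,
    window_size=100))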
EDIT: an example of dynamic padding and bucketing using the group_by_window function:
If you have a tf.data.Dataset which holds (sequence, sequence_length, label), where sequence is a tensor of tf.int64:
def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length returns a bucket id"""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)

def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements"""
    return elements.shuffle(window_size, seed=0)
# Create bucket boundaries from 0 to 495 with an increment of 15 -> [0, 15, 30, ..., 495]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000
# Bucketing: key each element on its sequence length, shuffle within each bucket
dataset = dataset.apply(tf.contrib.data.group_by_window(
    lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
    lambda key, window: reduc_fn(key, window, window_size),
    window_size))
# You could pad in the reduce_func, but it is done here for clarity.
# The sequence (first element) has dynamic length; giving it
# tf.TensorShape([None]) pads it (with 0) to the longest sequence in the batch.
batch_size = 32   # example value
dataset = dataset.padded_batch(batch_size, padded_shapes=(
    tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
num_epochs = 10   # example value
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
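For completeness, here is one toy way such a (sequence, sequence_length, label) dataset could be constructed and pushed through the pipeline above. This is purely illustrative, not part of the original answer: the generator, lengths, labels and batch size are made up, and bucketing_fn, reduc_fn, buckets and window_size are reused from the snippet above.

import numpy as np

# Purely illustrative toy data: variable-length int64 sequences with made-up labels.
def toy_generator():
    for length in [3, 7, 5, 12, 4, 6]:
        yield np.arange(length, dtype=np.int64), np.int64(length), np.int64(length % 2)

dataset = tf.data.Dataset.from_generator(
    toy_generator,
    output_types=(tf.int64, tf.int64, tf.int64),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))

# Same bucketing + padding steps as above, with a tiny batch size.
dataset = dataset.apply(tf.contrib.data.group_by_window(
    lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
    lambda key, window: reduc_fn(key, window, window_size),
    window_size))
dataset = dataset.padded_batch(2, padded_shapes=(
    tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))

iterator = dataset.make_one_shot_iterator()
sequences, lengths, labels = iterator.get_next()
with tf.Session() as sess:
    print(sess.run([sequences, lengths, labels]))  # one padded batch of (sequences, lengths, labels)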