
How do I use the "group_by_window" function in TensorFlow

In TensorFlow's new set of input pipeline functions, there is an ability to group sets of records together using the "group_by_window" function. It is described in the documentation here:

https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window

I don't fully understand the explanation here used to describe the function, and I tend to learn best by example. I can't find any example code anywhere on the internet for this function. Could someone please whip up a barebones and runnable example of this function to show how it works, and what to give this function?

asked Jul 25 '17 by John Scolaro




1 Answer

For TensorFlow version 1.9.0, here is a quick example I could come up with:

import tensorflow as tf
import numpy as np

components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)

# Group elements by parity, then batch each group's window into batches of 10.
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.batch(10),
    window_size=100))

iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()

sess = tf.Session()
sess.run(features)  # array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)
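
Calling sess.run(features) again returns the next batch of ten. As a quick sketch using only the objects defined above (standard TF 1.x iteration), you can drain every remaining batch like this; note that the order in which the even and odd groups are flushed at end-of-input is an implementation detail:

try:
    while True:
        print(sess.run(features))
except tf.errors.OutOfRangeError:
    pass  # every window has been flushed and consumed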

The first argument, key_func, maps every element in the dataset to a key.

window_size defines the number of elements collected per key before they are handed to reduce_func.

In reduce_func you receive a block of up to window_size elements, packaged as a Dataset. You can shuffle, batch, or pad them however you want; see the sketch below for a shuffling variant.
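
For instance, a hypothetical variant of the example above that shuffles each parity group before batching it (only the reduce_func changes; shuffle and batch are standard Dataset methods):

# Hypothetical variant: shuffle each parity group before batching it.
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.shuffle(buffer_size=100).batch(10),
    window_size=100))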

EDIT: here is more on dynamic padding and bucketing using the group_by_window function:

If you have a tf.data.Dataset which holds (sequence, sequence_length, label) tuples, where sequence is a tensor of dtype tf.int64:

def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length, returns a bucket id."""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)

def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements and shuffles them."""
    return elements.shuffle(window_size, seed=0)

# Create bucket boundaries from 0 to 500 with an increment of 15 -> [0, 15, 30, ..., 495]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000

# Bucketing: group sequences whose lengths fall into the same bucket.
dataset = dataset.apply(tf.contrib.data.group_by_window(
        lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
        lambda key, elements: reduc_fn(key, elements, window_size), window_size))

# You could pad in reduc_fn, but I'll do it here for clarity.
# The first element of each tuple is the dynamic-length sequence; giving it
# tf.TensorShape([None]) pads it (with 0) to the longest sequence in the batch.
# batch_size and num_epochs are assumed to be defined elsewhere.
dataset = dataset.padded_batch(batch_size, padded_shapes=(
        tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
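
To see the whole pipeline run end to end, here is a minimal driver sketch reusing bucketing_fn and reduc_fn from above; the toy generator and the values for batch_size and num_epochs are made up for illustration and are not part of the original answer:

import numpy as np
import tensorflow as tf

def gen():
    # Toy (sequence, sequence_length, label) triples of varying length.
    for n in (5, 12, 7, 30, 18):
        yield np.arange(n, dtype=np.int64), np.int64(n), np.int64(n % 2)

dataset = tf.data.Dataset.from_generator(
    gen, output_types=(tf.int64, tf.int64, tf.int64),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([]),
                   tf.TensorShape([])))

buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000
batch_size = 2    # made-up value for illustration
num_epochs = 1    # made-up value for illustration

dataset = dataset.apply(tf.contrib.data.group_by_window(
    lambda seq, seq_len, label: bucketing_fn(seq_len, buckets),
    lambda key, elements: reduc_fn(key, elements, window_size), window_size))
dataset = dataset.padded_batch(batch_size, padded_shapes=(
    tf.TensorShape([None]), tf.TensorShape([]), tf.TensorShape([])))
dataset = dataset.repeat(num_epochs)

iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            seqs, lengths, labels = sess.run(features)
            print(lengths, labels)  # sequences of similar length share a batch
    except tf.errors.OutOfRangeError:
        pass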
answered Sep 19 '22 by Maxime De Bruyn