 

How to input a list of lists with different sizes in tf.data.Dataset


I have a long list of lists of integers (each inner list represents a sentence, so the lists have different sizes) that I want to feed using the tf.data library. Because the inner lists have different lengths, I get an error, which I can reproduce here:

```python
t = [[4, 2], [3, 4, 5]]
dataset = tf.data.Dataset.from_tensor_slices(t)
```

The error I get is:

```
ValueError: Argument must be a dense tensor: [[4, 2], [3, 4, 5]] - got shape [2], but wanted [2, 2].
```

Is there a way to do this?

EDIT 1: Just to be clear, I don't want to pad the input list of lists (it's a list of over a million sentences of varying lengths); I want to use the tf.data library to feed a list of lists with varying lengths in a proper way.

Asked Nov 30 '17 by Escachator



2 Answers

You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:

```python
t = [[4, 2], [3, 4, 5]]

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # ==> [4, 2]
    print(sess.run(next_element))  # ==> [3, 4, 5]
```
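If you're on TensorFlow 2, the same generator approach works without sessions: as of TF 2.4, `from_generator` accepts an `output_signature`, and eager iteration replaces the one-shot iterator. A sketch (the `TensorSpec` shape `[None]` marks the variable-length dimension):

```python
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

# TF 2.x equivalent: describe each element with a TensorSpec; shape [None]
# allows a different length per element.
dataset = tf.data.Dataset.from_generator(
    lambda: t,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))

for element in dataset:
    print(element.numpy())  # [4 2] then [3 4 5]

# Variable-length elements can still be batched: padded_batch pads each
# batch only to its own longest element, not across the whole dataset.
batched = dataset.padded_batch(2, padded_shapes=[None])
```

Note that `padded_batch` sidesteps the concern about pre-padding a million sentences: padding happens lazily, per batch, as the pipeline runs.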
Answered Sep 18 '22 by mrry


For those working with TensorFlow 2 and looking for an answer, I found the following to work directly with ragged tensors. It should be much faster than a generator, as long as the entire dataset fits in memory.

```python
t = [[[4, 2]],
     [[3, 4, 5]]]

rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)

for x in dataset:
    print(x)
```

produces

```
<tf.RaggedTensor [[4, 2]]>
<tf.RaggedTensor [[3, 4, 5]]>
```

For some reason, it's very particular about the individual arrays having at least two dimensions, hence the extra level of nesting in `t`.
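That said, in recent TF 2.x releases the extra nesting doesn't appear to be strictly required: slicing a rank-2 `RaggedTensor` yields its rows, which are ordinary dense tensors of varying length. A sketch of the flatter variant (worth verifying against your TF version):

```python
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

# Rows of a rank-2 RaggedTensor are dense, so slicing along the first
# dimension yields plain variable-length tf.Tensor elements rather than
# RaggedTensors.
dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(t))

for x in dataset:
    print(x.numpy())  # [4 2] then [3 4 5]
```

Dense elements can be more convenient downstream, since many ops and layers accept plain tensors without a ragged-to-dense conversion step.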

Answered Sep 18 '22 by FlashDD