I have a long list of lists of integers (each inner list represents a sentence, and the sentences have different lengths) that I want to feed using the tf.data library. Because each inner list has a different length, I get an error, which I can reproduce here:
t = [[4, 2], [3, 4, 5]]
dataset = tf.data.Dataset.from_tensor_slices(t)
The error I get is:
ValueError: Argument must be a dense tensor: [[4, 2], [3, 4, 5]] - got shape [2], but wanted [2, 2].
Is there a way to do this?
EDIT 1: Just to be clear, I don't want to pad the input list of lists (it's a list of over a million sentences of varying lengths); I want to use the tf.data library to feed a list of lists with varying lengths in a proper way.
You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:
t = [[4, 2], [3, 4, 5]]

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # ==> [4, 2]
    print(sess.run(next_element))  # ==> [3, 4, 5]
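If you are on TensorFlow 2.x, a hedged sketch of the same idea: `output_signature` (available from TF 2.4) replaces the positional dtype/shape arguments, and `padded_batch` pads only within each batch to that batch's longest sentence, so the million-element corpus is never padded up front.

```python
# Sketch for TF 2.x (output_signature requires TF >= 2.4).
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

dataset = tf.data.Dataset.from_generator(
    lambda: t,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)

# Pad each batch of 2 sentences only to the longest sentence in that batch.
batched = dataset.padded_batch(2)

for batch in batched:
    print(batch)  # a dense [2, 3] tensor; the shorter row is padded with 0
```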
For those working with TensorFlow 2 and looking for an answer, I found the following to work directly with ragged tensors, which should be much faster than a generator as long as the entire dataset fits in memory.
t = [[[4, 2]], [[3, 4, 5]]]
rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)
for x in dataset:
    print(x)
produces
<tf.RaggedTensor [[4, 2]]> <tf.RaggedTensor [[3, 4, 5]]>
For some reason, it's very particular about the individual arrays having at least two dimensions, hence the extra level of brackets: from_tensor_slices slices away the outermost dimension, so each element of a rank-3 ragged tensor is itself a ragged tensor.
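A hedged sketch, assuming a recent TF 2.x release: a plain rank-2 ragged constant can also be sliced directly, in which case each element comes back as an ordinary dense 1-D tensor and the extra wrapping brackets are not needed.

```python
# Sketch assuming a recent TF 2.x release.
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]          # no extra inner brackets
rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)

for x in dataset:
    print(x)  # dense 1-D tensors: [4 2], then [3 4 5]
```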