For applications such as pairwise text similarity, the input data comes in pairs: pair_1, pair_2. In these problems we usually have multiple inputs. Previously, I trained my models successfully with:
model.fit([pair_1, pair_2], labels, epochs=50)
I decided to replace my input pipeline with the tf.data API. To this end, I created a Dataset similar to:
dataset = tf.data.Dataset.from_tensor_slices((pair_1, pair_2, labels))
It compiles successfully, but when training starts it throws the following exception:
AttributeError: 'tuple' object has no attribute 'ndim'
My Keras and TensorFlow versions are 2.1.6 and 1.11.0, respectively. I found a similar issue in the TensorFlow repository: tf.keras multi-input models don't work when using tf.data.Dataset.
Does anyone know how to fix the issue?
Here is the main part of the code:
(q1_test, q2_test, label_test) = test
(q1_train, q2_train, label_train) = train

def tfdata_generator(sent1, sent2, labels, is_training, batch_size):
    '''Construct a data generator using tf.Dataset'''
    dataset = tf.data.Dataset.from_tensor_slices((sent1, sent2, labels))
    if is_training:
        dataset = dataset.shuffle(1000)  # depends on sample size
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
    return dataset

train_dataset = tfdata_generator(q1_train, q2_train, label_train, is_training=True, batch_size=_BATCH_SIZE)
test_dataset = tfdata_generator(q1_test, q2_test, label_test, is_training=False, batch_size=_BATCH_SIZE)

inps1 = keras.layers.Input(shape=(50,))
inps2 = keras.layers.Input(shape=(50,))

embed = keras.layers.Embedding(input_dim=nb_vocab, output_dim=300, weights=[embedding], trainable=False)
embed1 = embed(inps1)
embed2 = embed(inps2)

gru = keras.layers.CuDNNGRU(256)
gru1 = gru(embed1)
gru2 = gru(embed2)

concat = keras.layers.concatenate([gru1, gru2])
preds = keras.layers.Dense(1, 'sigmoid')(concat)

model = keras.models.Model(inputs=[inps1, inps2], outputs=preds)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())

model.fit(
    train_dataset.make_one_shot_iterator(),
    steps_per_epoch=len(q1_train) // _BATCH_SIZE,
    epochs=50,
    validation_data=test_dataset.make_one_shot_iterator(),
    validation_steps=len(q1_test) // _BATCH_SIZE,
    verbose=1)
Keras is able to handle multiple inputs (and even multiple outputs) via its functional API, one of the three ways to create a Keras model with TensorFlow 2.0 (Sequential, Functional, and Model subclassing).
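For illustration, here is a minimal sketch of a two-input functional model (this is not the OP's model; the layer sizes, names, and variable names are made up, and tf.keras in TF 2.x is assumed):

import tensorflow as tf
from tensorflow import keras

# Two inputs of the same shape, a shared branch, then a merge; purely illustrative.
inp_a = keras.layers.Input(shape=(50,), name="input_1")
inp_b = keras.layers.Input(shape=(50,), name="input_2")

shared = keras.layers.Dense(64, activation="relu")  # same layer applied to both inputs
feat_a = shared(inp_a)
feat_b = shared(inp_b)

merged = keras.layers.concatenate([feat_a, feat_b])
out = keras.layers.Dense(1, activation="sigmoid")(merged)

model = keras.Model(inputs=[inp_a, inp_b], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])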
With that in mind: from_tensors makes a dataset in which each input tensor acts like a single row of the dataset, while from_tensor_slices treats each input tensor as a column of your data; in the latter case all tensors must have the same length along the first dimension, and the elements (rows) of the resulting dataset are tuples with one entry from each column.
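A small sketch of the difference, assuming TF 2.x eager execution (the arrays below are made up for illustration):

import numpy as np
import tensorflow as tf

features = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 rows of 3 features
labels = np.array([0, 1, 0, 1], dtype=np.int64)

# from_tensors: the whole (features, labels) pair becomes a single element.
whole = tf.data.Dataset.from_tensors((features, labels))       # 1 element: ((4, 3), (4,))

# from_tensor_slices: slices along the first dimension, one (row, label) per element.
rows = tf.data.Dataset.from_tensor_slices((features, labels))  # 4 elements: ((3,), ())

for x, y in rows.take(2):
    print(x.numpy(), y.numpy())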
The prefetch transformation can be used to decouple the time when data is produced from the time when it is consumed. In particular, it uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested.
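As a sketch, a typical place to add it is at the very end of the pipeline. The TF 2.x spelling tf.data.AUTOTUNE is used here; the question's TF 1.11 code uses tf.contrib.data.AUTOTUNE instead, and the toy pipeline below is only for illustration:

import tensorflow as tf

# Toy pipeline: prefetch overlaps producing the next batch with consuming the current one.
dataset = (
    tf.data.Dataset.range(1000)
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # let tf.data pick the buffer size
)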
I'm not using Keras, but I would go with tf.data.Dataset.from_generator(), like this:
import numpy as np
import tensorflow as tf

def _input_fn():
    sent1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)
    sent2 = np.array([20, 25, 35, 40, 600, 30, 20, 30], dtype=np.int64)
    sent1 = np.reshape(sent1, (8, 1, 1))
    sent2 = np.reshape(sent2, (8, 1, 1))
    labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.int64)
    labels = np.reshape(labels, (8, 1))

    def generator():
        for s1, s2, l in zip(sent1, sent2, labels):
            yield {"input_1": s1, "input_2": s2}, l

    dataset = tf.data.Dataset.from_generator(
        generator,
        output_types=({"input_1": tf.int64, "input_2": tf.int64}, tf.int64))
    dataset = dataset.batch(2)
    return dataset

...

model.fit(_input_fn(), epochs=10, steps_per_epoch=4)
This generator can iterate over, for example, your text files or NumPy arrays and yields one example per call. In this example, I assume the words of the sentences have already been converted to indices in the vocabulary.
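One thing to watch with the dict-based datasets in this answer: the keys "input_1" and "input_2" have to match the names of the model's Input layers. The answer elides the model, so here is a minimal sketch of one that would fit, with the names set explicitly rather than relying on Keras' default auto-naming (the tiny Dense head and mse loss are assumptions, not the answerer's model):

from tensorflow import keras

# Inputs named to match the generator's dict keys; shapes match the (1, 1) examples above.
input_1 = keras.layers.Input(shape=(1, 1), name="input_1")
input_2 = keras.layers.Input(shape=(1, 1), name="input_2")

merged = keras.layers.concatenate([keras.layers.Flatten()(input_1),
                                   keras.layers.Flatten()(input_2)])
output = keras.layers.Dense(1)(merged)

model = keras.models.Model(inputs=[input_1, input_2], outputs=output)
model.compile(optimizer="adam", loss="mse")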
Edit: Since the OP asked, it should also be possible with Dataset.from_tensor_slices():
def _input_fn():
    sent1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)
    sent2 = np.array([20, 25, 35, 40, 600, 30, 20, 30], dtype=np.int64)
    sent1 = np.reshape(sent1, (8, 1))
    sent2 = np.reshape(sent2, (8, 1))
    labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.int64)
    labels = np.reshape(labels, (8))

    dataset = tf.data.Dataset.from_tensor_slices(
        ({"input_1": sent1, "input_2": sent2}, labels))
    dataset = dataset.batch(2, drop_remainder=True)
    return dataset
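A quick way to confirm that the elements have the ({"input_1": ..., "input_2": ...}, labels) structure a multi-input model expects is to inspect the dataset. This uses TF 2.x syntax (element_spec is not available on the question's TF 1.11), and the printed shapes are what the batch(2, drop_remainder=True) call above would produce:

ds = _input_fn()
print(ds.element_spec)
# ({'input_1': TensorSpec(shape=(2, 1), dtype=tf.int64, ...),
#   'input_2': TensorSpec(shape=(2, 1), dtype=tf.int64, ...)},
#  TensorSpec(shape=(2,), dtype=tf.int64, ...))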
One way to solve your issue could be to use tf.data.Dataset.zip to combine your various inputs:
sent1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.float32)
sent2 = np.array([20, 25, 35, 40, 600, 30, 20, 30], dtype=np.float32)
sent1 = np.reshape(sent1, (8, 1, 1))
sent2 = np.reshape(sent2, (8, 1, 1))
labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.float32)
labels = np.reshape(labels, (8, 1))

dataset_12 = tf.data.Dataset.from_tensor_slices((sent1, sent2))
dataset_label = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((dataset_12, dataset_label)).batch(2).repeat()

model.fit(dataset, epochs=10, steps_per_epoch=4)
This will print:
Epoch 1/10
4/4 [==============================] - 2s 503ms/step
...
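For what it's worth, the zip produces elements of the form ((sent1_slice, sent2_slice), label), which is likely why the flat (pair_1, pair_2, labels) tuple in the question fails: it yields three-element tuples instead of the (inputs, labels) pairs Keras expects for a two-input model. The same nesting can be obtained without zip, since from_tensor_slices accepts nested structures. This sketch is an alternative not taken from the answer above; it reuses the arrays and model from that snippet and assumes a tf.keras version that accepts datasets directly in fit:

# Same element structure as the zipped dataset: ((sent1_slice, sent2_slice), label)
dataset_alt = (
    tf.data.Dataset.from_tensor_slices(((sent1, sent2), labels))
    .batch(2)
    .repeat()
)
model.fit(dataset_alt, epochs=10, steps_per_epoch=4)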