If I have the following data and want to use StringLookup
for preprocessing:
import numpy as np
import pandas as pd
import tensorflow as tf

x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10), 'col3': np.arange(10)})
y = np.arange(10)
First, I need to transform the windowed dataset into a dictionary of tensors, since the model expects tensors as input (maybe there is a better way to do this?):
window_size = 3
dataset = tf.data.Dataset.from_tensor_slices((dict(x), y)).window(window_size, shift=1, drop_remainder=True)
# Extra preprocessing to get dict of tensors
dataset = dataset.flat_map(
    lambda window_x, window_y: tf.data.Dataset.zip(
        ({k: v.batch(window_size) for k, v in window_x.items()}, window_y.batch(window_size))
    )
)
dataset = dataset.batch(3)
for i, j in dataset.take(1):
    print(i, j)
Output:
{'col1': <tf.Tensor: shape=(3, 3), dtype=string, numpy=
array([[b'a', b'b', b'c'],
       [b'b', b'c', b'd'],
       [b'c', b'd', b'e']], dtype=object)>, 'col2': <tf.Tensor: shape=(3, 3), dtype=int64, numpy=
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])>, 'col3': <tf.Tensor: shape=(3, 3), dtype=int64, numpy=
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])>} tf.Tensor(
[[0 1 2]
 [1 2 3]
 [2 3 4]], shape=(3, 3), dtype=int64)
Create a preprocessor for the different dtypes, like in this example:
inputs = {'col1': tf.keras.Input(shape=(), name='col1', dtype=tf.string),
          'col2': tf.keras.Input(shape=(), name='col2', dtype=tf.float32),
          'col3': tf.keras.Input(shape=(), name='col3', dtype=tf.float32)}
vocab = sorted(set(x['col1']))
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
lookup = lookup(inputs['col1'])
numeric = tf.stack([tf.cast(inputs[i], dtype=tf.float32) for i in ['col2', 'col3']], axis=-1)
result = tf.concat([lookup, numeric], axis=-1)
preprocessor = tf.keras.Model(inputs, result)
# Test preprocessor
preprocessor(dict(x))
Output:
<tf.Tensor: shape=(10, 13), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 2., 2.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 3., 3.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 4., 4.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 5., 5.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 6., 6.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 7., 7.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 8., 8.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 9., 9.]],
      dtype=float32)>
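This works only because dict(x) provides rank-1 columns; the windowed dataset above instead yields rank-2 (batch, window) string tensors per column, which a quick check of one batch should confirm:
# For comparison: one batch from the windowed dataset gives rank-2 string tensors per column.
features, _ = next(iter(dataset))
print(features['col1'].shape)  # (3, 3): batch x window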
Create model:
body = tf.keras.models.Sequential([tf.keras.layers.Dense(8),
                                   tf.keras.layers.Dense(window_size)])
x = preprocessor(inputs)
result = body(x)
model = tf.keras.Model(inputs, result)
model.summary()
Output:
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 col1 (InputLayer)              [(None,)]            0           []
 col2 (InputLayer)              [(None,)]            0           []
 col3 (InputLayer)              [(None,)]            0           []
 model_35 (Functional)          (None, 13)           0           ['col1[0][0]',
                                                                  'col2[0][0]',
                                                                  'col3[0][0]']
 sequential_19 (Sequential)     (None, 3)             139        ['model_35[2][0]']
==================================================================================================
Total params: 139
Trainable params: 139
Non-trainable params: 0
__________________________________________________________________________________________________
Compile and train:
model.compile(loss='mae', optimizer='adam')
model.fit(dataset)
Error:
ValueError: Exception encountered when calling layer "string_lookup_24" (type StringLookup).
When output_mode is not `'int'`, maximum supported output rank is 2. Received output_mode one_hot and input shape (None, None), which would result in output rank 3.
Call arguments received:
• inputs=tf.Tensor(shape=(None, None), dtype=string)
How should I build my preprocessor or preprocess my dataset to make it work? Thank you!
Something like this should work for you. The idea is to apply the StringLookup inside the tf.data pipeline, where each window's col1 is still rank 1, so the one-hot encoding stays within the supported rank and the model only ever receives numeric tensors:
import tensorflow as tf
import numpy as np
import pandas as pd
x = pd.DataFrame({'col1': list('abcdefghij'), 'col2': np.arange(10), 'col3': np.arange(10)})
y = np.arange(10)
window_size = 3
dataset = tf.data.Dataset.from_tensor_slices((dict(x), y)).window(window_size, shift=1, drop_remainder=True)
# Extra preprocessing to get dict of tensors
dataset = dataset.flat_map(
    lambda window_x, window_y: tf.data.Dataset.zip(
        {**{k: v.batch(window_size) for k, v in window_x.items()},
         **{'y': window_y.batch(window_size)}}
    )
)
dataset = dataset.map(
    lambda data_dict: ({k: v for k, v in data_dict.items() if k != 'y'}, data_dict['y'])
)
vocab = sorted(set(x['col1']))
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
dataset = dataset.map(
    lambda i, j: ({'col1': lookup(i['col1']), 'col2': i['col2'], 'col3': i['col3']}, j)
).batch(3)
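As an optional sanity check (a quick sketch), inspecting one batch should show the shapes the model will now receive, with col1 already one-hot encoded to depth lookup.vocabulary_size():
# Optional check: shapes of one batch after the mapping above.
for features, labels in dataset.take(1):
    print({k: v.shape for k, v in features.items()}, labels.shape)
    # roughly: col1 -> (3, 3, 11), col2/col3 -> (3, 3), labels -> (3, 3)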
Your model (the inputs now expect the already-encoded col1, with shape (window_size, vocabulary_size)):
inputs = {'col1': tf.keras.Input(shape=(window_size, lookup.vocabulary_size()), name='col1', dtype=tf.float32),
          'col2': tf.keras.Input(shape=(window_size,), name='col2', dtype=tf.float32),
          'col3': tf.keras.Input(shape=(window_size,), name='col3', dtype=tf.float32)}
numeric = tf.stack([inputs['col2'], inputs['col3']], axis=-1)
result = tf.concat([inputs['col1'], numeric], axis=-1)
preprocessor = tf.keras.Model(inputs, result)
body = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
                                   tf.keras.layers.Dense(8),
                                   tf.keras.layers.Dense(window_size)])
x = preprocessor(inputs)
result = body(x)
model = tf.keras.Model(inputs, result)
model.summary()
model.compile(loss='mae', optimizer='adam')
model.fit(dataset)
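After training, a quick check (sketch) is to push one batch from the pipeline through the model and look at the output shape:
# Sanity check: one batch through the trained model.
for features, _ in dataset.take(1):
    print(model(features).shape)  # (batch, window_size), here (3, 3)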