Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to save TextVectorization to disk in tensorflow?

I have trained a TextVectorization layer (see below), and I want to save it to disk, so that I can reload it next time? I have tried pickle and joblib.dump(). It does not work.

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization 

text_dataset = tf.data.Dataset.from_tensor_slices(text_clean) 
    
vectorizer = TextVectorization(max_tokens=100000, output_mode='tf-idf',ngrams=None)
    
vectorizer.adapt(text_dataset.batch(1024))

The generated error is the following:

InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array

How can I save it?

like image 239
yanachen Avatar asked Feb 22 '26 16:02

yanachen


2 Answers

Instead of pickling the object, pickle the configuration and weights. Later unpickle it and use configuration to create the object and load the saved weights. Official docs here.

Code

text_dataset = tf.data.Dataset.from_tensor_slices([
                                                   "this is some clean text", 
                                                   "some more text", 
                                                   "even some more text"]) 
# Fit a TextVectorization layer
vectorizer = TextVectorization(max_tokens=10, output_mode='tf-idf',ngrams=None)    
vectorizer.adapt(text_dataset.batch(1024))

# Vector for word "this"
print (vectorizer("this"))

# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()}
            , open("tv_layer.pkl", "wb"))

print ("*"*10)
# Later you can unpickle and use 
# `config` to create object and 
# `weights` to load the trained weights. 

from_disk = pickle.load(open("tv_layer.pkl", "rb"))
new_v = TextVectorization.from_config(from_disk['config'])
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v.set_weights(from_disk['weights'])

# Lets see the Vector for word "this"
print (new_v("this"))

Output:

tf.Tensor(
[[0.         0.         0.         0.         0.91629076 0.
  0.         0.         0.         0.        ]], shape=(1, 10), dtype=float32)
**********
tf.Tensor(
[[0.         0.         0.         0.         0.91629076 0.
  0.         0.         0.         0.        ]], shape=(1, 10), dtype=float32)
like image 151
mujjiga Avatar answered Feb 25 '26 07:02

mujjiga


One can use a bit of a hack to do this. Construct your TextVectorization object, then put it in a model. Save the model to save the vectorizer. Loading the model will reproduce the vectorizer. See the example below.

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

data = [
    "The sky is blue.",
    "Grass is green.",
    "Hunter2 is my password.",
]

# Create vectorizer.
text_dataset = tf.data.Dataset.from_tensor_slices(data)
vectorizer = TextVectorization(
    max_tokens=100000, output_mode='tf-idf', ngrams=None,
)
vectorizer.adapt(text_dataset.batch(1024))

# Create model.
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorizer)

# Save.
filepath = "tmp-model"
model.save(filepath, save_format="tf")

# Load.
loaded_model = tf.keras.models.load_model(filepath)
loaded_vectorizer = loaded_model.layers[0]

Here is a test that both vectorizers (original and loaded) produce the same output.

import numpy as np

np.testing.assert_allclose(loaded_vectorizer("blue"), vectorizer("blue"))
like image 33
jakub Avatar answered Feb 25 '26 09:02

jakub



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!