
How to serialize a CountVectorizer with a custom tokenize function with joblib

I use a CountVectorizer with a custom tokenize method. When I serialize it and then unserialize it, I get the following error:

AttributeError: module '__main__' has no attribute 'tokenize'

How can I "serialize" the tokenize method?

Here is a small example:

import joblib
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

stemmer = FrenchStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

tfidf_vec = TfidfVectorizer(tokenizer=tokenize,
                            stop_words=stopwords.words('french'),
                            ngram_range=(1, 1))

clf = MLPClassifier(solver='lbfgs', alpha=0.02, hidden_layer_sizes=(400, 50))

pipeline = Pipeline([("tfidf", tfidf_vec),
                     ("MLP", clf)])

joblib.dump(pipeline, "../models/classifier.pkl")
asked Mar 28 '26 02:03 by Maxime Maillot

1 Answer

joblib (and pickle, which it uses under the hood) serializes a function by reference: it records only the import path needed to find it again, i.e. the module name and function name. So if you define a function in an interactive session (or in a script run as __main__), there is no module to import it from later; the function is gone as soon as the process exits.

To make serialization work, put this code into a Python module (save it as a .py file), and make sure that module is available (importable) when you call joblib.load.
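A minimal sketch of the idea, using pickle directly (joblib delegates to the same mechanism). The module name my_tokenizers and the simplified tokenize body are illustrative; in practice you would save your real tokenize/stem_tokens functions into that .py file yourself rather than writing it at runtime.

```python
import pathlib
import pickle
import sys

# Simulate creating the module file; normally this file already
# exists in your project. The body is a simplified stand-in for
# the NLTK-based tokenize from the question.
pathlib.Path("my_tokenizers.py").write_text(
    "def tokenize(text):\n"
    "    return text.split()\n"
)

sys.path.insert(0, ".")  # make sure the module is importable
import my_tokenizers

# Pickling stores only the reference "my_tokenizers.tokenize" ...
payload = pickle.dumps(my_tokenizers.tokenize)

# ... so unpickling succeeds in any process where my_tokenizers
# is importable (unlike a function defined in __main__).
fn = pickle.loads(payload)
print(fn("le petit chat"))
```

The same rule applies to the whole pipeline: dump it from a script that imports tokenize from the module, and import that module before calling joblib.load.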

answered Mar 29 '26 15:03 by Mikhail Korobov


