I am developing a spam classifier using scikit-learn.
Here is my vectorizing code:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',
    sublinear_tf=True,
    strip_accents='unicode',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1),
    max_features=10000)
tfidf = vectorizer.fit(data['text'])
features = vectorizer.transform(data['text'])
import pickle
pickle.dump(tfidf, open('tfidf.pickle', 'wb'))
Here is what I am doing to predict new input:
import pickle
import joblib
model = joblib.load('model')
vect = pickle.load(open('tfidf.pickle', 'rb'))
new = vect.transform(['some new text...'])
model.predict(new)
When I open the vectorizer file (tfidf.pickle) and try to predict a new message, I get the following error:
ValueError: X.shape[1] = 7148 should be equal to 38011, the number of features at training time
The error message says that your model expects 38011-dimensional inputs, while your TF-IDF vectorizer outputs 7148-dimensional vectors. In other words, you have a model/preprocessor mismatch: the model was trained on features produced by a different (or differently fitted) vectorizer than the one you saved.
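One way to see where the 7148 comes from is that a TfidfVectorizer's output dimension is exactly the size of its learned vocabulary. A minimal sketch (the documents here are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for your training data
docs = ['spam spam offer', 'hello friend', 'free offer now']
vect = TfidfVectorizer().fit(docs)
X = vect.transform(docs)

# The second dimension of the output always equals the vocabulary size;
# in your case len(vect.vocabulary_) would be 7148 while the model
# expects 38011 features.
print(X.shape[1], len(vect.vocabulary_))
```

So loading your saved tfidf.pickle and checking len(vect.vocabulary_) is a quick way to confirm which vectorizer you actually serialized.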
A good way to avoid this preprocessing/model mismatch is to use a scikit-learn pipeline. For instance, you could train your model and your TF-IDF vectorizer together with the following piece of code (using a logistic regression as an example):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(...your TF-IDF arguments...)
model = LogisticRegression(...your model arguments...)
pipeline = make_pipeline(vectorizer, model)
pipeline.fit(X, y)
You can then serialize and load the whole pipeline with pickle or joblib, similarly to what you were already doing, e.g. pickle.dump(pipeline, open('spam_pipeline.pickle', 'wb')) and then pipeline = pickle.load(open('spam_pipeline.pickle', 'rb')). Because the fitted vectorizer and the model travel together as one object, they can never get out of sync.
You can directly use the predict method of the pipeline to get a prediction.
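Putting it all together, here is a small end-to-end sketch of the approach (the texts, labels, and the spam_pipeline.pickle filename are illustrative placeholders for your own data):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for data['text'] and your labels
texts = ['win a free prize now', 'meeting at noon tomorrow',
         'free offer click here', 'lunch with the team']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Vectorizer and model are fitted together as a single object
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Serialize and reload the whole pipeline in one go
pickle.dump(pipeline, open('spam_pipeline.pickle', 'wb'))
loaded = pickle.load(open('spam_pipeline.pickle', 'rb'))

# predict() takes raw text: the pipeline vectorizes then classifies
preds = loaded.predict(['free prize offer'])
print(preds)
```

Note that predict() is called on raw strings, not on pre-vectorized features; the pipeline applies the fitted TF-IDF transform internally before classifying.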
Let me know if you need more details.