I am performing supervised machine learning using Scikit-learn. I have two datasets. The first dataset contains X features and Y labels. The second dataset contains only X features, with no Y labels. I can successfully train and test a LinearSVC on the first dataset and get Y labels for its test split.
Now I want to use the model trained on the first dataset to predict labels for the second dataset. How do I apply the pre-trained model from the first dataset to the second (unlabeled) dataset in Scikit-learn?
Code snippet from my attempts (updated based on the comments below):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle
# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
'Harry potter book is awesome. It rocks',
'Nutrition is very important',
'Welcome to library, you can find as many book as you like',
'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]
# books = 1 : y label
# food = 0 : y label
df = pd.DataFrame({'text':some_text,
'y_variable': y_variable
})
# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape
# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
labels,
train_size=0.5,
random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))
# ----------- Dataset 2: UNSEEN DATASET ----------- #
some_text2 = ['Harry potter books are amazing',
'Gluten free diet is getting popular']
unseen_df = pd.DataFrame({'text':some_text2}) # Note: no y_variable here. This is the dataset whose y_variable labels (1 or 0) I am trying to predict.
# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here:
# ValueError: X has 11 features per sample; expecting 26
print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)
# Desired output for the UNSEEN (unlabeled) data after prediction:
# text                                  y_variable
# Harry potter books are amazing        1
# Gluten free diet is getting popular   0
It doesn't have to use pickle as I tried above. Does anyone have a suggestion, or is there a built-in scikit-learn function that handles this kind of prediction?
As you can see, your first `tfidf` turns its input into 26 features, while your second `tfidf` turns its input into 11 features. The error therefore occurs because `X_train` and `X_unseen` have different numbers of columns. The error message tells you that each observation in `X_unseen` has fewer features than the number `model` was trained to receive.
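To make the mismatch concrete, here is a minimal sketch using the sample texts from the question. The variable names `train_text`, `unseen_text`, `fitted_on_train`, and `refitted` are illustrative and not part of the original code. It shows that refitting a vectorizer on the new text builds a new, smaller vocabulary, while reusing the fitted vectorizer keeps the column count the model expects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts copied from the question
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# Vectorizer fitted on the training text: its vocabulary defines the columns
fitted_on_train = TfidfVectorizer()
X_all = fitted_on_train.fit_transform(train_text)

# A second vectorizer fitted on the unseen text learns a different vocabulary
refitted = TfidfVectorizer()
X_unseen_wrong = refitted.fit_transform(unseen_text)

# Reusing the first vectorizer with transform() keeps the training columns
X_unseen_right = fitted_on_train.transform(unseen_text)

print(X_all.shape)           # (5, 26)
print(X_unseen_wrong.shape)  # (2, 11)
print(X_unseen_right.shape)  # (2, 26)
```

Words in the unseen text that were not in the training vocabulary (for example "gluten") are simply ignored by `transform`, which is why the column count stays at 26.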
Once you load `model` in the second script, you are fitting another vectorizer to the text. That is, `tfidf` from the first script and `tfidf` from the second one are different objects with different vocabularies. To make predictions with `model`, you need to transform the unseen text using the original `tfidf`. To do this, export the original vectorizer, load it in the new script, and call its `transform` method (not `fit_transform`) on the new data before passing the result to `model`.
### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))
### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))
# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()
# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)
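As an alternative to pickling two separate objects, one option (not what the code above does) is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so a single object is saved and loaded. The sketch below assumes the sample data from the question. The file name `text_clf.pkl` and the variable names `text_clf` and `loaded_clf` are made up for illustration, and the model is fitted on all five labeled rows rather than on a train/test split.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Labeled sample data from the question (books = 1, food = 0)
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
train_labels = [1, 1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier together,
# so the learned vocabulary always travels with the model
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(train_text, train_labels)

# Save and reload one object (hypothetical file name, written to a temp dir)
model_path = os.path.join(tempfile.gettempdir(), 'text_clf.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(text_clf, f)
with open(model_path, 'rb') as f:
    loaded_clf = pickle.load(f)

# Predict directly from raw text; the pipeline calls transform() internally
unseen_df = pd.DataFrame({'text': ['Harry potter books are amazing',
                                   'Gluten free diet is getting popular']})
unseen_df['y_variable'] = loaded_clf.predict(unseen_df['text'])
print(unseen_df)
```

This produces a table in the shape the question asks for, with a `text` column and a predicted `y_variable` column. With only five training sentences, the predicted labels themselves should not be taken as reliable.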