Predict unseen data with a previously trained model

Question

I am doing supervised machine learning with scikit-learn. I have two datasets. The first contains both X features and Y labels. The second contains only X features and no Y labels. I can successfully train a LinearSVC on the training split of the first dataset and predict Y labels for its test split.

Now I want to use the model trained on the first dataset to predict labels for the second dataset. How do I apply a previously trained model to a second, unlabeled dataset in scikit-learn?

Code snippet from my attempt (updated following the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle


# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]

# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text':some_text,
                   'y_variable': y_variable
                          })

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape


# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
                                                 labels,
                                                 train_size=0.5,
                                                 random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))


# ----------- Dataset 2: UNSEEN DATASET ----------- #

some_text2 = ['Harry potter books are amazing',
             'Gluten free diet is getting popular']

unseen_df = pd.DataFrame({'text':some_text2}) # Notice this doesn't have y_variable. This is the dataset whose y_variable labels (1 or 0) I am trying to predict.


# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26


print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)


# Desired output for the UNSEEN, unlabeled data:
# text                                   y_variable
# Harry potter books are amazing         1
# Gluten free diet is getting popular    0

It doesn't have to use pickle as I tried above. Does anyone have a suggestion, or is there a pre-built function in scikit-learn that does this kind of prediction?

Arturo Sbr · Accepted Answer

As you can see, the first tfidf fit turns your input into 26 features, while the second turns it into 11. The error occurs because X_unseen has a different number of columns than X_train. The error message tells you that each observation in X_unseen has fewer features than model was trained to receive.
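The mismatch can be reproduced without training any model. The following is a minimal sketch that reuses the question's sample sentences; the variable names `refit` and `X_refit` are only illustrative. It compares the number of columns produced by a vectorizer fitted on the training text, by a vectorizer fitted on the unseen text, and by the original vectorizer applied to the unseen text with `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# Vectorizer fitted on the training text: learns the training vocabulary
tfidf = TfidfVectorizer()
X_train_all = tfidf.fit_transform(train_text)

# A vectorizer fitted on the unseen text learns a different, smaller vocabulary
refit = TfidfVectorizer()
X_refit = refit.fit_transform(unseen_text)

# Reusing the original vectorizer keeps the training vocabulary
X_unseen = tfidf.transform(unseen_text)

print(X_train_all.shape[1])  # 26
print(X_refit.shape[1])      # 11
print(X_unseen.shape[1])     # 26
```

Words in the unseen text that the original vectorizer never saw, such as "gluten", are simply ignored by `transform`, which is why the column count stays at 26.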

Once you load model in the second script, you are fitting another vectorizer to the new text. That is, tfidf from the first script and tfidf from the second one are different objects with different vocabularies. To make predictions with model, you need to transform X_unseen using the original, already fitted tfidf. To do this, export the original vectorizer, load it in the new script, and transform the new data with it before passing the result to model.

### Do this in the first program
# Dump both the trained model and the fitted tfidf
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('tfidf.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

### Do this in the second program
import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('tfidf.pkl', 'rb') as f:
    tfidf = pickle.load(f)

# Use `transform` instead of `fit_transform` so the training vocabulary is reused
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen` with the loaded model
y_pred_unseen = model.predict(X_unseen)
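As a possible alternative to saving two files, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, so that only one object has to be saved and loaded. This is a sketch, not part of the original answer: it fits on all five sample sentences rather than on a train split, the name `text_clf` is illustrative, and it round-trips through `pickle.dumps`/`pickle.loads` in memory where a real script would write to a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
train_labels = [1, 1, 0, 1, 0]  # books = 1, food = 0

# The pipeline fits the vectorizer and the classifier together on raw text
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(train_text, train_labels)

# Serialize and restore the whole pipeline as one object
# (in-memory here; use pickle.dump / pickle.load with a file in practice)
restored = pickle.loads(pickle.dumps(text_clf))

unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# predict() applies the fitted vectorizer's transform, then the classifier
predictions = restored.predict(unseen_text)
print(list(zip(unseen_text, predictions)))
```

Because the pipeline calls `transform` on the fitted vectorizer internally, there is no opportunity to call `fit_transform` on the unseen data by mistake.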

Tags:

python

python-3.x

machine-learning

scikit-learn

sharp
