However, I am struggling. I have a column (containing string values), like:
Sex
Male
Female
# This is just an example; the real column has many more unique values
Now comes the issue: I received a new (unique) value (e.g. 'Neutral' was added), and now I cannot make predictions anymore.
Since I am transforming the 'Sex' column into dummies, my model no longer accepts the input:
Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
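The mismatch can be reproduced with a minimal sketch. It assumes the dummies are built with `pandas.get_dummies` (the post does not say which function is used), and the two small frames are made-up stand-ins for the training data and the new data.

```python
import pandas as pd

# Hypothetical training data: only two categories were seen
train = pd.DataFrame({"Sex": ["Male", "Female", "Male"]})
# Hypothetical new data: contains the unseen category 'Neutral'
new = pd.DataFrame({"Sex": ["Male", "Neutral", "Female"]})

train_dummies = pd.get_dummies(train, columns=["Sex"])
new_dummies = pd.get_dummies(new, columns=["Sex"])

print(list(train_dummies.columns))  # ['Sex_Female', 'Sex_Male']
print(list(new_dummies.columns))    # ['Sex_Female', 'Sex_Male', 'Sex_Neutral']
```

Because `get_dummies` derives its columns from whatever values are present, the new data ends up with one more feature than the model was trained on.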
Therefore my question: is there a way to make my model robust, so that it simply ignores this class and still makes a prediction without that specific information?
What I have tried:
import pickle
import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)
# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')
# Check for missing columns and add them to the new dataset
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing columns with 0 values (which is OK, since everything is a dummy)
# Make sure that we have the same column order
df = df[example_df.columns]
# The prediction will lead to an error!
results = model.predict(df)
# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y
Note: I searched but could not find any helpful solution (not here, here or here).
UPDATE
I also found this article, but it has the same issue: we can give the test set the same columns as the training set, but what about new real-world data (e.g. the new value 'Neutral')?
To save the model all we need to do is pass the model object into the dump() function of Pickle. This will serialize the object and convert it into a “byte stream” that we can save as a file called model.
The pickle module keeps track of the objects it has already serialized, so that later references to the same object are not serialized again, which speeds up execution. This allows a model to be saved in very little time.
To use the model, we first need to save it and then load it in a different process. Pickle is a serialization/deserialization module that is built into Python: with it we can save an arbitrary Python object (with a few exceptions) to a file. Once we have a file, we can load the model from it in a different process.
In a nutshell: using pickle, save your model to disk with the dump() function and unpickle it in your Python code with the load() function. Use the open() function to create and/or read a .pkl file, and make sure you open the file in binary mode: wb for writing and rb for reading.
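A minimal sketch of the save/load cycle described above. It assumes a plain dictionary as a stand-in for a trained model, and the file name `model_example.pkl` in the system temp directory is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object is handled the same way
model = {"weights": [0.1, 0.2], "bias": 0.5}

# Hypothetical file location, chosen only for this illustration
path = os.path.join(tempfile.gettempdir(), "model_example.pkl")

# Serialize: the file must be opened in binary write mode
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deserialize: the file must be opened in binary read mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Using `with` closes the file even if pickling fails, which a bare `open()` inside `pickle.load(...)` does not guarantee.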
Correct: once training is done, you can't add a new category or feature to the model without retraining it.
OneHotEncoder can handle the problem of new categories appearing in a feature of the test data.
With handle_unknown='ignore', it takes care of keeping the columns for categorical variables consistent between your training and test data.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)

df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                     remainder='passthrough')),
                  ('lr', LogisticRegression())])
model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes'], dtype=object)
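To see why the prediction above no longer fails, it may help to look at the encoder on its own. This is a small sketch with made-up data, not part of the original answer: with handle_unknown='ignore', a category that was not seen during fit is encoded as a row of zeros, so the number of output columns stays the same.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration: 'neutral' never appears during fit
train = pd.DataFrame({"feature_2": ["male", "female", "male"]})
new = pd.DataFrame({"feature_2": ["male", "neutral"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is sparse; convert to a dense array for printing
encoded = encoder.transform(new).toarray()

print(encoder.categories_)  # learned categories: 'female', 'male'
print(encoded)
# [[0. 1.]
#  [0. 0.]]
```

The all-zero row means the unseen value contributes no information for that feature, and the model predicts from the remaining features only.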