 

Can we make the ML model (pickle file) more robust by accepting (or ignoring) new features?

  • I have trained an ML model and stored it in a pickle file.
  • In my new script, I am reading new 'real world data', on which I want to do a prediction.

However, I am struggling. I have a column (containing string values) like:

Sex       
Male       
Female
# This is just an example; in reality it has many more unique values

Now comes the issue: I received a new unique value (e.g. 'Neutral' was added), and now I cannot make predictions anymore.

Since I am transforming the 'Sex' column into dummies, my model no longer accepts the input:

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3

Therefore my question: is there a way to make my model robust, so it just ignores this class and still makes a prediction without that specific info?

What I have tried:

import pickle
import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
model = pickle.load(open("model_trained.sav", 'rb'))

# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Checking for missing columns, and adding that to the new dataset 
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing columns with 0 values (which is OK, since everything is a dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and input n_features is Y
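The column-alignment idea can also be sketched in isolation with reindex (a sketch, assuming the dummies come from pd.get_dummies; the hypothetical train_cols stands in for example_df.columns). Reindexing against the training columns adds missing dummy columns as 0 and drops unseen ones (such as a 'Neutral' dummy) in a single step:

```python
import pandas as pd

# hypothetical training-time columns (what example_df.columns would provide)
train_cols = ['Age', 'Sex_Female', 'Sex_Male']

# new real-world row containing the unseen category 'Neutral'
new_df = pd.DataFrame({'Age': [30], 'Sex': ['Neutral']})
new_dummies = pd.get_dummies(new_df, columns=['Sex'])  # columns: Age, Sex_Neutral

# reindex: adds the missing training columns as 0 and drops 'Sex_Neutral'
aligned = new_dummies.reindex(columns=train_cols, fill_value=0)
print(list(aligned.columns))  # ['Age', 'Sex_Female', 'Sex_Male']
```

This guarantees the frame passed to predict has exactly the training columns, in the training order.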

Note: I searched but could not find any helpful solution (not here, here, or here).

UPDATE

Also found this article, but it has the same issue: we can build the test set with the same columns as the training set... but what about new real-world data (e.g. the new value 'Neutral')?

R overflow asked Nov 19 '20
People also ask

How do you store a ML model in pickles?

To save the model, all we need to do is pass the model object into Pickle's dump() function. This will serialize the object, converting it into a "byte stream" that we can save as a file.

What is the use of pickle file in machine learning?

The pickle module keeps track of the objects it has already serialized, so that later references to the same object won't be serialized again, allowing for faster execution. It also lets you save a model in very little time.

How does machine learning model deploy with pickle?

To use it, we first need to save the model and then load it in a different process. Pickle is a serialization/deserialization module that is built into Python: using it, we can save an arbitrary Python object (with a few exceptions) to a file. Once we have a file, we can load the model from it in a different process.

How do you pickle the model for future use?

In a nutshell: using pickle, simply save your model to disk with the dump() function and de-pickle it into your Python code with the load() function. Use the open() function to create and/or read from a .pkl file, and make sure you open the file in binary mode: 'wb' for writing and 'rb' for reading.
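Putting the answers above together, a minimal round-trip might look like this (the file name and the tiny toy model are illustrative, not from the original post):

```python
import pickle
from sklearn.linear_model import LogisticRegression

# fit a tiny toy model so there is something to serialize
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# dump: serialize the model to a .pkl file, opened in binary write mode ('wb')
with open('model_trained.pkl', 'wb') as f:
    pickle.dump(model, f)

# load: de-serialize in a (possibly different) process, binary read mode ('rb')
with open('model_trained.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored.predict([[2.5]]))  # same predictions as the original model
```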


1 Answer

No, you can't include a new category or feature in a dataset (i.e. update the model) after the training part is done. However, OneHotEncoder can handle the problem of new categories appearing in a feature in test data: with handle_unknown='ignore', it keeps the columns consistent between your training and test data with respect to the categorical variables.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
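To tie this back to the pickle workflow in the question: since the OneHotEncoder lives inside the Pipeline, pickling the whole fitted pipeline preserves the preprocessing too, so the reloaded model keeps ignoring unknown categories. A sketch (file name and deterministic toy target are illustrative):

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(['yes', 'no'] * 10)  # deterministic toy target

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                     remainder='passthrough')),
                  ('lr', LogisticRegression())])
model.fit(df, target)

# pickle the whole pipeline, not just the final estimator
with open('pipeline_model.sav', 'wb') as f:
    pickle.dump(model, f)
with open('pipeline_model.sav', 'rb') as f:
    loaded = pickle.load(f)

# unseen categories are still ignored after the round-trip
test_df = pd.DataFrame({'feature_1': np.random.rand(5),
                        'feature_2': ['male', 'female', 'neutral', 'unknown', 'male']})
print(loaded.predict(test_df))  # 5 predictions, no shape error
```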
Venkatachalam answered Oct 15 '22