
How to pre-process new instances for classification, so that the feature encoding is the same as the model with Scikit-learn?

I am creating models using multi-class classification for data, which has 6 features. I am pre-processing the data with the code below, using LabelEncoder.

# Encode the data in each feature column.
def pre_process_data(self):
    self.encode_column('feedback_rating')
    self.encode_column('location')
    self.encode_column('condition_id')
    self.encode_column('auction_length')
    self.encode_column('model')
    self.encode_column('gb') 

# Get the column by name, label-encode its values, and write the
# encoded values back to the column.
def encode_column(self, name):
    le = preprocessing.LabelEncoder()
    current_column = np.array(self.X_df[name]).tolist()
    self.X_df[name] = le.fit_transform(current_column)

When I want to predict a new instance I need to transform the data of the new instance so that the features match the same encoding as those in the model. Is there a simple way of achieving this?

Also if I want to persist the model and retrieve it, then is there a simple way of saving the encoding format, in order to use it to transform new instances on the retrieved model?

Rich Gray asked Mar 20 '15 16:03


1 Answer

When I want to predict a new instance I need to transform the data of the new instance so that the features match the same encoding as those in the model. Is there a simple way of achieving this?

I'm not entirely sure how your classification 'pipeline' operates, but you can call transform on your already-fitted LabelEncoder to encode new data. le will transform the new data, provided every label in it also appeared in the training set.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# training data (NumPy coerces the mixed list to strings: '0', '1', ...)
train_x = [0,1,2,6,'true','false']
le.fit_transform(train_x)
# array([0, 1, 2, 3, 5, 4])

# transform some new data
new_x = [0,0,0,2,2,2,'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])

# transform data containing a label not seen during training
bad_x = [0,2,6,'new_word']
le.transform(bad_x)
# raises ValueError because 'new_word' was not in the training data
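
For the multi-column case in the question, the same idea can be applied per column by keeping each fitted encoder instead of discarding it inside encode_column. The following is only a minimal sketch, not the asker's actual class: the helper names fit_encoders and encode_instance and the sample values are hypothetical, and it assumes the training data sits in a pandas DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(df, columns):
    """Fit one LabelEncoder per column, encode df in place, return the encoders."""
    encoders = {}
    for name in columns:
        le = LabelEncoder()
        df[name] = le.fit_transform(df[name].tolist())
        encoders[name] = le  # keep the fitted encoder for later reuse
    return encoders

def encode_instance(instance, encoders):
    """Encode a dict of raw feature values with the already-fitted encoders."""
    return {name: int(le.transform([instance[name]])[0])
            for name, le in encoders.items()}

# Hypothetical training data using two of the question's column names.
train_df = pd.DataFrame({
    'location': ['UK', 'US', 'UK', 'DE'],
    'model': ['5s', '6', '6', '5c'],
})
encoders = fit_encoders(train_df, ['location', 'model'])

# A new instance is encoded with the same mapping the model was trained on.
new_instance = {'location': 'US', 'model': '5c'}
encoded = encode_instance(new_instance, encoders)
print(encoded)
```

Because each encoder is fitted only once, on the training data, a new instance always receives the same integer codes the classifier saw during training.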

Also if I want to persist the model and retrieve it, then is there a simple way of saving the encoding format, in order to use it to transform new instances on the retrieved model?

You can save a fitted model, or parts of one such as the encoder, with joblib or pickle:

import pickle  # Python 3 replacement for cPickle
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train_x = [0,1,2,6,'true','false']
le.fit_transform(train_x)

# Save your encoding
joblib.dump(le, '/path/to/save/model')
# OR
with open('/path/to/model', 'wb') as f:
    pickle.dump(le, f)

# Load those encodings
le = joblib.load('/path/to/save/model')
# OR
with open('/path/to/model', 'rb') as f:
    le = pickle.load(f)

# Then use as normal
new_x = [0,0,0,2,2,2,'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])
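
With several encoded columns, one option is to store all the fitted encoders in a single dictionary and persist that object in one file. This is a rough sketch under assumptions: the column names are borrowed from the question, the training values are made up, and the temporary directory and file name are placeholders for wherever you actually keep model artifacts.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical per-column encoders, fitted on made-up training values.
encoders = {
    'location': LabelEncoder().fit(['DE', 'UK', 'US']),
    'model': LabelEncoder().fit(['5c', '5s', '6']),
}

# Save every encoder in one file (placeholder location in a temp directory).
path = os.path.join(tempfile.mkdtemp(), 'encoders.joblib')
joblib.dump(encoders, path)

# Later, for example in the process that serves predictions: load and reuse.
restored = joblib.load(path)
codes = restored['location'].transform(['US', 'DE']).tolist()
classes = restored['model'].classes_.tolist()
print(codes, classes)
```

The trained classifier could be stored in the same dictionary so that the model and its encoders are always saved and loaded together. Files written by joblib or pickle should only be loaded from sources you trust.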
AGS answered Oct 19 '22 23:10