LabelEncoder - reverse and use categorical data on model

I am working on a prediction project (for fun). Basically, I pulled male and female names from nltk, labeled each name as 'male' or 'female', extracted the last letter of each name, and finally used different machine learning algorithms to train on and predict gender from that last letter.
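
Roughly, the data preparation looks like the sketch below (a hypothetical reconstruction, assuming the nltk names corpus has been downloaded with nltk.download('names')):

import pandas as pd
from nltk.corpus import names  # requires nltk.download('names')

# Build one dataframe per gender, keeping the last letter of each name
male_df = pd.DataFrame({'name': names.words('male.txt')})
male_df['last_letter'] = male_df['name'].str[-1]
male_df['gender'] = 'male'

female_df = pd.DataFrame({'name': names.words('female.txt')})
female_df['last_letter'] = female_df['name'].str[-1]
female_df['gender'] = 'female'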

We know that Python's sklearn does NOT handle string categorical data directly, so I used LabelEncoder to transform the last letter into numeric values:

Before transform:

     name     last_letter    gender
0    Aamir    r              male
1    Aaron    n              male
2    Abbey    y              male
3    Abbie    e              male
4    Abbot    t              male

     name       last_letter    gender
0    Abagael    l              female
1    Abagail    l              female
2    Abbe       e              female
3    Abbey      y              female
4    Abbi       i              female

And if we concatenate the two dataframes, drop the name column and shuffle (a sketch of this step follows the preview below):

     last_letter    gender
0    a              male
1    e              female
2    g              male
3    h              male
4    e              male
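
A minimal sketch of that concatenate / drop / shuffle step, assuming the per-gender dataframes are named male_df and female_df as above:

import pandas as pd

df = pd.concat([male_df, female_df], ignore_index=True)
df = df.drop(columns='name')
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
df.head()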

Then I used LabelEncoder:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Encode every column in place; note that fit_transform refits the encoder
# on each column, so after the loop it only remembers the last column's mapping
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])
df.head()

The dataframe becomes:

     last_letter    gender
0    1              male
1    5              female
2    7              male
3    8              male
4    5              male
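
For context, the training step looks roughly like this sketch (hypothetical variable names; the feature matrix is the encoded last_letter column and the target is gender):

from sklearn.ensemble import RandomForestClassifier

X = df[['last_letter']]  # encoded last letters, as a 2-D feature matrix
y = df['gender']         # gender labels

rf_model = RandomForestClassifier()
rf_model.fit(X, y)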

As you can see, after training the model (let's say a Random Forest here), if I want to use it to predict the gender for a new letter,

e.g. rf_model.predict('a')

it's not going to work, since the model only accepts numeric values. In this case, if I do:

rf_model.predict(1) (assume letter 'a' is encoded as number 1)

The model prediction result returns

array([1])

So how do I do something like:

rf_model.predict('a') 

and get a result like 'female' or 'male', instead of having to pass in a numeric value and getting a numeric value back?

asked Oct 29 '22 by thatMeow
1 Answer

Just use the same LabelEncoder you created! Since you already fit it on the training data, you can apply it directly to new data with its transform method.

In [2]: from sklearn.preprocessing import LabelEncoder

In [3]: label_encoder = LabelEncoder()

In [4]: label_encoder.fit_transform(['a', 'b', 'c'])
Out[4]: array([0, 1, 2])

In [5]: label_encoder.transform(['a'])
Out[5]: array([0])
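
And if you ever need to go the other way (the "reverse" part of the title), the same fitted encoder can map encoded values back to the original letters with inverse_transform. A quick sketch:

# Map encoded values back to the original letters using the same fitted encoder
label_encoder.inverse_transform([0])  # expected: array(['a'], dtype='<U1')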

To use it with RandomForestClassifier,

In [59]: from sklearn.ensemble import RandomForestClassifier

In [60]: X = ['a', 'b', 'c']

In [61]: y = ['male', 'female', 'female']

In [62]: X_encoded = label_encoder.fit_transform(X)

In [63]: rf_model = RandomForestClassifier()

In [64]: rf_model.fit(X_encoded[:, None], y)
Out[64]: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [65]: x = ['a']

In [66]: x_encoded = label_encoder.transform(x)

In [67]: rf_model.predict(x_encoded[:, None])
Out[67]: 
array(['male'], 
      dtype='<U6')

As you can see, you can get the string outputs 'male' and 'female' directly from the classifier if you fit it with string labels.
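
Putting it together, a small helper in the spirit of rf_model.predict('a') from the question could look like the sketch below (predict_gender is a hypothetical name; it assumes label_encoder and rf_model were fitted as above):

def predict_gender(letter):
    # Encode the raw letter with the already-fitted encoder, then predict
    encoded = label_encoder.transform([letter])
    return rf_model.predict(encoded[:, None])[0]

predict_gender('a')  # e.g. 'male'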

Refer to the documentation for LabelEncoder.transform.

answered Nov 15 '22 by YLJ