I am working on a prediction project (for fun). Basically, I pulled male and female names from nltk, labeled each name as 'male' or 'female', extracted the last letter of each name, and finally used different machine learning algorithms to train models and predict gender based on that last letter.
Since sklearn estimators do NOT accept categorical string data directly, I used LabelEncoder to transform the last letter into numeric values:
Before transform:
    name last_letter gender
0  Aamir           r   male
1  Aaron           n   male
2  Abbey           y   male
3  Abbie           e   male
4  Abbot           t   male

      name last_letter  gender
0  Abagael           l  female
1  Abagail           l  female
2     Abbe           e  female
3    Abbey           y  female
4     Abbi           i  female
If we concatenate the two dataframes, drop the name column, and shuffle:
  last_letter  gender
0           a    male
1           e  female
2           g    male
3           h    male
4           e    male
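For reference, the concatenate/drop/shuffle step can be done in pandas roughly like this (a sketch; `df_male` and `df_female` stand in for the two name dataframes shown above, with a couple of toy rows each):

```python
import pandas as pd

# Two toy frames standing in for the male/female name frames above
df_male = pd.DataFrame({'name': ['Aamir', 'Aaron'],
                        'last_letter': ['r', 'n'],
                        'gender': ['male', 'male']})
df_female = pd.DataFrame({'name': ['Abagael', 'Abbe'],
                          'last_letter': ['l', 'e'],
                          'gender': ['female', 'female']})

# Concatenate, drop the name column, and shuffle the rows
df = pd.concat([df_male, df_female], ignore_index=True)
df = df.drop(columns='name')
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df)
```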
Then I used LabelEncoder:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])
df.head()
The dataframe becomes (both columns are encoded, with 'female' as 0 and 'male' as 1):

  last_letter  gender
0           1       1
1           5       0
2           7       1
3           8       1
4           5       1
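One caveat with the loop above: a single LabelEncoder is re-fit on every column, so after the loop it only remembers the last column's mapping. If you want to decode encoded values back to strings later, it is safer to keep one encoder per column. A minimal sketch of that idea, using the small dataframe shown above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'last_letter': ['a', 'e', 'g', 'h', 'e'],
                   'gender': ['male', 'female', 'male', 'male', 'male']})

# Fit one encoder per column and keep them for later decoding
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Each column can now be decoded independently
print(encoders['gender'].inverse_transform([0, 1]))  # prints ['female' 'male']
```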
Now suppose I train a model on this (let's say a Random Forest here) and want to predict the gender for a raw letter, e.g.:

rf_model.predict('a')

That's not going to work, since the model only accepts numeric values. If instead I do:

rf_model.predict(1) (assuming the letter 'a' is encoded as the number 1)

the prediction returns

array([1])

So how do I do something like rf_model.predict('a') and get a result like 'female' or 'male', instead of having to enter a numeric value and getting a numeric value back?
Just use the same LabelEncoder you created! Since you already fit it with the training data, you can directly transform new data with its transform method.
In [2]: from sklearn.preprocessing import LabelEncoder
In [3]: label_encoder = LabelEncoder()
In [4]: label_encoder.fit_transform(['a', 'b', 'c'])
Out[4]: array([0, 1, 2])
In [5]: label_encoder.transform(['a'])
Out[5]: array([0])
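Note that transform raises a ValueError for any label the encoder has never seen, so if a prediction input might contain an unseen letter it is worth guarding for it. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['a', 'b', 'c'])

try:
    label_encoder.transform(['z'])  # 'z' was not in the training data
except ValueError:
    print("unseen label: 'z'")
```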
To use it with RandomForestClassifier:
In [59]: from sklearn.ensemble import RandomForestClassifier
In [60]: X = ['a', 'b', 'c']
In [61]: y = ['male', 'female', 'female']
In [62]: X_encoded = label_encoder.fit_transform(X)
In [63]: rf_model = RandomForestClassifier()
In [64]: rf_model.fit(X_encoded[:, None], y)
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
In [65]: x = ['a']
In [66]: x_encoded = label_encoder.transform(x)
In [67]: rf_model.predict(x_encoded[:, None])
Out[67]:
array(['male'],
dtype='<U6')
As you can see, you can get the string outputs 'male' / 'female' directly from the classifier if you used the string labels to fit it.
Refer to the documentation for LabelEncoder.transform.
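Putting the pieces together, you can wrap the encode-then-predict steps in a small helper so the calling code deals only in raw letters and string labels (a sketch; `predict_gender` is a hypothetical helper name, and the three-row training set is the toy example from above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

X = ['a', 'b', 'c']
y = ['male', 'female', 'female']

label_encoder = LabelEncoder()
X_encoded = label_encoder.fit_transform(X)

rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(X_encoded[:, None], y)  # [:, None] reshapes to a 2-D column

def predict_gender(letter):
    """Encode a raw letter and return the predicted string label."""
    encoded = label_encoder.transform([letter])
    return rf_model.predict(encoded[:, None])[0]

print(predict_gender('a'))
```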