I am trying to encode some categorical features so that I can use them in a machine learning model. At the moment I have the following code:
import pandas as pd
from sklearn import preprocessing

data_path = '/Users/novikov/Assignment2/epl-training.csv'
data = pd.read_csv(data_path)
data['Date'] = pd.to_datetime(data['Date'])

le = preprocessing.LabelEncoder()
data['HomeTeam'] = le.fit_transform(data.HomeTeam.values)
data['AwayTeam'] = le.fit_transform(data.AwayTeam.values)
data['FTR'] = le.fit_transform(data.FTR.values)
data['HTR'] = le.fit_transform(data.HTR.values)
data['Referee'] = le.fit_transform(data.Referee.values)
This works fine; however, it is not ideal, because if there were 100 features to encode, writing a line for each one by hand would take far too long. How do I automate the process? I have tried implementing a loop:
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(method)
But I get ValueError: bad input shape ():
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-1b8fb6164d2d> in <module>()
11 method = 'data.' + feature + '.values'
12 print(method)
---> 13 data[feature] = le.fit_transform(method)
/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
109 y : array-like of shape [n_samples]
110 """
--> 111 y = column_or_1d(y, warn=True)
112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
612 return np.ravel(y)
613
--> 614 raise ValueError("bad input shape {0}".format(shape))
615
616
ValueError: bad input shape ()
None of the variations of this code I have tried (such as using data.feature.values instead) seem to work. There must be a way of doing this other than writing it all out by hand.
Of course, method = 'data.' + feature + '.values' will not work: it is just a string, not the column's values. Try instead
method = data[feature].values
or
for feature in label_encode:
    data[feature] = le.fit_transform(data[feature].values)
The way the encoder object works is that fit stores some metadata (the learned classes) in the object's attributes; transform then uses those attributes to encode the data, and fit_transform is a convenience method that does both in one step.
When you reuse the same object for another fit_transform, you overwrite that stored metadata. That is fine as long as you never need the object's inverse_transform afterwards.
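To make the consequence concrete, here is a minimal sketch (with made-up team and referee values, not taken from the data set above) of how a single shared encoder forgets the earlier column's classes:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
home = le.fit_transform(['Arsenal', 'Chelsea', 'Arsenal'])
refs = le.fit_transform(['J. Moss', 'M. Dean', 'M. Dean'])   # overwrites le.classes_

print(le.classes_)                 # ['J. Moss' 'M. Dean'] -- only the referee classes remain
print(le.inverse_transform(refs))  # fine: ['J. Moss' 'M. Dean' 'M. Dean']
print(le.inverse_transform(home))  # wrong labels: decodes the team codes as referee names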
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    'HomeTeam': [1, 3, 27],
    'AwayTeam': [9, 8, 100],
    'FTR': ['dog', 'cat', 'dog'],
    'HTR': [*'XYY'],
    'Referee': [*'JJB']
})
update and apply
le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df.update(df[label_encode].apply(le.fit_transform))
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
Each column's separate encoder is captured in the le dictionary for potential later use, as sketched after the output below.
from collections import defaultdict
le = defaultdict(preprocessing.LabelEncoder)
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df = df.assign(**{k: le[k].fit_transform(df[k]) for k in label_encode})
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
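As a sketch of that later use (assuming the df and le dictionary from the snippet just above), each stored encoder can turn its column's integer codes back into the original labels:

# Every column was encoded by its own LabelEncoder, so inverse_transform
# recovers the original values column by column.
decoded = df[label_encode].apply(lambda s: le[s.name].inverse_transform(s))
print(decoded)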
pandas.factorize
If you just want the integer codes, you can use pandas' factorize. Note that it does not sort the values; it labels them in the order they first appear.
df.update(df[label_encode].apply(lambda x: x.factorize()[0]))
df
AwayTeam FTR HTR HomeTeam Referee
0 0 0 0 0 0
1 1 1 1 1 0
2 2 0 1 2 1
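If you also need to map the codes back later, note that pd.factorize returns the uniques alongside the codes. A small sketch using the FTR values from the toy frame above:

# factorize returns (codes, uniques); uniques are listed in order of first appearance.
codes, uniques = pd.factorize(['dog', 'cat', 'dog'])
print(codes)           # [0 1 0]
print(uniques)         # ['dog' 'cat']
print(uniques[codes])  # ['dog' 'cat' 'dog'] -- the original values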
np.unique
This does sort the values, so the result looks the same as LabelEncoder's.
df.update(df[label_encode].apply(lambda x: np.unique(x, return_inverse=True)[1]))
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
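For reference, a sketch of what that apply does for a single column (again using the FTR values from the toy frame): np.unique sorts the unique values, and return_inverse gives each element's position in that sorted array, which is the same coding scheme LabelEncoder uses.

classes, inverse = np.unique(['dog', 'cat', 'dog'], return_inverse=True)
print(classes)           # ['cat' 'dog'] -- sorted, unlike factorize
print(inverse)           # [1 0 1]
print(classes[inverse])  # recovers the original values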