I am trying to encode some categorical features so that I can use them in a machine learning model. At the moment I have the following code:
import pandas as pd
from sklearn import preprocessing

data_path = '/Users/novikov/Assignment2/epl-training.csv'
data = pd.read_csv(data_path)
data['Date'] = pd.to_datetime(data['Date'])

le = preprocessing.LabelEncoder()
data['HomeTeam'] = le.fit_transform(data.HomeTeam.values)
data['AwayTeam'] = le.fit_transform(data.AwayTeam.values)
data['FTR'] = le.fit_transform(data.FTR.values)
data['HTR'] = le.fit_transform(data.HTR.values)
data['Referee'] = le.fit_transform(data.Referee.values)
This works fine; however, it is not ideal, because if there were 100 features to encode, writing a line for each one by hand would take far too long. How do I automate the process? I have tried implementing a loop:
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(method)
But I get ValueError: bad input shape ():
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-1b8fb6164d2d> in <module>()
11 method = 'data.' + feature + '.values'
12 print(method)
---> 13 data[feature] = le.fit_transform(method)
/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
109 y : array-like of shape [n_samples]
110 """
--> 111 y = column_or_1d(y, warn=True)
112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
612 return np.ravel(y)
613
--> 614 raise ValueError("bad input shape {0}".format(shape))
615
616
ValueError: bad input shape ()
None of the variations of this code I have tried (such as using data.feature.values instead) seem to work. There must be a way of doing this other than writing it all out by hand.
Of course, method = 'data.' + feature + '.values' will not work: it is just a string, not the column's values. Try instead
method = data[feature].values
or
for feature in label_encode:
    data[feature] = le.fit_transform(data[feature].values)
The way the encoder object works is that fit stores some metadata (the learned classes) in the object's attributes; transform then uses those attributes to encode the data, and fit_transform is a convenience method that does both in one step.
When you reuse the same object for another fit_transform, you overwrite that stored metadata. That is fine as long as you never need the object's inverse_transform afterwards.
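To make the consequence concrete, here is a minimal sketch (with made-up team and referee values, not taken from the data set above) of how a single shared encoder forgets the earlier column's classes:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
home = le.fit_transform(['Arsenal', 'Chelsea', 'Arsenal'])
refs = le.fit_transform(['J. Moss', 'M. Dean', 'M. Dean'])   # overwrites le.classes_

print(le.classes_)                 # ['J. Moss' 'M. Dean'] -- only the referee classes remain
print(le.inverse_transform(refs))  # fine: ['J. Moss' 'M. Dean' 'M. Dean']
print(le.inverse_transform(home))  # wrong labels: decodes the team codes as referee names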
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    'HomeTeam': [1, 3, 27],
    'AwayTeam': [9, 8, 100],
    'FTR': ['dog', 'cat', 'dog'],
    'HTR': [*'XYY'],
    'Referee': [*'JJB']
})
update and apply
le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df.update(df[label_encode].apply(le.fit_transform))
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
Each column's separate encoder is captured in the le dictionary for potential later use, as sketched after the output below.
from collections import defaultdict
le = defaultdict(preprocessing.LabelEncoder)
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df = df.assign(**{k: le[k].fit_transform(df[k]) for k in label_encode})
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
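As a sketch of that later use (assuming the df and le dictionary from the snippet just above), each stored encoder can turn its column's integer codes back into the original labels:

# Every column was encoded by its own LabelEncoder, so inverse_transform
# recovers the original values column by column.
decoded = df[label_encode].apply(lambda s: le[s.name].inverse_transform(s))
print(decoded)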
pandas.factorize
If you just want the integer codes, you can use pandas' factorize. Note that it does not sort the values; it labels them in the order they first appear.
df.update(df[label_encode].apply(lambda x: x.factorize()[0]))
df
AwayTeam FTR HTR HomeTeam Referee
0 0 0 0 0 0
1 1 1 1 1 0
2 2 0 1 2 1
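If you also need to map the codes back later, note that pd.factorize returns the uniques alongside the codes. A small sketch using the FTR values from the toy frame above:

# factorize returns (codes, uniques); uniques are listed in order of first appearance.
codes, uniques = pd.factorize(['dog', 'cat', 'dog'])
print(codes)           # [0 1 0]
print(uniques)         # ['dog' 'cat']
print(uniques[codes])  # ['dog' 'cat' 'dog'] -- the original values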
np.unique
This does sort the values, so the result looks the same as LabelEncoder's.
df.update(df[label_encode].apply(lambda x: np.unique(x, return_inverse=True)[1]))
df
AwayTeam FTR HTR HomeTeam Referee
0 1 1 0 0 1
1 0 0 1 1 1
2 2 1 1 2 0
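For reference, a sketch of what that apply does for a single column (again using the FTR values from the toy frame): np.unique sorts the unique values, and return_inverse gives each element's position in that sorted array, which is the same coding scheme LabelEncoder uses.

classes, inverse = np.unique(['dog', 'cat', 'dog'], return_inverse=True)
print(classes)           # ['cat' 'dog'] -- sorted, unlike factorize
print(inverse)           # [1 0 1]
print(classes[inverse])  # recovers the original values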