I am trying to use the label encoder in order to convert categorical data into numeric values.
I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards. So I would like to use a mask to restore the missing values from the original data frame after labelling, like this:
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
dfTmp = df
mask = dfTmp.isnull()
A B C
0 False False False
1 True False False
2 False False True
So I get a dataframe with True/False values.
Then I create the encoder:
df = df.astype(str).apply(LabelEncoder().fit_transform)
How can I proceed from here in order to encode these values?
Thanks
The first question is: do you wish to encode each column separately or encode them all with one encoding?
The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.
In that case you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
A B C
0 0.0 0 1.0
1 NaN 1 0.0
2 1.0 2 NaN
An explanation of how it works is given below. But first, a couple of drawbacks of this solution.
Drawbacks
First, the columns end up with mixed types: if a column contains a NaN value, then the column has type float, because NaN is a float in Python.
df.dtypes
A float64
B int64
C float64
dtype: object
That seems meaningless for labels. Still, later you can ignore all the NaNs and convert the rest to integers.
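If integer labels are preferred, one option is pandas' nullable integer dtype. This is only a minimal sketch, assuming a pandas version that supports the "Int64" extension dtype; the `encoded` frame is simply the output shown above rebuilt by hand, not something produced by the encoder here.

```python
import numpy as np
import pandas as pd

# Encoded frame as printed above: NaN forces columns A and C to float64.
encoded = pd.DataFrame({
    "A": [0.0, np.nan, 1.0],
    "B": [0, 1, 2],
    "C": [1.0, 0.0, np.nan],
})

# "Int64" (capital I) is pandas' nullable integer dtype: labels become
# integers and missing entries are stored as pd.NA instead of NaN.
as_int = encoded.astype("Int64")

print(as_int.dtypes)
print(as_int)
```

Whether a downstream imputer accepts pd.NA depends on the library version, so keeping the float columns may be simpler if imputation is the next step.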
Second, you will probably need to keep the LabelEncoder, because it is often needed later, for instance for an inverse transform. But this solution doesn't keep the encoders: you have no such variable.
A simple, explicit solution is:
encoders = dict()
for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder
print(df)
Out:
A B C
0 0.0 0 1.0
1 NaN 1 0.0
2 1.0 2 NaN
- more code, but the result is the same.
print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}
- also, the encoders are available. So is the inverse transform (drop the NaNs first!):
encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
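Column B has no missing values, so it can be passed in directly. For a column that does contain NaN, the missing entries have to be dropped and the float labels cast back to integers first. A possible sketch follows; `restore_column` is a hypothetical helper name, not part of scikit-learn or pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def restore_column(encoded, encoder):
    """Hypothetical helper: inverse-transform labels, leaving NaN in place."""
    present = encoded.notnull()
    # The NaNs upcast the labels to float, so cast them back to int.
    labels = encoded[present].astype(int)
    restored = pd.Series(encoder.inverse_transform(labels), index=labels.index)
    # Reindexing puts NaN back at the positions that were dropped.
    return restored.reindex(encoded.index)


# Rebuild column A and encode it the same way as in the answer.
original = pd.Series(["x", np.nan, "z"], name="A")
encoder = LabelEncoder()
present = original.notnull()
encoded = pd.Series(
    encoder.fit_transform(original[present]),
    index=original[present].index,
).reindex(original.index)

result = restore_column(encoded, encoder)
print(result)
```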
Other options, such as a registry class that holds the encoders, are also possible. They are compatible with the first solution and make it easier to iterate over the columns.
How it works
The df.apply(lambda series: ...) call applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.
Expression step by step:
pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)
- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.
- as the label encoder returns a numpy.array and throws away the index, index=series[series.notnull()].index restores it so that the result is aligned correctly. Without the index:
print(df)
Out:
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
A B C
0 0.0 0 1.0
1 1.0 1 0.0
2 NaN 2 NaN
- the values shift from their correct positions, and an IndexError may even occur.
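For comparison, the mask idea from the question can also be made to work with DataFrame.where. This is a sketch rather than a recommendation: it assumes astype(str) turns NaN into the string "nan" (behaviour may differ between pandas versions), in which case the placeholder gets a label of its own and the remaining labels of a column may not start at 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"A": ["x", np.nan, "z"], "B": [1, 6, 9], "C": [2, 1, np.nan]})

# Remember where the missing values are before anything is converted.
mask = df.isnull()

# Encode every column as strings; missing values are encoded as well.
encoded = df.astype(str).apply(LabelEncoder().fit_transform)

# Keep labels where the original value was present; put NaN back elsewhere.
encoded = encoded.where(~mask)
print(encoded)
```

Unlike the notnull() approach, the label assigned to the placeholder is still part of each encoder's classes, which is worth keeping in mind for any later inverse transform.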
Single encoder for all columns
In that case, stack the dataframe, fit the encoder, then unstack it:
series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
A B C
0 5.0 0.0 2.0
1 NaN 3.0 1.0
2 6.0 4.0 NaN
- as the unstacked result contains NaNs, all values in the DataFrame are floats, so you may prefer to convert them.
Hope it helps.