
LabelEncoder that keeps missing values as 'NaN'

I am trying to use the label encoder in order to convert categorical data into numeric values.

I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards. So I would like to use a mask to restore the missing values from the original data frame after labelling, like this:

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})


    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN


dfTmp = df
mask = dfTmp.isnull()

       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

So I get a dataframe of True/False values.

Then I create the encoder:

df = df.astype(str).apply(LabelEncoder().fit_transform)

How can I proceed then, in order to encode these values?

thanks

Nasri asked Jan 30 '19 15:01


1 Answer

The first question is: do you wish to encode each column separately or encode them all with one encoding?

The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.

In that case, you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

An explanation of how it works is given below. But first, here are a couple of drawbacks of this solution.

Drawbacks
First, the column types are mixed: if a column contains a NaN value, the column has type float, because NaNs are floats in Python.

df.dtypes
A    float64
B      int64
C    float64
dtype: object

Floats seem meaningless for labels. Still, you can later ignore all the NaNs and convert the rest to integers.
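
As a minimal sketch of that conversion, assuming your pandas version provides the nullable 'Int64' extension dtype: casting the encoded frame to 'Int64' stores the labels as integers and the missing entries as <NA>. The names encoded and encoded_int are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Same per-column encoding as above: fit only on the non-null values
# and keep their original index so the result aligns with df.
encoded = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))

# Nullable integer dtype: labels are integers, missing values become <NA>.
encoded_int = encoded.astype('Int64')
print(encoded_int.dtypes)
print(encoded_int)
```

Whether a later imputation step accepts <NA> depends on the tool you use, so you may need to convert back to float before imputing.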

Second, you probably need to keep the LabelEncoder, because it is often required later, for instance for an inverse transform. But this solution doesn't keep the encoders; there is no such variable.

A simple, explicit solution is:

encoders = dict()

for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder

print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

- more code, but result is the same

print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}

- also, the encoders are available. So is the inverse transform (drop the NaNs first!):

encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
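
For a column that does contain NaNs, such as 'A', the inverse transform could look roughly like the sketch below. The helper inverse_transform_with_nan is a hypothetical name, not part of scikit-learn; it drops the NaNs, decodes the remaining labels, and reindexes so the missing positions come back as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Encode each column on its non-null values and keep the fitted encoders.
encoders = {}
encoded = pd.DataFrame(index=df.index)
for col_name in df.columns:
    series = df[col_name]
    not_null = series[series.notnull()]
    encoder = LabelEncoder()
    encoded[col_name] = pd.Series(encoder.fit_transform(not_null), index=not_null.index)
    encoders[col_name] = encoder


def inverse_transform_with_nan(encoded_series, encoder):
    """Hypothetical helper: decode labels, leaving missing positions as NaN."""
    labels = encoded_series.dropna().astype(int)  # inverse_transform needs integer labels
    decoded = pd.Series(encoder.inverse_transform(labels), index=labels.index)
    return decoded.reindex(encoded_series.index)  # restore the rows that were dropped


restored_a = inverse_transform_with_nan(encoded['A'], encoders['A'])
print(restored_a)
```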

Other options, such as a registry superclass for encoders, are also available. They are compatible with the first solution and make it easier to iterate through the columns.

How it works

df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.

Expression step by step:

pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)

- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.

- as the label encoder returns a numpy.array and discards the index, index=series[series.notnull()].index restores it so that the columns are concatenated correctly. If you don't restore the index:

print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN

df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN

- the values shift away from their correct positions, and an IndexError may even occur.
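
The index alignment can also be seen on a single column. The sketch below is only an illustration (the variable names are made up): the labels are attached to the index of the non-null rows, and reindexing against the original index puts NaN back where the value was missing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

series = pd.Series(['x', np.nan, 'z'])

# Only rows 0 and 2 survive the null filter; their index labels are kept.
not_null = series[series.notnull()]
codes = LabelEncoder().fit_transform(not_null)  # plain numpy array, no index

# Attach the surviving index, then align with the original index.
aligned = pd.Series(codes, index=not_null.index).reindex(series.index)
print(aligned)
```

Inside df.apply the reindexing step is not needed, because pandas aligns the returned series by index when it assembles the resulting dataframe.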

Single encoder for all columns

In that case, stack the dataframe, fit the encoder, then unstack it:

series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN

- as the unstacked DataFrame contains NaNs, all of its values are floats, so you may prefer to convert them.
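
A possible variant of the single-encoder approach, sketched under two assumptions: the missing values are dropped explicitly with dropna() rather than relying on how stack() handles them by default, and the nullable 'Int64' dtype is available for the final conversion. The names stacked, shared_encoder and encoded are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Stack to one long series, drop missing values explicitly, compare as strings.
stacked = df.stack().dropna().astype(str)

shared_encoder = LabelEncoder()
encoded = pd.Series(
    shared_encoder.fit_transform(stacked),
    index=stacked.index
).unstack()

# The unstacked frame holds NaN, so it is float; cast to nullable integers.
encoded = encoded.astype('Int64')
print(encoded)
```

Note that because the values are compared as strings, 1 from column B and 1.0 from column C get different labels.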

Hope it helps.

Mikhail Stepanov answered Nov 12 '22 10:11