
sklearn - how to incorporate missing data when one-hot encoding

I'm trying to keep rows in a dataset that contain missing data.

When one-hot encoding a column (or multiple columns) with sklearn, is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array for that item is set to all 0s?

e.g.

A A B -> [[1, 0], [1, 0], [0,1]]

B B A -> [[0, 1], [0, 1], [1,0]]

null B A -> [[0, 0], [0, 1], [1,0]]


one-hot encoding:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical  # to_categorical comes from Keras, not sklearn


dataset = np.loadtxt("someFile.csv", delimiter=",")
B = dataset[:,1]

encoder = LabelEncoder()
encoder.fit(B)
encoded_B = encoder.transform(B)

Y = to_categorical(encoded_B)

EDIT - Example Dataset, where A-E are inputs and X & Y are outputs:

A     B     C     D     E     X      Y
7     6     3     3     2     11     4
5     6     0     0     7     15     7
3     3     9     null  7     12     7
7     null  7     null  7     12     13
null  7     4     6     12    13     4
null  5     7     6     null  14     7
2     6     0     0     2     13     3
7     null  7     null  2     13     7
asked Jan 04 '18 at 07:01 by JoeBoggs




1 Answer

If you have pandas, this is pretty simple.

import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s

0      A
1      A
2      0
3      B
4      0
5      A
6    NaN
dtype: object

Use replace to convert 0 (stored either as an integer or as the string '0') to NaN:

s = s.replace({0 : np.nan, '0' : np.nan})
s

0      A
1      A
2    NaN
3      B
4    NaN
5      A
6    NaN
dtype: object

Now call pd.get_dummies. By default (dummy_na=False) it creates no indicator column for NaN, so every row that held NaN comes out as all zeros.

pd.get_dummies(s)

   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0
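
One version caveat: on pandas 2.0 and later, get_dummies returns bool columns by default, so the table above would print True/False instead of 1/0. A minimal end-to-end sketch of the steps above, assuming you want integer output (the dtype=int argument is an addition here, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Same series as above: 0 and NaN both stand for "missing".
s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])

# Map the 0 placeholders (integer or string) to NaN so get_dummies skips them.
s = s.replace({0: np.nan, '0': np.nan})

# dtype=int forces 1/0 output; rows that held NaN become all zeros.
encoded = pd.get_dummies(s, dtype=int)
print(encoded)

# Plain nested list, if you need the array form from the question.
rows = encoded.to_numpy().tolist()
```

Calling encoded.to_numpy() gives the plain array, which can be passed on in place of the to_categorical result from the question.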

The solution is the same for a dataframe.
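
As a rough illustration of the dataframe case, here is a sketch built on the A/B rows from the question. The column names c1, c2, c3 and the fourth row containing a 0 are made up for the example; dtype=int is again an assumption about the output you want.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row is one sample from the question,
# plus a fourth row that uses 0 as its "missing" placeholder.
df = pd.DataFrame(
    [['A', 'A', 'B'],
     ['B', 'B', 'A'],
     [None, 'B', 'A'],
     ['A', 0, 'B']],
    columns=['c1', 'c2', 'c3'],  # illustrative names only
)

# Treat 0 (integer or string) as missing, exactly as for the Series.
df = df.replace({0: np.nan, '0': np.nan})

# Each column is encoded separately; a missing cell leaves
# that column's indicator group all zero.
encoded_df = pd.get_dummies(df, dtype=int)
print(encoded_df)
```

Each source column gets its own pair of indicator columns (c1_A, c1_B, and so on), so the third row encodes as [0, 0, 0, 1, 1, 0], which is the question's [[0, 0], [0, 1], [1, 0]] laid out flat.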

answered Oct 30 '22 at 13:10 by cs95