sklearn - how to incorporate missing data when one-hot encoding

I'm trying to keep rows in a dataset that contain missing data when one-hot encoding a column (or multiple columns) with sklearn. Is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array is set to all 0s?

e.g.

A A B -> [[1, 0], [1, 0], [0, 1]]

B B A -> [[0, 1], [0, 1], [1, 0]]

null B A -> [[0, 0], [0, 1], [1, 0]]

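Current one-hot encoding:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical  # assumed: to_categorical is the Keras helper

# Load the CSV and take the second column (B)
dataset = np.loadtxt("someFile.csv", delimiter=",")
B = dataset[:, 1]

# Map each distinct value to an integer label
encoder = LabelEncoder()
encoder.fit(B)
encoded_B = encoder.transform(B)

# Expand the integer labels into one-hot rows
Y = to_categorical(encoded_B)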

EDIT - Example dataset, where A-E are inputs and X & Y are outputs:

A     B     C     D     E     X      Y
7     6     3     3     2     11     4
5     6     0     0     7     15     7
3     3     9     null  7     12     7
7     null  7     null  7     12     13
null  7     4     6     12    13     4
null  5     7     6     null  14     7
2     6     0     0     2     13     3
7     null  7     null  2     13     7

asked Jan 04 '18 07:01

JoeBoggs

1 Answer

If you have pandas, this is pretty simple.

import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s

0      A
1      A
2      0
3      B
4      0
5      A
6    NaN
dtype: object

Use replace to convert 0 (as a number or as the string '0') to NaN:

s = s.replace({0 : np.nan, '0' : np.nan})
s

0      A
1      A
2    NaN
3      B
4    NaN
5      A
6    NaN
dtype: object

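Now call pd.get_dummies, which ignores NaN values by default (dummy_na=False), so those rows come out as all zeros. On pandas 2.0 and later the columns are boolean by default; pass dtype=int to get the 0/1 output shown here.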

pd.get_dummies(s)

   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0
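
The two steps above can be wrapped into one helper that reproduces the nested-list form from the question. This is only a sketch: the function name one_hot_with_missing and its missing parameter are made up for illustration, and treating the string "null" as a missing marker is an assumption based on the question's examples.

```python
import pandas as pd


def one_hot_with_missing(values, missing=(0, "0", "null")):
    """One-hot encode values; entries listed in `missing` (or NaN) become all-zero rows.

    Hypothetical helper for illustration, not part of pandas or sklearn.
    """
    s = pd.Series(values, dtype=object)
    # Turn every missing marker into NaN so that get_dummies skips it.
    s = s.mask(s.isin(list(missing)))
    # dtype=int keeps 0/1 output on pandas versions where the default is bool.
    return pd.get_dummies(s, dtype=int)


encoded = one_hot_with_missing(["null", "B", "A"])
print(encoded.to_numpy().tolist())  # [[0, 0], [0, 1], [1, 0]]
```

Calling .to_numpy() on the result gives an array that can be passed on wherever the output of to_categorical was used before.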

The solution is the same for a dataframe.
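
As a rough illustration of the dataframe case, the sketch below uses the first rows of the example dataset from the question. The comma-separated layout, the in-memory csv_text string, and the variable names are assumptions made for the example; the real file may be laid out differently.

```python
import io

import numpy as np
import pandas as pd

# First rows of the question's example dataset, written as CSV for illustration.
csv_text = """A,B,C,D,E,X,Y
7,6,3,3,2,11,4
5,6,0,0,7,15,7
3,3,9,null,7,12,7
7,null,7,null,7,12,13
null,7,4,6,12,13,4
"""

# Read "null" as NaN (listed explicitly here to make the intent visible).
df = pd.read_csv(io.StringIO(csv_text), na_values=["null"])

# Treat 0 as missing in the input columns only; X and Y are left untouched.
inputs = ["A", "B", "C", "D", "E"]
df[inputs] = df[inputs].replace(0, np.nan)

# One-hot encode the input columns; NaN entries become all-zero groups.
encoded = pd.get_dummies(df, columns=inputs, dtype=int)
print(encoded)
```

Each input column expands into one indicator column per distinct value, and a row with a missing entry gets zeros across that column's whole group. Because the columns containing NaN are read as floats, the generated column names are built from the float form of each value.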

answered Oct 30 '22 13:10

cs95

Tags: python, numpy, scikit-learn