
Explain OneHotEncoder using Python

I am new to the scikit-learn library and have been trying to use it to predict stock prices. I was going through its documentation and got stuck at the part where they explain OneHotEncoder(). Here is the code they use:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Can someone please explain, step by step, what is happening here? I have a clear idea of how one-hot encoding works, but I can't figure out how this code works. Any help is appreciated. Thanks!

Shashwat Siddhant asked Mar 10 '17 22:03


2 Answers

Let's start by writing down what you would expect (assuming you know what one-hot encoding means).

unencoded

f0 f1 f2
0, 0, 3
1, 1, 0
0, 2, 1
1, 0, 2

encoded

|f0|  |  f1 |  |   f2   |

1, 0, 1, 0, 0, 0, 0, 0, 1 
0, 1, 0, 1, 0, 1, 0, 0, 0
1, 0, 0, 0, 1, 0, 1, 0, 0
0, 1, 1, 0, 0, 0, 0, 1, 0
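
The table above can be reproduced by hand, without scikit-learn. This is only a minimal sketch, assuming every feature takes the integer values 0 up to the largest value seen in its column; the helper name one_hot_rows is made up for illustration and is not part of any library.

```python
def one_hot_rows(rows):
    """One-hot encode integer rows, assuming each column takes values 0..max."""
    # Number of possible values per column, inferred from the data itself.
    sizes = [max(column) + 1 for column in zip(*rows)]
    encoded = []
    for row in rows:
        out = []
        for value, size in zip(row, sizes):
            block = [0] * size
            block[value] = 1  # switch on the slot that stands for this value
            out.extend(block)
        encoded.append(out)
    return encoded


unencoded = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoded = one_hot_rows(unencoded)
for row in encoded:
    print(row)
```

Each printed row should match the corresponding row of the encoded table.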

To learn the mapping that produces encoded, call

enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

with the default n_values='auto'. With n_values='auto' you are saying that the values your features (the columns of the unencoded matrix) can take on should be inferred from the columns of the data handed to fit.

That brings us to enc.n_values_

from the docs:

Number of values per feature.

enc.n_values_
array([2, 3, 4])

This means that f0 (the first column) can take on 2 values (0, 1), f1 can take on 3 values (0, 1, 2), and f2 can take on 4 values (0, 1, 2, 3).

Indeed, these are the values of the features f0, f1, f2 in the unencoded feature matrix.
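
To see where the counts come from, here is a rough sketch that collects the distinct values in each column of the data passed to fit. It is plain Python, not the library's internal logic; the variable names are my own.

```python
data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# zip(*data) walks the matrix column by column.
seen_values = [sorted(set(column)) for column in zip(*data)]
n_values = [len(values) for values in seen_values]

for index, values in enumerate(seen_values):
    print("column", index, "takes", len(values), "values:", values)
```

For this data the number of distinct values in a column also equals its largest value plus one, so both ways of counting agree.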

then,

enc.feature_indices_
array([0, 2, 5, 9])

from the docs:

Indices to feature ranges. Feature i in the original data is mapped to features from feature_indices_[i] to feature_indices_[i+1] (and then potentially masked by active_features_ afterwards)

This gives the range of positions (in the encoded space) that each of the features f0, f1, f2 occupies.

f0: [0, 1], f1: [2, 3, 4], f2: [5, 6, 7, 8]
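
The boundaries are just a running total of the per-feature counts. A small sketch, assuming the counts [2, 3, 4] from above; feature_indices and ranges are ordinary local variables here, not attributes read from the encoder.

```python
from itertools import accumulate

n_values = [2, 3, 4]

# Running total of the block sizes, with a leading 0 for the first block.
feature_indices = [0] + list(accumulate(n_values))

# Feature i occupies positions feature_indices[i] up to (not including)
# feature_indices[i + 1].
ranges = [
    list(range(feature_indices[i], feature_indices[i + 1]))
    for i in range(len(n_values))
]
print(feature_indices)
print(ranges)
```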

Mapping the vector [0, 1, 1] into one-hot encoded space (under the mapping we got from enc.fit):

1, 0, 0, 1, 0, 0, 1, 0, 0

How?

The first element belongs to f0, so its value 0 maps to position 0 (if the element were 1 instead of 0, it would map to position 1).

The next element, 1, maps to position 3, because f1 starts at position 2 and 1 is the second possible value f1 can take on.

Finally, the third element, 1, maps to position 6, since it is the second possible value f2 can take on and f2 starts at position 5.
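
Put as arithmetic: each element lands at (start of its feature's block) + (its value). A minimal sketch using the boundaries [0, 2, 5, 9] shown earlier; the variable names are only for illustration.

```python
feature_indices = [0, 2, 5, 9]
sample = [0, 1, 1]

# Position of the "hot" entry for each feature: block start + value.
positions = [feature_indices[i] + value for i, value in enumerate(sample)]

# The last boundary is the total length of the encoded vector.
encoded = [0] * feature_indices[-1]
for position in positions:
    encoded[position] = 1

print(positions)
print(encoded)
```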

Hope that clears up some stuff.

parsethis answered Nov 16 '22 04:11


Let's take these features one at a time:

>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

We're fitting an encoder to a set of four vectors, with 3 features each.

>>> enc.n_values_
array([2, 3, 4])
  • The 1st feature has 2 possible values: 0, 1
  • The 2nd feature has 3 possible values: 0, 1, 2
  • The 3rd feature has 4 possible values: 0, 1, 2, 3

Clear?

>>> enc.feature_indices_
array([0, 2, 5, 9])

The representation will concatenate the vectors for the three features. Since there are three features, the representation will always have three "True" entries (1), the rest "False" (0).

Since there are 2+3+4 possible values, the representation is 9 entries long.

  • Feature 1 starts at index 0
  • Feature 2 starts at index 2 (F1 start + len(F1))
  • Feature 3 starts at index 5 (F2 start + len(F2))

The final entry, 9, marks the end: it is one past the last encoded position.

>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Encoding the given values simply concatenates the three one-hot vectors, for the values 0, 1, 1:

  • F1: [1, 0]
  • F2: [0, 1, 0]
  • F3: [0, 1, 0, 0]

Slap those end-to-end, convert to the given float format, and we have the array shown in the example.
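
The same concatenation written out, as a sketch in plain Python rather than a look inside the encoder. It assumes the block sizes 2, 3 and 4 from n_values_ and uses floats to mirror the float output of the example.

```python
n_values = [2, 3, 4]
sample = [0, 1, 1]

# One block per feature: 1.0 at the index equal to the value, 0.0 elsewhere.
blocks = [
    [1.0 if index == value else 0.0 for index in range(size)]
    for value, size in zip(sample, n_values)
]

# Join the blocks end to end.
result = [entry for block in blocks for entry in block]
print(blocks)
print(result)
```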

Prune answered Nov 16 '22 03:11