
How to one-hot encode a vector with arbitrary number of unique values?

Given an output vector y that can take any number of discrete values, say from the set {1, 2, 3, 4}, consider this instance:

y = [1, 1, 2, 4, 2, 3, 1]

Is there an idiomatic NumPy (library) way of generating a one-hot encoded representation of this vector? I.e. such that

y_enc =
y1 y2 y3 y4
1  0  0  0
1  0  0  0
0  1  0  0
0  0  0  1
0  1  0  0
0  0  1  0
1  0  0  0

For the case {0, 1} I have a small example, but I can't see this going in the right direction:

>>> k
array([[0.],
       [0.],
       [0.],
       [0.],
       [1.]])
>>> z = np.zeros((5,2))
>>> z
array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])
>>> for i,ki in enumerate(k):
...   if (ki == 0):
...     z[i][0] += 1
...   if (ki == 1):
...     z[i][1] += 1
... 
>>> z
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])
>>>
TMOTTM asked Oct 24 '25 14:10


2 Answers

I find it most convenient to create an identity matrix and index it with y using "fancy indexing":

import numpy as np

y = np.array([1, 1, 2, 4, 2, 3, 1])
one_hot = np.eye(5, dtype=np.int32)[y]  # 5 = y.max() + 1; row y[i] of the identity

Now one_hot will be

array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0]], dtype=int32)

Since you are not using the value 0, you can slice away the first column:

one_hot[:,1:]

Output:

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]], dtype=int32)
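
The hard-coded 5 above assumes the largest label is 4 and that labels are small non-negative integers. As a rough sketch that does not depend on the label range, np.unique with return_inverse=True can map each label to a column index first. The helper name one_hot_encode is hypothetical and not part of NumPy.

```python
import numpy as np

def one_hot_encode(y):
    """Return (classes, one_hot) for a 1-D array of discrete labels.

    Hypothetical helper for illustration; not a NumPy function.
    """
    y = np.asarray(y)
    # classes: sorted unique labels; idx: position of each y[i] in classes
    classes, idx = np.unique(y, return_inverse=True)
    idx = idx.ravel()  # keep idx 1-D regardless of NumPy version
    # Row idx[i] of the identity matrix is the one-hot row for y[i]
    return classes, np.eye(classes.size, dtype=np.int32)[idx]

y = np.array([1, 1, 2, 4, 2, 3, 1])
classes, one_hot = one_hot_encode(y)
print(classes)
print(one_hot)
```

Because the columns follow the sorted unique labels, no column needs to be sliced away afterwards, and classes records which label each column stands for.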
Kevin answered Oct 26 '25 05:10


Use scipy.sparse.csr_matrix. A one-hot encoding consists mostly of zeros, so a sparse format is a good fit, and it stays memory-efficient for very large datasets.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

y = np.array([1, 1, 2, 4, 2, 3, 1])

y_unique = np.unique(y)

data = np.ones(y.shape[0])

row_idx = np.arange(y.shape[0])
col_idx = np.searchsorted(y_unique,y)

num_rows = y.shape[0]
num_cols = num_unique_y = y_unique.shape[0]
# Create the Compressed Sparse Row (CSR) matrix.
# This is an efficient way to store a sparse matrix (many zeros).
# The constructor takes (data, (row_indices, column_indices)) and
# shape=(number of rows, number of columns).
y_csr_matrix = csr_matrix(
    (data, (row_idx, col_idx)), shape=(num_rows, num_cols)
)
# Convert the sparse matrix to a dense NumPy array
y_dense_array = y_csr_matrix.astype(int).toarray()

# Create column names for the DataFrame
column_names = [f'col_{i}' for i in y_unique]  

# Create the pandas DataFrame
y_dataframe = pd.DataFrame(y_dense_array, columns=column_names)

print("\nDataFrame from Sparse Matrix:")
print(y_dataframe)
'''
DataFrame from Sparse Matrix:
   col_1  col_2  col_3  col_4
0      1      0      0      0
1      1      0      0      0
2      0      1      0      0
3      0      0      0      1
4      0      1      0      0
5      0      0      1      0
6      1      0      0      0
'''
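
Converting to a dense array or DataFrame gives up the memory advantage on large inputs. A minimal sketch, assuming the downstream code accepts SciPy sparse matrices, keeps the encoding sparse and densifies only a small slice for inspection. The int8 dtype and the variable names are choices made here for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

y = np.array([1, 1, 2, 4, 2, 3, 1])

# Map each label to its column index among the sorted unique labels
classes, col_idx = np.unique(y, return_inverse=True)
col_idx = col_idx.ravel()
n = y.shape[0]

# One stored entry per row; everything else is an implicit zero
one_hot_sparse = csr_matrix(
    (np.ones(n, dtype=np.int8), (np.arange(n), col_idx)),
    shape=(n, classes.size),
)

print(one_hot_sparse.nnz)            # number of stored (non-zero) entries
print(one_hot_sparse[:3].toarray())  # densify only the first three rows
```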
Soudipta Dutta answered Oct 26 '25 05:10


