Given an output vector y that can take any number of discrete values, say from the set {1, 2, 3, 4}, let the vector have this instance:
y = [1, 1, 2, 4, 2, 3, 1]
Is there a numpy-tonic library way of generating a one-hot encoded representation of this vector? That is, such that
y_enc =
y1 y2 y3 y4
1 0 0 0
1 0 0 0
0 1 0 0
0 0 0 1
0 1 0 0
0 0 1 0
1 0 0 0
For the case {0, 1} I have a small example, but I can't see this going in the right direction:
>>> k
array([[0.],
       [0.],
       [0.],
       [0.],
       [1.]])
>>> z = np.zeros((5,2))
>>> z
array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])
>>> for i,ki in enumerate(k):
...     if (ki == 0):
...         z[i][0] += 1
...     if (ki == 1):
...         z[i][1] += 1
...
>>> z
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])
>>>
I find it most convenient to create an identity matrix and index it with y using "fancy indexing":
y = np.array([1, 1, 2, 4, 2, 3, 1])
one_hot = np.eye(5, dtype=np.int32)[y]
Now one_hot will be
array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0]], dtype=int32)
Since you are not using the value 0, you can slice away the first column:
one_hot[:,1:]
Output:
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]], dtype=int32)
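The size 5 above is hard-coded for labels 0 to 4. If the labels are not known in advance, or are not consecutive integers, the same identity-matrix trick can be combined with np.unique. The following is only a sketch: the helper name one_hot is made up for illustration, and it assumes y is a 1-D array of hashable, sortable labels.

```python
import numpy as np

def one_hot(labels):
    """One-hot encode a 1-D array of discrete labels (hypothetical helper)."""
    labels = np.asarray(labels).ravel()
    # classes: sorted unique labels
    # inverse: for each element of labels, its index into classes
    classes, inverse = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class index i,
    # so fancy indexing with `inverse` picks the right row for every sample.
    encoded = np.eye(classes.size, dtype=np.int32)[inverse]
    return classes, encoded

y = np.array([1, 1, 2, 4, 2, 3, 1])
classes, y_enc = one_hot(y)
print(classes)  # the label that each column stands for
print(y_enc)
```

Because the columns follow the sorted unique labels, no unused column is created and no slicing is needed afterwards.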
Use csr_matrix. A one-hot encoding consists mostly of zeros, so a sparse format is a good fit, and it stays efficient for huge sparse datasets as well.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
y = np.array([1, 1, 2, 4, 2, 3, 1])
y_unique = np.unique(y)
data = np.ones(y.shape[0])
row_idx = np.arange(y.shape[0])
col_idx = np.searchsorted(y_unique,y)
num_rows = y.shape[0]
num_cols = num_unique_y = y_unique.shape[0]
'''
Create the Compressed Sparse Row (CSR) matrix
This is an efficient way to store a sparse matrix (many zeros)
The format is (data, (row_indices, column_indices)), shape=(number of rows, number of columns)
'''
y_csr_matrix = csr_matrix(
    (data, (row_idx, col_idx)), shape=(num_rows, num_cols)
)
# Convert the sparse matrix to a dense NumPy array
y_dense_array = y_csr_matrix.astype(int).toarray()
# Create column names for the DataFrame
column_names = [f'col_{i}' for i in y_unique]
# Create the pandas DataFrame
y_dataframe = pd.DataFrame(y_dense_array, columns=column_names)
print("\nDataFrame from Sparse Matrix:")
print(y_dataframe)
'''
DataFrame from Sparse Matrix:
   col_1  col_2  col_3  col_4
0      1      0      0      0
1      1      0      0      0
2      0      1      0      0
3      0      0      0      1
4      0      1      0      0
5      0      0      1      0
6      1      0      0      0
'''
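To illustrate why the sparse form may pay off on larger inputs, here is a rough sketch with made-up sizes (100,000 samples, 1,000 classes) and random labels. The sizes, the seed, and the variable names are assumptions for illustration only. The point is that the CSR matrix stores one entry per sample, whereas the dense array would hold one cell per sample per class.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility only
n_samples, n_classes = 100_000, 1_000    # hypothetical sizes
labels = rng.integers(0, n_classes, size=n_samples)  # labels in [0, n_classes)

rows = np.arange(n_samples)
data = np.ones(n_samples, dtype=np.int8)

# Same (data, (row_indices, column_indices)) constructor as above;
# the labels are already valid column indices here.
encoded = csr_matrix((data, (rows, labels)), shape=(n_samples, n_classes))

dense_cells = n_samples * n_classes
print("stored entries:", encoded.nnz)    # one per sample
print("dense cells:   ", dense_cells)    # one per sample per class
```

If a downstream library accepts SciPy sparse input, the toarray() step can be skipped entirely; whether that is possible depends on what consumes the encoding.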