
Convert pandas column of lists into matrix representation (One Hot Encoding)

I have a pandas column with lists of values of varying length like so:

  idx lists

    0 [1,3,4,5]
    1 [2]
    2 [3,5]
    3 [2,3,5]

I'd like to convert them into a matrix format where each possible value represents a column and each row populates a 1 if the value exists and 0 otherwise, like so:

idx  1 2 3 4 5 

  0  1 0 1 1 1
  1  0 1 0 0 0
  2  0 0 1 0 1
  3  0 1 1 0 1

I thought the term for this was one-hot encoding, so I tried pd.get_dummies, which is documented as doing one-hot encoding. But when I feed it a Series of lists like the one shown above:

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])
pd.get_dummies(test_hot)

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 899, in get_dummies
    dtype=dtype)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 906, in _get_dummies_1d
    codes, levels = _factorize_from_iterable(Series(data))
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 347, in __init__
    codes, categories = factorize(values, sort=False)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'list'

The method works fine if I'm feeding a single list of values such as:

[1,2,3,4,5]

It produces a 5x5 matrix, but each row contains only a single 1. I'm trying to extend this so that more than one value can be set per row by feeding in a column of lists.
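The traceback comes down to hashing: get_dummies factorizes the values it receives, and Python lists cannot be hashed. A minimal sketch of that difference, reusing test_hot from above (the names list_is_hashable and flat are only illustrative):

```python
import pandas as pd

test_hot = pd.Series([[1, 2, 3], [3, 4, 5], [1, 6]])

# Lists are mutable and therefore unhashable; this is the TypeError
# that surfaces at the bottom of the traceback.
try:
    hash(test_hot.iloc[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A flat Series of scalars hashes fine, so get_dummies accepts it,
# but each row can then carry only one value.
flat = pd.get_dummies(pd.Series([1, 2, 3, 4, 5]), dtype=int)

print(list_is_hashable)
print(flat.shape)
```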

asked Apr 14 '19 08:04 by Ben C Wang



2 Answers

If performance is important, use MultiLabelBinarizer from scikit-learn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])

# Fit on the lists and build one indicator column per distinct value
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(test_hot), columns=mlb.classes_)
print(df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1
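Applied to the data from the question, the same approach might look like the sketch below. It assumes the lists live in a DataFrame column named 'lists', as in the question's table; passing index=df.index is an addition here so the encoded rows keep the original row labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Data from the question; the column name 'lists' follows the question's table.
df = pd.DataFrame({'lists': [[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]]})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df['lists']),  # 2-D array of 0/1 indicators
    columns=mlb.classes_,            # the distinct values found, in sorted order
    index=df.index,                  # keep the original row labels
)
print(encoded)
```

The indicator frame can then be joined back onto the original with df.join(encoded) if the other columns are needed alongside it.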

Your get_dummies approach can be made to work as well: build a DataFrame from the lists, reshape it into a long Series with DataFrame.stack, call get_dummies on that, and finally aggregate with max per original row:

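# Parentheses allow the chained call to span several lines.
# max(level=0) was removed in pandas 2.0; groupby(level=0).max() is the equivalent.
# dtype=int keeps the 0/1 output, since newer pandas returns booleans by default.
df = (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int), dtype=int)
        .groupby(level=0)
        .max())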

print (df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1

Details:

Created MultiIndex Series:

print(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
0  0    1
   1    2
   2    3
1  0    3
   1    4
   2    5
2  0    1
   1    6
dtype: int32

Call pd.get_dummies:

print (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int)))
     1  2  3  4  5  6
0 0  1  0  0  0  0  0
  1  0  1  0  0  0  0
  2  0  0  1  0  0  0
1 0  0  0  1  0  0  0
  1  0  0  0  1  0  0
  2  0  0  0  0  1  0
2 0  1  0  0  0  0  0
  1  0  0  0  0  0  1

Finally, aggregate with max over the first index level, which collapses the per-value rows back into one row per original list.
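As an alternative that is not part of the answer above, pandas 0.25 and later offer Series.explode, which produces the long format directly without the intermediate DataFrame. A sketch under that assumption, with the same test_hot data:

```python
import pandas as pd

test_hot = pd.Series([[1, 2, 3], [3, 4, 5], [1, 6]])

# explode() emits one row per list element and repeats the original
# index label for each of them.
exploded = test_hot.explode().astype(int)

# One indicator row per element, then collapse back to one row per
# original index label by taking the max within each label.
df = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(df)
```

Because the exploded Series carries a plain index with repeated labels rather than a MultiIndex, groupby(level=0) groups on those labels directly.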

answered Sep 22 '22 14:09 by jezrael


To fix your get_dummies code, join each list into a comma-separated string and use Series.str.get_dummies:

df['lists'].map(lambda x: ','.join(map(str, x))).str.get_dummies(sep=',')

   1  2  3  4  5
0  1  0  1  1  1
1  0  1  0  0  0
2  0  0  1  0  1
3  0  1  1  0  1
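One thing to watch with this route: because the values pass through strings, the column labels come back as strings and are ordered as strings, so '10' would sort before '2'. A sketch of converting them back, assuming the question's data in a column named 'lists' (the variable name encoded is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'lists': [[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]]})

encoded = (df['lists']
           .map(lambda x: ','.join(map(str, x)))  # [1, 3, 4, 5] -> '1,3,4,5'
           .str.get_dummies(sep=','))

# Column labels are strings at this point; cast them to integers and
# re-sort so the columns appear in numeric order.
encoded.columns = encoded.columns.astype(int)
encoded = encoded.sort_index(axis=1)
print(encoded.columns.tolist())
```

This also means the approach only works when no value contains the separator character itself.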
answered Sep 19 '22 14:09 by cs95