Convert pandas column of lists into matrix representation (One Hot Encoding)

Tags: python, list, pandas

I have a pandas column with lists of values of varying length like so:

idx  lists
  0  [1,3,4,5]
  1  [2]
  2  [3,5]
  3  [2,3,5]

I'd like to convert them into a matrix format where each possible value represents a column and each row populates a 1 if the value exists and 0 otherwise, like so:

idx  1  2  3  4  5
  0  1  0  1  1  1
  1  0  1  0  0  0
  2  0  0  1  0  1
  3  0  1  1  0  1

I thought the term for this was one-hot encoding, so I tried pd.get_dummies, which is documented as doing one-hot encoding. But when I feed it input like the above:

import pandas as pd
test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])
pd.get_dummies(test_hot)

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 899, in get_dummies
    dtype=dtype)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 906, in _get_dummies_1d
    codes, levels = _factorize_from_iterable(Series(data))
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 347, in __init__
    codes, categories = factorize(values, sort=False)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'list'

The method works fine if I feed it a single flat list of values such as:

[1,2,3,4,5]

It then produces a 5x5 matrix, but each row contains only a single 1. I'd like to extend this so that a row can hold more than one 1, by feeding in a column of lists.

asked Apr 14 '19 08:04

Ben C Wang

2 Answers

If performance is important, use MultiLabelBinarizer from scikit-learn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])

# fit_transform returns a 0/1 array; classes_ holds the sorted labels
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(test_hot), columns=mlb.classes_)
print(df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1

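To make your get_dummies approach work, build a DataFrame from the lists, reshape it with DataFrame.stack, call get_dummies, and finally aggregate with max per original row:

# max(level=0) was removed in pandas 2.0; groupby(level=0).max() is equivalent.
# dtype=int keeps 0/1 output on pandas 2.x, where get_dummies defaults to bool.
df = (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int), dtype=int)
        .groupby(level=0)
        .max())

print(df)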
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1

Details:

First, create a MultiIndex Series (original row, position in list):

print(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
0  0    1
   1    2
   2    3
1  0    3
   1    4
   2    5
2  0    1
   1    6
dtype: int32

Then call pd.get_dummies, which gives one indicator row per list element:

print (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int)))
     1  2  3  4  5  6
0 0  1  0  0  0  0  0
  1  0  1  0  0  0  0
  2  0  0  1  0  0  0
1 0  0  0  1  0  0  0
  1  0  0  0  1  0  0
  2  0  0  0  0  1  0
2 0  1  0  0  0  0  0
  1  0  0  0  0  0  1

Finally, take the max per first index level, which collapses the rows back to one per original list.
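The three steps above can be put together as a self-contained sketch. It uses the example data from the question rather than test_hot, and the names lists, wide, stacked and one_hot are only illustrative. The explicit dropna() is an assumption-driven safeguard: older pandas versions drop the NaN padding inside stack() already, while newer ones may keep it.

```python
import pandas as pd

# Example data from the question: lists of different lengths.
lists = pd.Series([[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]])

# Step 1: one row per list. Shorter lists are padded with NaN,
# which turns the values into floats.
wide = pd.DataFrame(lists.tolist())

# Step 2: reshape to a Series indexed by (original row, position in list).
# Drop the NaN padding explicitly, then cast back to int for clean labels.
stacked = wide.stack().dropna().astype(int)

# Step 3: one indicator row per list element, then collapse the rows
# belonging to the same original list with a max over index level 0.
one_hot = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()

print(one_hot)
```

The cast back to int matters because the padding in step 1 makes the values floats, which would otherwise produce column labels such as 1.0 and 2.0.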

answered Sep 22 '22 14:09

jezrael

To fix your get_dummies code, join each list into a comma-separated string and use Series.str.get_dummies:

df['lists'].map(lambda x: ','.join(map(str, x))).str.get_dummies(sep=',')

   1  2  3  4  5
0  1  0  1  1  1
1  0  1  0  0  0
2  0  0  1  0  1
3  0  1  1  0  1
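As a runnable sketch of the same idea, assuming df is a DataFrame holding the question's data in a column named lists (the names joined and encoded are only illustrative):

```python
import pandas as pd

# Assumed setup: the question's data in a column called "lists".
df = pd.DataFrame({"lists": [[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]]})

# Join each list into a comma-separated string, e.g. "1,3,4,5".
joined = df["lists"].map(lambda x: ",".join(map(str, x)))

# Series.str.get_dummies splits on the separator and builds 0/1 columns.
encoded = joined.str.get_dummies(sep=",")

# The column labels are strings here ("1", "2", ...). Convert them back
# to int and re-sort so that multi-digit values end up in numeric order.
encoded.columns = encoded.columns.astype(int)
encoded = encoded.sort_index(axis=1)

print(encoded)
```

Two caveats with this approach: the values must not contain the separator character themselves, and because the labels pass through strings they sort lexicographically ("10" before "2") until they are converted back.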

answered Sep 19 '22 14:09

cs95
