Create dummies from a column of non-unique lists in Python

Question

I currently have the following dataframe:

import pandas as pd
df = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                   "col2": [['a', 'b', 'c'],
                            ['c', 'd', 'e', 'f'],
                            ['f', 'b', 'f'],
                            ['a', 'c', 'b'],
                            ['b', 'a', 'b']]})

print(df)
  ID          col2
0  1     [a, b, c]
1  2  [c, d, e, f]
2  3     [f, b, f]
3  4     [a, c, b]
4  5     [b, a, b]

I want to create a new dataframe with dummies for col2, like this:

    ID   a   b   c   d   e   f
0   1    1   1   1   0   0   0
1   2    0   0   1   1   1   1
2   3    0   1   0   0   0   1
3   4    1   1   1   0   0   0
4   5    1   1   0   0   0   0

The following code generates several different columns for the same letter:

df2 = df.col2.str.get_dummies(sep=",")
pd.concat([df['ID'], df2], axis=1)

ID  a   b   b]  c   c]  d   d]  e   f]  [a [b  [c  [f
1   0   1   0   0   1   0   0   0   0   1   0   0   0
2   0   0   0   0   0   1   0   1   1   0   0   1   0
3   0   1   0   0   0   0   0   0   1   0   0   0   1
4   0   0   1   1   0   0   0   0   0   1   0   0   0
5   1   0   0   0   0   0   1   0   0   0   1   0   0

The code above generates a different column for each letter depending on its position in the list, so that b, b] and [b end up as separate columns. Does anyone have an idea why this happens? The pd.get_dummies option doesn't work either.

ayhan · Accepted Answer

str.get_dummies works well on strings, so you can join each list into a delimiter-separated string and call str.get_dummies on that string. For example,

df['col2'].str.join('@').str.get_dummies('@')
Out: 
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0

Here, @ is an arbitrary character that does not appear in the list.

Then, you can concat as usual:

pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
Out: 
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0

Zero · Answer

Using dict comprehensions could be faster:

In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
Out[40]:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0    

In [48]: pd.concat([
                df['ID'], 
                pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
            axis=1)
Out[48]:
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0
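
As a self-contained sketch of why this works: each list becomes a dict whose keys are its letters, so a repeated letter such as the two f's in row 2 counts only once. Letters missing from a row come out as NaN, which fillna(0) turns into 0. The names rows and dummies, and the final sort_index call (added here so the column order does not depend on the order in which letters first appear), are assumptions of this sketch and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                   "col2": [['a', 'b', 'c'],
                            ['c', 'd', 'e', 'f'],
                            ['f', 'b', 'f'],
                            ['a', 'c', 'b'],
                            ['b', 'a', 'b']]})

# One dict per row; dict keys are unique, so duplicates collapse to a single 1.
rows = [{k: 1 for k in x} for x in df['col2']]

# Missing letters become NaN; replace them with 0 and cast back to integers.
dummies = pd.DataFrame(rows).fillna(0).astype(int)

# Sort the columns alphabetically (an addition made for this sketch).
dummies = dummies.sort_index(axis=1)

result = pd.concat([df['ID'], dummies], axis=1)
print(result)
```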

Timings

In [2942]: df.shape
Out[2942]: (50000, 2)

In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
10 loops, best of 3: 137 ms per loop

In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
1 loop, best of 3: 395 ms per loop
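
The %timeit lines above only run inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. How the 50,000-row frame was built is not stated above, so repeating the five sample rows 10,000 times is an assumption, and the function names with_dicts and with_str_dummies are made up for this sketch. Absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                      "col2": [['a', 'b', 'c'],
                               ['c', 'd', 'e', 'f'],
                               ['f', 'b', 'f'],
                               ['a', 'c', 'b'],
                               ['b', 'a', 'b']]})

# Assumption: reach 50,000 rows by repeating the five sample rows.
df = pd.concat([small] * 10000, ignore_index=True)


def with_dicts():
    # Dict-comprehension approach from this answer.
    return pd.DataFrame([{k: 1 for k in x} for x in df['col2']]).fillna(0).astype(int)


def with_str_dummies():
    # Join-then-split approach from the accepted answer.
    return df['col2'].str.join('@').str.get_dummies('@')


for func in (with_dicts, with_str_dummies):
    # Average over three calls to smooth out one-off noise.
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.0f} ms per call")
```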

Tags: python, pandas, python-2.7

CArlos Cardona
