Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create dummies for non-unique lists into column in Python

Currently I've the next dataframe:

import pandas as pd
df= pd.DataFrame({"ID" : ['1','2','3','4','5'], 
                     "col2" : [['a', 'b', 'c'], 
                               ['c', 'd', 'e', 'f'], 
                               ['f', 'b', 'f'], 
                               ['a', 'c', 'b'], 
                               ['b', 'a', 'b']]})

print(df)
  ID          col2
0  1     [a, b, c]
1  2  [c, d, e, f]
2  3     [f, b, f]
3  4     [a, c, b]
4  5     [b, a, d]

I want to create a new dataframe with dummies for col2, like this:

    ID   a   b   c   d   e   f
0   1    1   1   1   0   0   0
1   2    0   0   1   1   1   1
2   3    0   1   0   0   0   1
3   4    1   1   1   0   0   0
4   5    1   1   0   1   0   0

Using the following code generates different columns for each of the letters in the column list:

df2= df.col2.str.get_dummies(sep = ",")
pd.concat([data['col1'], df], axis=1)

ID  a   b   b]  c   c]  d   d]  e   f]  [a [b  [c  [f
1   0   1   0   0   1   0   0   0   0   1   0   0   0
2   0   0   0   0   0   1   0   1   1   0   0   1   0
3   0   1   0   0   0   0   0   0   1   0   0   0   1
4   0   0   1   1   0   0   0   0   0   1   0   0   0
5   1   0   0   0   0   0   1   0   0   0   1   0   0

Using the following code generates different columns for each of the letters in the list of the column according to the position in which they are. Does any of you have any idea why you might be going through this? The pd.get_dummies option also doesn't work.

like image 253
CArlos Cardona Avatar asked Dec 02 '16 18:12

CArlos Cardona


2 Answers

str.get_dummies works well on strings so you can turn your list into a something-separated-string and use str_get_dummies on that string. For example,

df['col2'].str.join('@').str.get_dummies('@')
Out: 
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0

Here, @ is an arbitrary character that does not appear in the list.

Then, you can concat as usual:

pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
Out: 
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0
like image 125
ayhan Avatar answered Oct 06 '22 01:10

ayhan


Using comprehension dicts could be faster

In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
Out[40]:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  0  1  1  1  1
2  0  1  0  0  0  1
3  1  1  1  0  0  0
4  1  1  0  0  0  0    

In [48]: pd.concat([
                df['ID'], 
                pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
            axis=1)
Out[48]:
  ID  a  b  c  d  e  f
0  1  1  1  1  0  0  0
1  2  0  0  1  1  1  1
2  3  0  1  0  0  0  1
3  4  1  1  1  0  0  0
4  5  1  1  0  0  0  0

Timings

In [2942]: df.shape
Out[2942]: (50000, 2)

In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
10 loops, best of 3: 137 ms per loop

In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
1 loop, best of 3: 395 ms per loop
like image 45
Zero Avatar answered Oct 06 '22 00:10

Zero