
Pandas convert a column of list to dummies

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

What I would like to do is create a set of dummy columns that identify which groups each user belongs to, so that I can run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows.

user2900369 asked Mar 13 '15 14:03


3 Answers

Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

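  • .apply(pd.Series) converts the series of lists to a dataframe, one column per list position
  • .stack() puts everything in one column again (creating a multi-level index)
  • pd.get_dummies( ) creates the dummies
  • .sum(level=0) merges the rows that belong to the same original row back into one (it sums over the second index level and keeps only the original level, level=0). Note that the level argument of .sum() was removed in pandas 2.0; .groupby(level=0).sum() is the equivalent there.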

A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)

I don't know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
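For readers on pandas 2.0 or later, the four steps above can be sketched with .groupby(level=0).sum() in place of .sum(level=0). This is only a minimal sketch, not the answer's original code: the intermediate names (wide, long_form, dummies, result) are illustrative, and dtype=int is passed as an assumption so that the dummies are integers whatever the installed version's default is.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Step 1: one column per list position; shorter lists are padded with NaN.
wide = s.apply(pd.Series)

# Step 2: back to a single column with a (row, position) MultiIndex.
# The explicit dropna() removes the NaN padding if stack() kept it.
long_form = wide.stack().dropna()

# Step 3: one indicator column per distinct label.
dummies = pd.get_dummies(long_form, dtype=int)

# Step 4: collapse the position level so each original row appears once.
result = dummies.groupby(level=0).sum()
print(result)
```

The result has one row per original index value and one column per label that occurs at least once (a, b, c and e for this data).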

joris answered Oct 14 '22 07:10


Very fast solution in case you have a large dataframe

Using sklearn.preprocessing.MultiLabelBinarizer

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

This worked for me and has also been suggested elsewhere.
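The desired output in the question also has a column d, which no user belongs to, while the result above only contains labels that actually occur. If the full set of groups is known in advance, one possible way to get that column is the classes parameter of MultiLabelBinarizer. A minimal sketch, assuming a hypothetical list all_groups that is not part of the original answer:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Assumption: the complete set of groups is known up front.
all_groups = ['a', 'b', 'c', 'd', 'e']

# Passing classes fixes both the set and the order of the output columns.
mlb = MultiLabelBinarizer(classes=all_groups)
dummies = pd.DataFrame(mlb.fit_transform(df['groups']),
                       columns=mlb.classes_, index=df.index)
print(dummies)
```

Groups that never occur, such as d here, come out as all-zero columns.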

Teoretic answered Oct 14 '22 05:10


This is even faster: pd.get_dummies(df['groups'].explode()).sum(level=0)

It uses .explode() (available since pandas 0.25) instead of .apply(pd.Series).stack().
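On pandas 2.0 or later, where .sum(level=0) is no longer available, the same idea can be sketched with .groupby(level=0).sum(). The extra row with an empty list and the dtype=int argument are my own additions for illustration, not part of the original one-liner.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e'], []]})

# explode() gives one row per list element and repeats the original index;
# an empty list becomes a single NaN row.
exploded = df['groups'].explode()

# get_dummies() ignores NaN by default, so the empty-list row is all zeros.
# Grouping on the repeated index folds the elements back into one row each.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
print(dummies)
```

Because the empty list is kept as a NaN row by explode(), the user at index 5 still appears in the result, with zeros in every column.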

Comparing with the other solutions:

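import timeit
import pandas as pd

# Setup code executed before each timing repeat: builds the example Series and DataFrame.
setup = '''
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
# m1: apply(pd.Series) + stack, m2: one Series per row + fillna, m3: explode.
# Note: .sum(level=0) was removed in pandas 2.0, so m1 and m3 need an older pandas as written.
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
# Each repeat runs the statement 1000 times, so the total in seconds equals milliseconds per run.
times = {f"m{i+1}": min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times], index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167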
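The timings above come from a five-row series, so they may not carry over to the 500,000+ rows mentioned in the question. Below is a rough sketch of how the comparison could be repeated on larger synthetic data. The row count, the label set and the helper names via_stack and via_explode are all assumptions of mine, and the code uses .groupby(level=0).sum() so that it also runs on pandas 2.0 or later.

```python
import random
import timeit

import pandas as pd

random.seed(0)
labels = list('abcde')
n_rows = 2_000  # hypothetical size; increase it to approach the real workload

# Synthetic data: each row gets between 1 and 5 distinct labels.
s = pd.Series([random.sample(labels, random.randint(1, len(labels)))
               for _ in range(n_rows)])

def via_stack(series):
    # apply(pd.Series) + stack, with groupby replacing sum(level=0)
    stacked = series.apply(pd.Series).stack().dropna()
    return pd.get_dummies(stacked, dtype=int).groupby(level=0).sum()

def via_explode(series):
    # explode-based variant
    return pd.get_dummies(series.explode(), dtype=int).groupby(level=0).sum()

# Both variants should give the same indicator matrix.
same = bool((via_stack(s).to_numpy() == via_explode(s).to_numpy()).all())

# Best of 3 single runs, in seconds.
timings = {
    'stack': min(timeit.repeat(lambda: via_stack(s), repeat=3, number=1)),
    'explode': min(timeit.repeat(lambda: via_explode(s), repeat=3, number=1)),
}
print(same, timings)
```

The absolute numbers depend on the machine and the pandas version, so this is best used to compare the approaches on your own data rather than to quote timings.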
RBA answered Oct 14 '22 06:10