Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas split a column of unequal length lists into multiple boolean columns

Tags:

python

pandas

Given a DataFrame df1 as follows:

df1 = pd.DataFrame({
    'col1': [1,2,3,4],
    'col2': [['a', 'b'], ['c'], ['a', 'd', 'b'], ['e']]
})

Which looks like:

    col1    col2
0   1       [a, b]
1   2       [c]
2   3       [a, d, b]
3   4       [e]

I want to convert col2 - a column where each cell is a list - into several columns (a, b, c, d, e), where the values are boolean entries defining whether that column name existed in the original list, in the given row.

The output should follow this form:

df2 = pd.DataFrame({
    'col1': [1,2,3,4],
    'a': [True, False, True, False],
    'b': [True, False, True, False],
    'c': [False, True, False, False],
    'd': [False, False, True, False],
    'e': [False, False, False, True]
    
})

Which looks like:

    col1    a       b       c       d       e
0   1       True    True    False   False   False
1   2       False   False   True    False   False
2   3       True    True    False   True    False
3   4       False   False   False   False   True

What's a clean way to do this?

like image 751
Ian Avatar asked Oct 27 '25 15:10

Ian


2 Answers

Alternative approach using str.get_dummies probably more efficient than apply + pd.Series:

df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)

       a      b      c      d      e
0   True   True  False  False  False
1  False  False   True  False  False
2   True   True  False   True  False
3  False  False  False  False   True

Timings:

df1.shape
(40000, 2)

%%timeit
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
286 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
9.43 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 187
Shubham Sharma Avatar answered Oct 29 '25 05:10

Shubham Sharma


One approach

df2 = df1.reset_index().explode('col2')
new_df = pd.crosstab(df2['col1'], df2['col2']).astype(bool)
print(new_df)

col2      a      b      c      d      e
col1                                   
1      True   True  False  False  False
2     False  False   True  False  False
3      True   True  False   True  False
4     False  False  False  False   True

Or

new_df = (df1.reset_index()
              .explode('col2')
              .groupby(['index', 'col2'])['col1']
              .any()
              .unstack(fill_value=False))

new_df = (df1.reset_index()
             .explode('col2')
             .pivot_table(index='index', 
                          values='col1', 
                          columns='col2', 
                          aggfunc='any', 
                         fill_value=False) 
         )
like image 30
ansev Avatar answered Oct 29 '25 06:10

ansev