How to unnest (explode) a column in a pandas DataFrame

Tags:

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]}) df Out[458]:     A       B 0  1  [1, 2] 1  2  [1, 2]

My expected output is:

   A  B 0  1  1 1  1  2 3  2  1 4  2  2

What should I do to achieve this?

BENY

2 Answers

I know object dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.

I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.

Method 0 [pandas >= 0.25] Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')         A  B     0  1  1     1  1  2     0  2  1     1  2  2

Given a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN will need to be filled with a list

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]}) df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with [] df.explode('B')     A    B 0  1    1 0  1    2 1  2    1 1  2    2 2  3  NaN 3  4  NaN

Method 1 apply + pd.Series (easy to understand but in terms of performance not recommended . )

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'}) Out[463]:    A  B 0  1  1 1  1  2 0  2  1 1  2  2

Method 2 Using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)}) df Out[465]:    A  B 0  1  1 0  1  2 1  2  1 1  2  2

Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .

Solution : join or merge with the index after 'unnest' the single columns

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len())) s.join(df.drop('B',1),how='left') Out[477]:    B  A 0  1  1 0  2  1 1  1  2 1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B',1),how='left').reindex(columns=df.columns)

Method 3 recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns) Out[488]:    A  B 0  1  1 1  1  2 2  2  1 3  2  2

If more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y]) s.merge(df,left_on=0,right_index=True) Out[491]:    0  1  A       B 0  0  1  1  [1, 2] 1  0  2  1  [1, 2] 2  1  1  2  [1, 2] 3  1  2  2  [1, 2]

Method 4 using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values)) Out[554]:    A  B 0  1  1 0  1  2 1  2  1 1  2  2  #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5 when the list only contains unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]}) from collections import ChainMap d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A']))) pd.DataFrame(list(d.items()),columns=df.columns[::-1]) Out[574]:    B  A 0  1  1 1  2  1 2  3  2 3  4  2

Method 6 using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values))) pd.DataFrame(data=newvalues[0],columns=df.columns)    A  B 0  1  1 1  1  2 2  2  1 3  2  2

Method 7 using base function itertools cycle and chain: Pure python solution just for fun

from itertools import cycle,chain l=df.values.tolist() l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l] pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)    A  B 0  1  1 1  1  2 2  2  1 3  2  2

Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]}) df Out[592]:    A       B       C 0  1  [1, 2]  [1, 2] 1  2  [3, 4]  [3, 4]

Self-def function:

def unnesting(df, explode):     idx = df.index.repeat(df[explode[0]].str.len())     df1 = pd.concat([         pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)     df1.index = idx      return df1.join(df.drop(explode, 1), how='left')   unnesting(df,['B','C']) Out[609]:    B  C  A 0  1  1  1 0  2  2  1 1  3  3  2 1  4  4  2

Column-wise Unnesting

All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame constructor

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_')) Out[33]:    A       B       C  B_0  B_1 0  1  [1, 2]  [1, 2]    1    2 1  2  [3, 4]  [3, 4]    3    4

Updated function

def unnesting(df, explode, axis):     if axis==1:         idx = df.index.repeat(df[explode[0]].str.len())         df1 = pd.concat([             pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)         df1.index = idx          return df1.join(df.drop(explode, 1), how='left')     else :         df1 = pd.concat([                          pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)         return df1.join(df.drop(explode, 1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0) Out[36]:    B0  B1  C0  C1  A 0   1   2   1   2  1 1   3   4   3   4  2

Update 2021-02-17 with original explode function

def unnesting(df, explode, axis):     if axis==1:         df1 = pd.concat([df[x].explode() for x in explode], axis=1)         return df1.join(df.drop(explode, 1), how='left')     else :         df1 = pd.concat([                          pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)         return df1.join(df.drop(explode, 1), how='left')

169

answered Sep 20 '22 20:09

BENY

Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist())     a = np.repeat(df.A, vals.shape[1])  pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

   A  B 0  1  1 1  1  2 2  2  1 3  2  2

Option 2

If the sublists have different length, you need an additional step:

vals = df.B.values.tolist() rs = [len(r) for r in vals]     a = np.repeat(df.A, rs)  pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

   A  B 0  1  1 1  1  2 2  2  1 3  2  2

Option 3

I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],                    'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

   A          B          C  D 0  1     [1, 2]  [1, 2, 3]  A 1  2  [1, 2, 3]     [1, 2]  B 2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):     vals = df[explode].sum(1)     rs = [len(r) for r in vals]     a = np.repeat(df[tile].values, rs, axis=0)     b = np.concatenate(vals.values)     d = np.column_stack((a, b))     return pd.DataFrame(d, columns = tile +  ['_'.join(explode)])  unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C 0   1  A   1 1   1  A   2 2   1  A   1 3   1  A   2 4   1  A   3 5   2  B   1 6   2  B   2 7   2  B   3 8   2  B   1 9   2  B   2 10  3  C   1 11  3  C   1 12  3  C   2

Functions

def wen1(df):     return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})  def wen2(df):     return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})  def wen3(df):     s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))     return s.join(df.drop('B', 1), how='left')  def wen4(df):     return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)  def chris1(df):     vals = np.array(df.B.values.tolist())     a = np.repeat(df.A, vals.shape[1])     return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)  def chris2(df):     vals = df.B.values.tolist()     rs = [len(r) for r in vals]     a = np.repeat(df.A.values, rs)     return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd import matplotlib.pyplot as plt import numpy as np from timeit import timeit  res = pd.DataFrame(        index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],        columns=[10, 50, 100, 500, 1000, 5000, 10000],        dtype=float )  for f in res.index:     for c in res.columns:         df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})         df = pd.concat([df]*c)         stmt = '{}(df)'.format(f)         setp = 'from __main__ import df, {}'.format(f)         res.at[f, c] = timeit(stmt, setp, number=50)  ax = res.div(res.min()).T.plot(loglog=True) ax.set_xlabel("N") ax.set_ylabel("time (relative)")

Performance

enter image description here

answered Sep 17 '22 20:09

user3483203

Related questions
                            
                                What are the differences between the threading and multiprocessing modules?
                            
                                Numpy where function multiple conditions
                            
                                What's a standard way to do a no-op in python?
                            
                                Any way to clear python's IDLE window?
                            
                                type object 'datetime.datetime' has no attribute 'datetime'
                            
                                How do you use the ellipsis slicing syntax in Python?
                            
                                How do you create nested dict in Python?
                            
                                Basic http file downloading and saving to disk in python?
                            
                                Counting the number of True Booleans in a Python List
                            
                                How to get item's position in a list?
                            
                                Does Python SciPy need BLAS?
                            
                                setting an environment variable in virtualenv
                            
                                Slicing of a NumPy 2d array, or how do I extract an mxm submatrix from an nxn array (n>m)?
                            
                                Let JSON object accept bytes or let urlopen output strings
                            
                                How to print a groupby object
                            
                                numpy.where() detailed, step-by-step explanation / examples [closed]
                            
                                How to activate an Anaconda environment
                            
                                Set Colorbar Range in matplotlib
                            
                                Convert columns into rows with Pandas
                            
                                A column-vector y was passed when a 1d array was expected

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to unnest (explode) a column in a pandas DataFrame

Tags:

python

pandas

dataframe

BENY

People also ask

2 Answers

Column-wise Unnesting

BENY

user3483203

Recent Activity

Donate For Us