I have the following DataFrame where one of the columns is an object (list type cell):
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]}) df Out[458]: A B 0 1 [1, 2] 1 2 [1, 2]
My expected output is:
A B 0 1 1 1 1 2 3 2 1 4 2 2
What should I do to achieve this?
Related question
pandas: When cell contents are lists, create a row for each element in the list
Good question and answer but only handle one column with list(In my answer the self-def function will work for multiple columns, also the accepted answer is use the most time consuming apply
, which is not recommended, check more info When should I ever want to use pandas apply() in my code?)
Pandas DataFrame: explode() functionThe explode() function is used to transform each element of a list-like to a row, replicating the index values. Exploded lists to rows of the subset columns; index will be duplicated for these rows. Raises: ValueError - if columns of the frame are not unique.
If some of the elements in the series object have lists, then we can unnest those list elements into multiple rows of the series object. Unnesting is nothing but exploding the lists into rows. So this transformation can be done easily with the help of the pandas series. explode() method.
Column(s) to explode. For multiple columns, specify a non-empty list with each element be str or tuple, and all specified columns their list-like data on same row of the frame must have matching length. If True, the resulting index will be labeled 0, 1, …, n - 1. New in version 1.1.
Split column by delimiter into multiple columnsApply the pandas series str. split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.
I know object
dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.
I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25] Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode
function:
df.explode('B') A B 0 1 1 1 1 2 0 2 1 1 2 2
Given a dataframe with an empty list
or a NaN
in the column. An empty list will not cause an issue, but a NaN
will need to be filled with a list
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]}) df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with [] df.explode('B') A B 0 1 1 0 1 2 1 2 1 1 2 2 2 3 NaN 3 4 NaN
Method 1 apply + pd.Series
(easy to understand but in terms of performance not recommended . )
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'}) Out[463]: A B 0 1 1 1 1 2 0 2 1 1 2 2
Method 2 Using repeat
with DataFrame
constructor , re-create your dataframe (good at performance, not good at multiple columns )
df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)}) df Out[465]: A B 0 1 1 0 1 2 1 2 1 1 2 2
Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .
Solution : join
or merge
with the index
after 'unnest' the single columns
s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len())) s.join(df.drop('B',1),how='left') Out[477]: B A 0 1 1 0 2 1 1 1 2 1 2 2
If you need the column order exactly the same as before, add reindex
at the end.
s.join(df.drop('B',1),how='left').reindex(columns=df.columns)
Method 3 recreate the list
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns) Out[488]: A B 0 1 1 1 1 2 2 2 1 3 2 2
If more than two columns, use
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y]) s.merge(df,left_on=0,right_index=True) Out[491]: 0 1 A B 0 0 1 1 [1, 2] 1 0 2 1 [1, 2] 2 1 1 2 [1, 2] 3 1 2 2 [1, 2]
Method 4 using reindex
or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values)) Out[554]: A B 0 1 1 0 1 2 1 2 1 1 2 2 #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5 when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]}) from collections import ChainMap d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A']))) pd.DataFrame(list(d.items()),columns=df.columns[::-1]) Out[574]: B A 0 1 1 1 2 1 2 3 2 3 4 2
Method 6 using numpy
for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values))) pd.DataFrame(data=newvalues[0],columns=df.columns) A B 0 1 1 1 1 2 2 2 1 3 2 2
Method 7 using base function itertools
cycle
and chain
: Pure python solution just for fun
from itertools import cycle,chain l=df.values.tolist() l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l] pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns) A B 0 1 1 1 1 2 2 2 1 3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]}) df Out[592]: A B C 0 1 [1, 2] [1, 2] 1 2 [3, 4] [3, 4]
Self-def function:
def unnesting(df, explode): idx = df.index.repeat(df[explode[0]].str.len()) df1 = pd.concat([ pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1) df1.index = idx return df1.join(df.drop(explode, 1), how='left') unnesting(df,['B','C']) Out[609]: B C A 0 1 1 1 0 2 2 1 1 3 3 2 1 4 4 2
All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame
constructor
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_')) Out[33]: A B C B_0 B_1 0 1 [1, 2] [1, 2] 1 2 1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis): if axis==1: idx = df.index.repeat(df[explode[0]].str.len()) df1 = pd.concat([ pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1) df1.index = idx return df1.join(df.drop(explode, 1), how='left') else : df1 = pd.concat([ pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1) return df1.join(df.drop(explode, 1), how='left')
Test Output
unnesting(df, ['B','C'], axis=0) Out[36]: B0 B1 C0 C1 A 0 1 2 1 2 1 1 3 4 3 4 2
Update 2021-02-17 with original explode function
def unnesting(df, explode, axis): if axis==1: df1 = pd.concat([df[x].explode() for x in explode], axis=1) return df1.join(df.drop(explode, 1), how='left') else : df1 = pd.concat([ pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1) return df1.join(df.drop(explode, 1), how='left')
Option 1
If all of the sublists in the other column are the same length, numpy
can be an efficient option here:
vals = np.array(df.B.values.tolist()) a = np.repeat(df.A, vals.shape[1]) pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B 0 1 1 1 1 2 2 2 1 3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist() rs = [len(r) for r in vals] a = np.repeat(df.A, rs) pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B 0 1 1 1 1 2 2 2 1 3 2 2
Option 3
I took a shot at generalizing this to work to flatten N
columns and tile M
columns, I'll work later on making it more efficient:
df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]], 'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})
A B C D 0 1 [1, 2] [1, 2, 3] A 1 2 [1, 2, 3] [1, 2] B 2 3 [1] [1, 2] C
def unnest(df, tile, explode): vals = df[explode].sum(1) rs = [len(r) for r in vals] a = np.repeat(df[tile].values, rs, axis=0) b = np.concatenate(vals.values) d = np.column_stack((a, b)) return pd.DataFrame(d, columns = tile + ['_'.join(explode)]) unnest(df, ['A', 'D'], ['B', 'C'])
A D B_C 0 1 A 1 1 1 A 2 2 1 A 1 3 1 A 2 4 1 A 3 5 2 B 1 6 2 B 2 7 2 B 3 8 2 B 1 9 2 B 2 10 3 C 1 11 3 C 1 12 3 C 2
Functions
def wen1(df): return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'}) def wen2(df): return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)}) def wen3(df): s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len())) return s.join(df.drop('B', 1), how='left') def wen4(df): return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns) def chris1(df): vals = np.array(df.B.values.tolist()) a = np.repeat(df.A, vals.shape[1]) return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns) def chris2(df): vals = df.B.values.tolist() rs = [len(r) for r in vals] a = np.repeat(df.A.values, rs) return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd import matplotlib.pyplot as plt import numpy as np from timeit import timeit res = pd.DataFrame( index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'], columns=[10, 50, 100, 500, 1000, 5000, 10000], dtype=float ) for f in res.index: for c in res.columns: df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]}) df = pd.concat([df]*c) stmt = '{}(df)'.format(f) setp = 'from __main__ import df, {}'.format(f) res.at[f, c] = timeit(stmt, setp, number=50) ax = res.div(res.min()).T.plot(loglog=True) ax.set_xlabel("N") ax.set_ylabel("time (relative)")
Performance
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With