
Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. The data is also very large, so the solutions I have found online are not usable: they are very slow and memory-inefficient.

Here is what my data looks like:

import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})

    A         B         C         D         E
0  x1  [v1, v2]  [c1, c2]  [d1, d2]  [e1, e2]
1  x2  [v3, v4]  [c3, c4]  [d3, d4]  [e3, e4]
2  x3  [v5, v6]  [c5, c6]  [d5, d6]  [e5, e6]
3  x4  [v7, v8]  [c7, c8]  [d7, d8]  [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
.....

EDIT: Since this question was marked as a duplicate, I would like to stress that I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).

Moh asked Aug 23 '17 18:08



2 Answers

pandas >= 0.25

Assuming that, within each row, the lists in all columns have the same length, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
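If several columns must stay intact, the same idea can be wrapped in a small helper. This is only a sketch: the function name `explode_columns`, the `id_cols` parameter, and the extra column `K` are made up for illustration, and it still assumes the lists in each row have matching lengths.

```python
import pandas as pd

def explode_columns(df, id_cols):
    """Explode every column NOT listed in `id_cols` (hypothetical helper)."""
    # Move the columns that must stay intact into the index,
    # explode each remaining column, then restore them as columns.
    return df.set_index(id_cols).apply(pd.Series.explode).reset_index()

# Small illustrative frame with two non-list columns ('A' and 'K')
df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'K': [1, 2],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

out = explode_columns(df, ['A', 'K'])
print(out)
```

Each row of `A` and `K` is repeated once per list element, and `reset_index` puts them back in front of the exploded columns.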


It's also faster than the apply(pd.Series) + stack approach:

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))

2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
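On newer pandas versions there is also a built-in alternative worth trying: as far as I know, DataFrame.explode accepts a list of columns since pandas 1.3, which avoids the set_index/reset_index round trip. The frame below is a cut-down version of the one in the question, and I have not timed this variant against the approaches above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

# The list-of-columns form needs pandas >= 1.3 (assumption about your install);
# ignore_index=True relabels the resulting rows 0..n-1.
out = df.explode(['B', 'C'], ignore_index=True)
print(out)
```

As with the Series.explode approach, the lists in each row must have the same length across the exploded columns.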
cs95 answered Oct 08 '22 13:10


import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists (taken from the first list column only)
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        res = pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        # Rows with empty lists are added back and padded with `fill_value`.
        return pd.concat([res, df.loc[lens == 0, idx_cols]]).fillna(fill_value) \
                 .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
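Note that the function takes the list lengths from the first list column only, so it quietly assumes that the other list columns match row by row. A cheap check before exploding might look like the sketch below; the name `list_lengths_match` and the two sample frames are made up for illustration.

```python
import pandas as pd

def list_lengths_match(df, lst_cols):
    """Return True if, in every row, all `lst_cols` hold lists of equal length."""
    # Series.str.len() also works on list-valued cells
    lens = pd.DataFrame({col: df[col].str.len() for col in lst_cols})
    # compare every column of lengths against the first one, row by row
    return bool(lens.eq(lens[lst_cols[0]], axis=0).all().all())

good = pd.DataFrame({'A': ['x1', 'x2'],
                     'B': [['v1', 'v2'], ['v3']],
                     'C': [['c1', 'c2'], ['c3']]})
bad = pd.DataFrame({'A': ['x1', 'x2'],
                    'B': [['v1', 'v2'], ['v3']],
                    'C': [['c1'], ['c3', 'c4']]})

print(list_lengths_match(good, ['B', 'C']))  # True
print(list_lengths_match(bad, ['B', 'C']))   # False
```

The `bad` frame is the awkward case: both list columns flatten to three values in total, so the flattened columns would have the same length and could be paired up wrongly without any error being raised.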
MaxU - stop WAR against UA answered Oct 08 '22 15:10