Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

Tags: python, json, pandas, dataframe

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. The data is also very large, so I cannot use the solutions available on the internet: they are very slow and memory-inefficient.

Here is what my data looks like:

    df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'],
                       'B': [['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']],
                       'C': [['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],
                       'D': [['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']],
                       'E': [['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})

        A         B         C         D         E
    0  x1  [v1, v2]  [c1, c2]  [d1, d2]  [e1, e2]
    1  x2  [v3, v4]  [c3, c4]  [d3, d4]  [e3, e4]
    2  x3  [v5, v6]  [c5, c6]  [d5, d6]  [e5, e6]
    3  x4  [v7, v8]  [c7, c8]  [d7, d8]  [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    0  x1  v2  c2  d2  e2
    1  x2  v3  c3  d3  e3
    1  x2  v4  c4  d4  e4
    .....

EDIT: After this question was marked as a duplicate, I would like to stress that I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).


asked Aug 23 '17 18:08

Moh

2 Answers

pandas >= 0.25

Assuming that, within each row, all list columns hold lists of the same length, you can call Series.explode on each column.

    df.set_index(['A']).apply(pd.Series.explode).reset_index()

        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8

The idea is to first set as the index all columns that must NOT be exploded, then reset the index afterwards.
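To illustrate the set-index idea when more than one column must stay unexploded, here is a minimal sketch. The helper name `explode_list_columns` and the extra column `F` are hypothetical and not part of the answer, and the sketch assumes the lists within a row all have the same length. For comparison it also calls `DataFrame.explode` with a list of columns, which requires pandas >= 1.3.

```python
import pandas as pd

def explode_list_columns(df, list_cols):
    """Explode `list_cols` together, keeping all other columns as-is.

    Hypothetical helper; assumes the lists in each row have equal lengths.
    """
    # every column that must NOT be exploded goes into the index
    keep_cols = [c for c in df.columns if c not in list_cols]
    out = df.set_index(keep_cols).apply(pd.Series.explode).reset_index()
    # restore the original column order
    return out[list(df.columns)]

df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'F': [10, 20],                      # second non-list column (made up)
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

result = explode_list_columns(df, ['B', 'C'])

# pandas >= 1.3 only: explode several columns in a single call
builtin = df.explode(['B', 'C'], ignore_index=True)
```

Both `result` and `builtin` should contain four rows, with `A` and `F` repeated once per list element.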

It's also faster than the older approach that chains apply(pd.Series) and stack():

    %timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

    %%timeit
    (df.set_index('A')
       .apply(lambda x: x.apply(pd.Series).stack())
       .reset_index()
       .drop('level_1', axis=1))

    2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

answered Oct 08 '22 13:10

cs95

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value=''):
        # make sure `lst_cols` is a list
        if lst_cols and not isinstance(lst_cols, list):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)

        # calculate lengths of lists (taken from the first list column)
        lens = df[lst_cols[0]].str.len()

        if (lens > 0).all():
            # ALL lists in cells aren't empty
            return pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols
            }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
              .loc[:, df.columns]
        else:
            # at least one list in cells is empty
            exploded = pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols
            }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            return pd.concat([exploded, df.loc[lens == 0, idx_cols]], sort=False) \
                     .fillna(fill_value) \
                     .loc[:, df.columns]

Usage:

    In [82]: explode(df, lst_cols=list('BCDE'))
    Out[82]:
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8
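Both answers take for granted that, within a row, every list column holds a list of the same length; the `explode` function above computes its lengths from the first list column only. A quick check before exploding might look like the sketch below. The helper `rows_with_mismatched_lengths` and the two small frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def rows_with_mismatched_lengths(df, list_cols):
    """Return the index labels of rows whose lists differ in length.

    Hypothetical helper, not part of either answer.
    """
    # one column of list lengths per list column
    lens = pd.DataFrame({col: df[col].str.len() for col in list_cols})
    # a row is consistent when all of its lengths are identical
    mismatched = lens.nunique(axis=1) > 1
    return df.index[mismatched.to_numpy()].tolist()

good = pd.DataFrame({'A': ['x1', 'x2'],
                     'B': [['v1', 'v2'], ['v3']],
                     'C': [['c1', 'c2'], ['c3']]})
bad = pd.DataFrame({'A': ['x1', 'x2'],
                    'B': [['v1', 'v2'], ['v3']],
                    'C': [['c1', 'c2'], ['c3', 'c4']]})

print(rows_with_mismatched_lengths(good, ['B', 'C']))  # []
print(rows_with_mismatched_lengths(bad, ['B', 'C']))   # [1]
```

An empty result means the per-row lengths agree, so either answer's approach can be applied as written.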

answered Oct 08 '22 15:10

MaxU - stop WAR against UA
