I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. Also, the data is very big, so I cannot use the solutions available on the internet: they are very slow and memory-inefficient.
Here is what my data looks like:
    df = pd.DataFrame({
        'A': ['x1', 'x2', 'x3', 'x4'],
        'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
        'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
        'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
        'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
    })

        A         B         C         D         E
    0  x1  [v1, v2]  [c1, c2]  [d1, d2]  [e1, e2]
    1  x2  [v3, v4]  [c3, c4]  [d3, d4]  [e3, e4]
    2  x3  [v5, v6]  [c5, c6]  [d5, d6]  [e5, e6]
    3  x4  [v7, v8]  [c7, c8]  [d7, d8]  [e7, e8]
And this is the shape of my data: (441079, 12)
My desired output is:
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    0  x1  v2  c2  d2  e2
    1  x2  v3  c3  d3  e3
    1  x2  v4  c4  d4  e4
    .....
EDIT: After this was marked as a duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).
apply(pd.Series.explode): this will explode all the columns with lists in your dataframe.
From the pandas documentation for DataFrame.explode: column is the column(s) to explode. For multiple columns, specify a non-empty list in which each element is a str or tuple; the list-like data in all specified columns must have matching lengths on each row of the frame. If ignore_index is True, the resulting index will be labeled 0, 1, …, n - 1 (new in version 1.1).
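As a minimal sketch of the parameters described above, assuming pandas 1.3 or later (where DataFrame.explode accepts a list of columns), a cut-down version of the sample frame can be exploded in a single call. The names `df` and `out` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

# Explode B and C together; the lists on each row must have matching lengths.
# ignore_index=True relabels the result 0..n-1 instead of repeating the
# original index for every exploded element.
out = df.explode(['B', 'C'], ignore_index=True)
print(out)
```

Without ignore_index=True, the original row labels are repeated (0, 0, 1, 1), which matches the index shown in the desired output of the question.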
Assuming the lists on each row have the same length across all columns, you can call Series.explode on each column.
    df.set_index(['A']).apply(pd.Series.explode).reset_index()

        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8
The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
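When more than one column must be left alone, the same idea applies with several index columns. The sketch below is only an illustration: the extra scalar column `F` and the names `list_cols` and `key_cols` are hypothetical and not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'F': [10, 20],                      # hypothetical extra scalar column
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

list_cols = ['B', 'C']
# Everything that is not a list column goes into the index first.
key_cols = [c for c in df.columns if c not in list_cols]

# Explode each remaining column, then turn the index back into columns.
out = df.set_index(key_cols).apply(pd.Series.explode).reset_index()
print(out)
```

Note that the key columns come first in the result, regardless of their position in the original frame.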
It's also faster than the apply(pd.Series).stack() alternative:
    %timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

    %%timeit
    (df.set_index('A')
       .apply(lambda x: x.apply(pd.Series).stack())
       .reset_index()
       .drop('level_1', axis=1))

    2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
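The set_index / explode approach relies on the lists in each row having matching lengths. On a large frame it may be worth verifying that before exploding; the following is one possible check, sketched on a small frame where the second row is deliberately inconsistent. The names `list_cols`, `lens`, and `consistent` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3']],   # second row is deliberately one element short
})

list_cols = ['B', 'C']  # hypothetical: the columns you intend to explode

# Length of the list in every cell, one column per list column.
lens = df[list_cols].apply(lambda s: s.map(len))

# A row is consistent when all of its list lengths are equal.
consistent = lens.nunique(axis=1).eq(1)

# Show the keys of the rows that would break the explode.
print(df.loc[~consistent, 'A'].tolist())
```

Rows flagged here can be fixed or filtered out before the explode step.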
    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value=''):
        # make sure `lst_cols` is a list
        if lst_cols and not isinstance(lst_cols, list):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists (taken from the first list column)
        lens = df[lst_cols[0]].str.len()

        # repeat the non-list columns once per list element,
        # then flatten each list column
        res = pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

        if (lens > 0).all():
            # ALL lists in cells aren't empty
            return res.loc[:, df.columns]
        # at least one list in cells is empty: add those rows back
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
        return pd.concat([res, df.loc[lens == 0, idx_cols]]) \
            .fillna(fill_value) \
            .loc[:, df.columns]
Usage:
    In [82]: explode(df, lst_cols=list('BCDE'))
    Out[82]:
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8