Reproducible example:
ex = [{"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
{"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
{"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]}]
df = pd.DataFrame(ex)
Gives you
explode1 word explode2
0 [a, e, i] US_12 []
1 [] US_34 [a, e, i]
2 [a, e, i] US_56 [o, u]
You can assume there is also an explode3
and an explode4
column (excluded for the sake of brevity)
Intended Result DataFrame:
exploded_alphabet word exploded_type
0 a US_12 explode1
1 e US_12 explode1
2 i US_12 explode1
3 a US_34 explode2
4 e US_34 explode2
5 i US_34 explode2
6 a US_54 explode1
7 e US_54 explode1
8 i US_54 explode1
9 o US_34 explode2
10 u US_34 explode2
The solution must be reproducible with 4 columns not just 2 mentioned above (I haven't included in my example explode3
and explode4
for the same of brevity)
So total number of rows will be equal to number of elements in all of the lists in explode1
, explode2
, explode3
and explode4
flattened.
My efforts:
Honestly, I'm thinking there must be a shorter Pythonic way rather than exploding each one individually and then exploding those that have multiple types.
df = df.explode("explode1")
df = df.explode("explode2")
The above is incorrect. Since this does not explode the rows simultaneously. It creates duplicates if list is non empty in multiple explosion columns.
The other one is the non-pythonic way where you iterate row wise and create and assign a new column - this is lengthy and easy to do. But this problem has probably been solved in a different way.
How is my question different from other "explode multiple columns" question?:
Exploding them separately. Every element in those columns creates a new row (This is probably already there on SO)
Assign the value in the exploded_type
- Not sure if this has been solved on SO in conjunction to 1.
Column(s) to explode. For multiple columns, specify a non-empty list with each element be str or tuple, and all specified columns their list-like data on same row of the frame must have matching length. If True, the resulting index will be labeled 0, 1, …, n - 1. New in version 1.1.
Pandas DataFrame: explode() functionThe explode() function is used to transform each element of a list-like to a row, replicating the index values. Exploded lists to rows of the subset columns; index will be duplicated for these rows. Raises: ValueError - if columns of the frame are not unique.
If you need to remove multiple columns from your dataset, you can either . pop() multiple times, or use pandas . drop() instead.
Pandas DataFrame explode() Method The explode() method converts each element of the specified column(s) into a row.
Use DataFrame.melt
before explode
for unpivot and then remove rows with missing values (from empty lists):
df = (df.melt('word', value_name='exploded_alphabet', var_name='exploded_type')
.explode("exploded_alphabet")
.dropna(subset=['exploded_alphabet'])
.reset_index(drop=True))
print (df)
word exploded_type exploded_alphabet
0 US_12 explode1 a
1 US_12 explode1 e
2 US_12 explode1 i
3 US_56 explode1 a
4 US_56 explode1 e
5 US_56 explode1 i
6 US_34 explode2 a
7 US_34 explode2 e
8 US_34 explode2 i
9 US_56 explode2 o
10 US_56 explode2 u
you can stack
and then explode
:
result = df.set_index('word').stack().explode().dropna().reset_index(
name='exploded_alphabet').rename(columns={'level_1': 'exploded_type'})
OUTPUT:
word exploded_type exploded_alphabet
0 US_12 explode1 a
1 US_12 explode1 e
2 US_12 explode1 i
3 US_34 explode2 a
4 US_34 explode2 e
5 US_34 explode2 i
6 US_56 explode1 a
7 US_56 explode1 e
8 US_56 explode1 i
9 US_56 explode2 o
10 US_56 explode2 u
PERFORMANCE:
for _ in range(20):
df = df.append(df)
len(df) # 3145728
%%timeit
(
df.set_index('word')
.stack().
explode().
dropna().
reset_index(name='exploded_alphabet').
rename(columns={'level_1': 'exploded_type'})
)
4.77 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(
df.melt('word', value_name='exploded_alphabet', var_name='exploded_type')
.explode("exploded_alphabet")
.dropna(subset=['exploded_alphabet'])
)
6.68 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
explode_columns = ['explode1', 'explode2']
pd.melt(
frame=df,
id_vars='word',
value_vars=explode_columns,
var_name='exploded_type',
value_name='exploded_alphabet'
).explode('exploded_alphabet').dropna()
7.17 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use pd.melt
to stack the columns then explode it.
explode_columns = ['explode1', 'explode2']
pd.melt(
frame=df,
id_vars='word',
value_vars=explode_columns,
var_name='exploded_type',
value_name='exploded_alphabet'
).explode('exploded_alphabet').dropna()
It doesn't retain the same order as above but the rows are the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With