Pandas - Explode multiple columns in pandas and assign value based on the exploded column

Reproducible example:

import pandas as pd

ex = [{"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
      {"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
      {"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]}]

df = pd.DataFrame(ex)

This gives:

        explode1   word   explode2
    0  [a, e, i]  US_12         []
    1         []  US_34  [a, e, i]
    2  [a, e, i]  US_56     [o, u]

You can assume there is also an explode3 and an explode4 column (excluded for the sake of brevity)

Intended Result DataFrame:

   exploded_alphabet   word exploded_type
0                  a  US_12      explode1
1                  e  US_12      explode1
2                  i  US_12      explode1
3                  a  US_34      explode2
4                  e  US_34      explode2
5                  i  US_34      explode2
6                  a  US_56      explode1
7                  e  US_56      explode1
8                  i  US_56      explode1
9                  o  US_56      explode2
10                 u  US_56      explode2

The solution must work with all 4 columns, not just the 2 shown above (I haven't included explode3 and explode4 in my example for the sake of brevity).

So the total number of rows will equal the total number of elements across all the lists in explode1, explode2, explode3 and explode4.

My efforts:

Honestly, I'm thinking there must be a shorter Pythonic way rather than exploding each one individually and then exploding those that have multiple types.

df = df.explode("explode1")
df = df.explode("explode2")

The above is incorrect because it does not explode the columns independently: when a row has non-empty lists in more than one explode column, the second explode repeats every row produced by the first, creating duplicates.
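To make the duplication concrete, here is a small sketch that runs the chained explode on the example frame and counts the rows produced per word. The names chained and rows_per_word are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
    {"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
    {"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]},
])

# Exploding one column after the other multiplies the list lengths within a row.
chained = df.explode("explode1").explode("explode2")

# Count how many rows each word ended up with.
rows_per_word = chained.groupby("word").size()
print(rows_per_word)

# US_56 holds 3 + 2 = 5 elements in total, yet it ends up with 3 * 2 = 6 rows,
# each pairing one letter from explode1 with one letter from explode2.
```

The empty lists also survive as NaN placeholders, so US_12 and US_34 keep three rows each and the frame has 12 rows instead of the intended 11.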

The other option is the non-Pythonic one: iterate row-wise and build the new columns by hand. This is easy to do but lengthy, and the problem has probably been solved in a better way.
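For reference, the row-wise approach described here might look roughly like the sketch below. It is only one possible reading of that description; explode_columns, records and rowwise are names chosen for this illustration, and the column list would need explode3 and explode4 added for the full data.

```python
import pandas as pd

df = pd.DataFrame([
    {"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
    {"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
    {"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]},
])

# Assumption: only the two columns from the example; extend the list as needed.
explode_columns = ["explode1", "explode2"]

# Visit every row and every list column, emitting one record per list element.
records = [
    {"exploded_alphabet": letter, "word": row["word"], "exploded_type": col}
    for _, row in df.iterrows()
    for col in explode_columns
    for letter in row[col]
]

rowwise = pd.DataFrame(
    records, columns=["exploded_alphabet", "word", "exploded_type"]
)
print(rowwise)
```

Empty lists contribute no records, so no NaN cleanup is needed, but iterrows is usually slow on large frames.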

How is my question different from other "explode multiple columns" questions?

1. Explode the columns separately, so that every element in those columns creates a new row. (This is probably already answered on SO.)
2. Assign the source column name to exploded_type. I'm not sure whether this has been solved on SO in conjunction with 1.

asked May 25 '21 11:05

imperialgendarme

3 Answers

Use DataFrame.melt to unpivot the list columns before calling explode, then drop the rows with missing values (these come from the empty lists):

df = (df.melt('word', value_name='exploded_alphabet', var_name='exploded_type')
        .explode("exploded_alphabet")
        .dropna(subset=['exploded_alphabet'])
        .reset_index(drop=True))
print(df)
     word exploded_type exploded_alphabet
0   US_12      explode1                 a
1   US_12      explode1                 e
2   US_12      explode1                 i
3   US_56      explode1                 a
4   US_56      explode1                 e
5   US_56      explode1                 i
6   US_34      explode2                 a
7   US_34      explode2                 e
8   US_34      explode2                 i
9   US_56      explode2                 o
10  US_56      explode2                 u
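The question asks for four list columns, so here is a hedged sketch of the same melt-then-explode pattern on a frame that has explode1 through explode4. The contents of explode3 and explode4 are invented for illustration, since the question does not show them, and the names df4, list_columns and long4 are hypothetical.

```python
import pandas as pd

# Hypothetical data: the explode3 and explode4 values are made up for this sketch.
df4 = pd.DataFrame({
    "word": ["US_12", "US_34"],
    "explode1": [["a"], []],
    "explode2": [[], ["e", "i"]],
    "explode3": [["o"], []],
    "explode4": [[], ["u"]],
})

# Pick the list columns by name prefix so the code does not hard-code their count.
list_columns = [c for c in df4.columns if c.startswith("explode")]

long4 = (
    df4.melt(id_vars="word", value_vars=list_columns,
             var_name="exploded_type", value_name="exploded_alphabet")
       .explode("exploded_alphabet")           # one row per list element
       .dropna(subset=["exploded_alphabet"])   # empty lists became NaN
       .reset_index(drop=True)
)
print(long4)
```

The row count equals the total number of list elements across the four columns, which is the property the question asks for.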

answered Nov 11 '22 07:11

jezrael

You can stack and then explode:

result = df.set_index('word').stack().explode().dropna().reset_index(
    name='exploded_alphabet').rename(columns={'level_1': 'exploded_type'})

OUTPUT:

     word exploded_type exploded_alphabet
0   US_12      explode1                 a
1   US_12      explode1                 e
2   US_12      explode1                 i
3   US_34      explode2                 a
4   US_34      explode2                 e
5   US_34      explode2                 i
6   US_56      explode1                 a
7   US_56      explode1                 e
8   US_56      explode1                 i
9   US_56      explode2                 o
10  US_56      explode2                 u

PERFORMANCE:


# DataFrame.append was removed in pandas 2.0; pd.concat doubles the frame the same way.
for _ in range(20):
    df = pd.concat([df, df])

len(df)  # 3145728

%%timeit
(
    df.set_index('word')
    .stack()
    .explode()
    .dropna()
    .reset_index(name='exploded_alphabet')
    .rename(columns={'level_1': 'exploded_type'})
)

4.77 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
(
    df.melt('word', value_name='exploded_alphabet', var_name='exploded_type')
    .explode("exploded_alphabet")
    .dropna(subset=['exploded_alphabet'])
)

6.68 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
explode_columns = ['explode1', 'explode2']
pd.melt(
    frame=df,
    id_vars='word',
    value_vars=explode_columns,
    var_name='exploded_type',
    value_name='exploded_alphabet'
).explode('exploded_alphabet').dropna()

7.17 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
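The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, a similar comparison can be made in a plain script with the standard-library timeit module. The helper names via_stack and via_melt and the 3,000-row frame big are assumptions made for this example; absolute timings will differ from the figures above and depend on the machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame([
    {"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
    {"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
    {"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]},
])

# A modest 3,000-row frame keeps the run short; scale the factor up to taste.
big = pd.concat([df] * 1000, ignore_index=True)


def via_stack(frame):
    """Stack the list columns into one Series, then explode it."""
    return (
        frame.set_index("word")
        .stack()
        .explode()
        .dropna()
        .reset_index(name="exploded_alphabet")
        .rename(columns={"level_1": "exploded_type"})
    )


def via_melt(frame):
    """Melt the list columns into long form, then explode the value column."""
    return (
        frame.melt("word", value_name="exploded_alphabet", var_name="exploded_type")
        .explode("exploded_alphabet")
        .dropna(subset=["exploded_alphabet"])
    )


for func in (via_stack, via_melt):
    # Best of three single runs, in seconds.
    best = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f"{func.__name__}: {best:.4f} s")
```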

answered Nov 11 '22 08:11

Nk03

You can use pd.melt to unpivot the columns and then explode the result.

explode_columns = ['explode1', 'explode2']
pd.melt(
    frame=df,
    id_vars='word',
    value_vars=explode_columns,
    var_name='exploded_type',
    value_name='exploded_alphabet'
).explode('exploded_alphabet').dropna()

It doesn't retain the same order as above but the rows are the same.
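If the row order of the intended result matters, one possible way to recover it is to keep the original row labels through the melt and sort on them afterwards. This is a sketch, not part of the answer above; it assumes pandas 1.1 or later (where melt accepts ignore_index) and the name ordered is introduced only here.

```python
import pandas as pd

df = pd.DataFrame([
    {"explode1": ["a", "e", "i"], "word": "US_12", "explode2": []},
    {"explode1": [], "word": "US_34", "explode2": ["a", "e", "i"]},
    {"explode1": ["a", "e", "i"], "word": "US_56", "explode2": ["o", "u"]},
])

ordered = (
    df.melt(id_vars="word", var_name="exploded_type",
            value_name="exploded_alphabet",
            ignore_index=False)                # keep the original row labels
      .explode("exploded_alphabet")
      .dropna(subset=["exploded_alphabet"])
      .sort_index(kind="mergesort")            # stable: keeps explode1 before explode2
      .reset_index(drop=True)
)
print(ordered)
```

The stable sort groups the rows by their original position while preserving the column order that melt produced within each group.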

answered Nov 11 '22 07:11

chsws


Tags: python, python-3.x, pandas, dataframe, numpy