 

pandas: optimizing my code (groupby() / apply())

Tags:

pandas

I have a dataframe of shape (RxC) 1.5M x 128. I do the following:

  1. I do groupby() based on 6 columns. This creates ~8700 sub-groups each of shape 538 x 122.
  2. On each sub-group, I run apply(). This function computes the % frequency of each categorical value per column (i.e., across all 122 non-group columns) in the sub-group.

So my (pseudo) code:

<df = Read dataframe from file>
g = df.groupby(grp_cols)
g[nongrp_cols].apply(lambda d: d.apply(lambda s: s.value_counts()) / len(d.index))

The code works, so now I'm profiling it to improve performance. The apply() step takes about 20-25 minutes to run. I believe the problem is that it iterates over every one of the 122 columns for each of the ~8700 subgroups, which may not be the best approach given the way I have coded it.

Can anyone recommend ways I can try to speed this up?

I tried using a Python multiprocessing pool (8 processes) to divide the subgroups into equal sets to process, but ended up getting a pickling error...

Thanks.

asked Jun 17 '15 by user4979733



1 Answer

pd.DataFrame.groupby.apply really gives us a lot of flexibility (unlike agg/filter/transform, it allows you to reshape each subgroup to any shape; in your case, from 538 x 122 to N_categories x 122). But that flexibility comes at a cost: the function is applied to each subgroup one by one in plain Python, with no vectorization.
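
To make that cost concrete, here is a minimal sketch of what the per-group function computes, using a made-up 6 x 3 toy subgroup (the column names 'a', 'b', 'c' are purely illustrative): each subgroup collapses into a table of per-column category frequencies, one row per category value.

import pandas as pd
import numpy as np

# toy "subgroup": 6 rows x 3 categorical columns with values 0 - 2
toy = pd.DataFrame(np.random.choice(3, size=(6, 3)), columns=['a', 'b', 'c'])

# % frequency of each categorical value per column, as in the question
freq = toy.apply(lambda s: s.value_counts()) / len(toy.index)
print(freq)   # rows = category values, columns = 'a', 'b', 'c'; NaN where a value never occurs in a column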

I still think the way to solve it is to use multiprocessing. The pickle error you encountered is most likely because you defined some functions inside your multiprocessing function. The rule is that any function you hand to a Pool must be defined at the top level of the module. See the sketch of the failure mode just below, and then the full working code.
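
A minimal sketch of that failure mode, with hypothetical names (col_sums, broken, works) that are not from your code: a function defined inside another function cannot be pickled, so Pool.map on it fails, while the same function defined at module level works.

import multiprocessing as mp
import pandas as pd

def col_sums(df):
    # defined at the TOP level of the module: picklable, fine for Pool.map
    return df.sum()

def broken(chunks):
    def col_sums_nested(df):
        # defined inside another function: cannot be pickled
        return df.sum()
    p = mp.Pool(2)
    return p.map(col_sums_nested, chunks)   # raises a pickling error

def works(chunks):
    p = mp.Pool(2)
    result = p.map(col_sums, chunks)        # runs fine
    p.close()
    return result

if __name__ == '__main__':
    chunks = [pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'x': [3, 4]})]
    print(works(chunks))       # two small Series of column sums
    # broken(chunks) would fail with a pickling error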

import pandas as pd
import numpy as np

# simulate your data with int 0 - 9 for categorical values
df = pd.DataFrame(np.random.choice(np.arange(10), size=(538, 122)))
# simulate your groupby operation; not quite the full 8700 sub-groups, just 800 groups for illustration
sim_keys = ['ROW' + str(x) for x in np.arange(800)]
big_data = pd.concat([df] * 800, axis=0, keys=sim_keys)

big_data.shape
Out[337]: (430400, 122)

# Without multiprocessing
# ===================================================
by_keys = big_data.groupby(level=0)

sample_group = list(by_keys)[0][1]
sample_group.shape

def your_func(g):
    return g.apply(lambda s: s.value_counts()) / len(g.index)

def test_no_multiprocessing(gb, apply_func):
    return gb.apply(apply_func)

%time result_no_multiprocessing = test_no_multiprocessing(by_keys, your_func)

CPU times: user 1min 26s, sys: 4.03 s, total: 1min 30s
Wall time: 1min 27s

Pretty slow here. Let's use multiprocessing module:

# multiprocessing for pandas dataframe apply
# ===================================================
# To avoid the pickle error, 'process' must be defined at the TOP level;
# moving it inside 'test_with_multiprocessing' raises a pickle error
def process(df):
    return df.groupby(level=0).apply(your_func)

def test_with_multiprocessing(big_data, apply_func):

    import multiprocessing as mp

    p = mp.Pool(processes=8)
    # split it into 8 chunks
    split_dfs = np.array_split(big_data, 8, axis=0)
    # define the mapping function, wrapping it to take just df as input
    # apply to each chunk
    df_pool_results = p.map(process, split_dfs)

    p.close()

    # combine together
    result = pd.concat(df_pool_results, axis=0)

    return result


%time result_with_multiprocessing = test_with_multiprocessing(big_data, your_func)

CPU times: user 984 ms, sys: 3.46 s, total: 4.44 s
Wall time: 22.3 s

Now it's much faster, especially in CPU time. Although there is some overhead when we split the data and recombine the results, we can expect roughly a 4-6x speedup over the non-multiprocessing case on an 8-core processor.

Finally, check whether two results are the same.

import pandas.util.testing as pdt

pdt.assert_frame_equal(result_no_multiprocessing, result_with_multiprocessing)

It passes the test beautifully.

answered Oct 16 '22 by Jianxun Li