I don't have a general performance issue with Pandas merge, as other posts suggest, but I have a class with a lot of methods that do a lot of merges on datasets.
The class has around 10 groupby and around 15 merge calls. While the groupby is pretty fast, of the class's total execution time of 1.5 seconds, around 0.7 seconds goes into those 15 merge calls.
I want to speed up those merge calls. Since I will run around 4000 iterations, saving 0.5 seconds per iteration would cut the total runtime by roughly 30 minutes, which would be great.
Any suggestions I should try? I tried Cython and Numba, and Numba was slower.
Thanks
Edit 1: Adding sample code snippets. My merge statements:
tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')
And, by switching to joins, I use the following statements:
dat = self.data.set_index('APPT_NBR')
t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)
tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')
tmpDf.reset_index(inplace=True)
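Note that DataFrame.join also accepts a list of DataFrames, so a chain of joins on the same index can be collapsed into a single call, which avoids building intermediate frames. A minimal sketch, using small hypothetical stand-ins for self.data and t1/t2 (keyed on APPT_NBR as above):

```python
import pandas as pd

# Hypothetical stand-ins for self.data and t1..t5, all keyed on APPT_NBR.
dat = pd.DataFrame({'APPT_NBR': [1, 2, 3], 'v0': [10, 20, 30]}).set_index('APPT_NBR')
t1 = pd.DataFrame({'v1': [100, 200]}, index=pd.Index([1, 2], name='APPT_NBR'))
t2 = pd.DataFrame({'v2': [7, 8]}, index=pd.Index([2, 3], name='APPT_NBR'))

# One call instead of a chain of left joins.
tmpDf = dat.join([t1, t2], how='left').reset_index()
```

Whether this is measurably faster than the chained form depends on the data, so it is worth profiling both.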
Note: all of these are part of a function named def merge_earlier_created_values(self):
And when I timed the call with timedcall from profilehooks:

@timedcall(immediate=True)
def merge_earlier_created_values(self):
I get the following results.

Profiling that method with:

@profile(immediate=True)
def merge_earlier_created_values(self):
gives the following. The profiling of the function using merge is as follows:
*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times

71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
    14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
    14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
    19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
     7    0.283    0.040    0.283    0.040 {built-in method gc.collect}
    15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
    10    0.000    0.000    0.153    0.015 merge.py:26(merge)
    10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
   8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
     4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
     1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
     1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
     1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
    10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
    10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
   120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
   266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
   120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
   120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
     1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
    10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
    10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
    25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
    74    0.023    0.000    0.023    0.000 {pandas.algos.take_2d_axis1_object_object}
    50    0.022    0.000    0.022    0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
   120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
    88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
     1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
    39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
   537    0.017    0.000    0.017    0.000 {built-in method numpy.core.multiarray.empty}
    15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
    15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
    24    0.000    0.000    0.014    0.001 internals.py:3625(take)
    10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
    10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
    15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
   234    0.000    0.000    0.013    0.000 common.py:158(isnull)
   234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
    15    0.000    0.000    0.012    0.001 generic.py:1609(take)
    20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)
The profiling of the function using join is as follows:
65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
    14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
    14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
    19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
     7    0.293    0.042    0.293    0.042 {built-in method gc.collect}
    10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
    10    0.000    0.000    0.139    0.014 merge.py:26(merge)
   8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
     4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
    10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
     5    0.000    0.000    0.122    0.024 frame.py:4324(join)
     5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
     1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
     1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
     1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
    10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
    10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
   100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
   205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
   100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
   100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
     1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
    10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
    52    0.031    0.001    0.031    0.001 {pandas.algos.take_2d_axis1_object_object}
     5    0.000    0.000    0.030    0.006 base.py:2329(join)
    37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
     6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
     7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
     2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
     7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
     7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
    84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
    19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
   100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
     1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
    15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
    34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
    83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
    68    0.015    0.000    0.015    0.000 {method 'copy' of 'numpy.ndarray' objects}
    15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
As you can see, merge is faster than join here (0.524 s vs. 0.550 s). The difference per call is small, but over 4000 iterations that small value adds up to a large number of minutes.
Thanks
The fastest ways: as it turns out, join consistently performs well, and merge performs almost exactly the same provided its syntax is optimal (i.e. it joins on indexes).
We can use either pandas.merge() or DataFrame.merge() to merge multiple DataFrames. Merging DataFrames is similar to a SQL join and supports the different join types: inner, left, right, outer, and cross.
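As a quick sketch of those join types on two tiny hypothetical frames (note that how='cross' requires pandas >= 1.2):

```python
import pandas as pd

a = pd.DataFrame({'k': [1, 2, 3], 'x': ['a', 'b', 'c']})
b = pd.DataFrame({'k': [2, 3, 4], 'y': ['B', 'C', 'D']})

inner = pd.merge(a, b, on='k', how='inner')  # keys present in both: 2, 3
left = pd.merge(a, b, on='k', how='left')    # all keys from a: 1, 2, 3
outer = pd.merge(a, b, on='k', how='outer')  # union of keys: 1, 2, 3, 4
```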
Pandas join vs. merge differences: the main difference is that join() combines two DataFrames on the index rather than on columns, whereas merge() is primarily used to specify the columns to join on; merge() also supports joining on indexes and on a combination of index and columns.
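The two are therefore interchangeable once the key column is moved into the index. A minimal sketch with two hypothetical frames showing the equivalence:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

# merge joins on columns by default...
m = left.merge(right, on='key', how='left')

# ...while join combines on the index, so the key must be moved there first.
j = left.set_index('key').join(right.set_index('key'), how='left').reset_index()

assert m.equals(j)
```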
set_index on the merging column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.
import pandas as pd
import numpy as np

myids = np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0, 1000, (1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, (1000000))

%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
#1 loop, best of 3: 664 ms per loop

%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
#1 loop, best of 3: 354 ms per loop

%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
x = df1.join(df2, how='left')
#10 loops, best of 3: 80.4 ms per loop
Even when the join column's integers are not in the same order in both tables, you can still expect a speed-up of around 8x.
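Part of the win comes from join reusing a precomputed index; sorting that index once can help further, since pandas can take faster code paths when both indexes are monotonic. A hedged sketch of the idea, on small synthetic frames (the sizes and names here are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = rng.permutation(100_000)  # shuffled integer keys

df1 = pd.DataFrame({'A': ids, 'B': rng.integers(0, 1000, ids.size)}).set_index('A')
df2 = pd.DataFrame({'A2': rng.permutation(ids), 'B2': rng.integers(0, 1000, ids.size)}).set_index('A2')

# Sort both indexes once up front; repeated joins can then use the
# monotonic (sorted-merge) path instead of hash-based alignment.
df1s = df1.sort_index()
df2s = df2.sort_index()
x = df1s.join(df2s, how='left')
```

As always with these micro-optimizations, measure on your own data; the benefit depends on key dtype, cardinality, and how many times the same indexes are reused.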