 

Improve Pandas Merge performance

I don't specifically have a performance issue with Pandas merge, as other posts suggest, but I have a class with a lot of methods that perform a lot of merges on datasets.

The class has around 10 groupby calls and around 15 merge calls. While groupby is pretty fast, out of the class's total execution time of 1.5 seconds, around 0.7 seconds goes into those 15 merge calls.

I want to speed up those merge calls. Since I will run around 4000 iterations, saving 0.5 seconds per iteration would save roughly 0.5 × 4000 = 2000 seconds, i.e. around 30 minutes overall, which would be great.

Any suggestions on what I should try? I have tried Cython and Numba, and Numba was slower.

Thanks

Edit 1: Adding sample code snippets. My merge statements:

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')
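
For readability only (not a performance change), the same chain of left merges can be written with functools.reduce. This is a sketch, assuming self.data and t1..t5 are defined as above:

from functools import reduce

frames = [t1, t2, t3, t4, t5]
# Fold pd.merge over the list, starting from self.data; equivalent to the chain above.
tmpDf = reduce(
    lambda left, right: pd.merge(left, right, on='APPT_NBR', how='left'),
    frames,
    self.data,
)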

And, by switching to joins, I use the following statements instead:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')

tmpDf.reset_index(inplace=True)
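
One more variant worth timing: DataFrame.join also accepts a list of DataFrames, so the five joins can be collapsed into a single call and the four intermediate frames are avoided. A sketch, assuming t1..t5 are already indexed on APPT_NBR as above and their indexes are unique:

dat = self.data.set_index('APPT_NBR')
# Join all five indexed frames in one call instead of chaining five joins.
tmpDf = dat.join([t1, t2, t3, t4, t5], how='left')
tmpDf.reset_index(inplace=True)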

Note: all of these are part of a function named def merge_earlier_created_values(self):

And when I timed the call with timedcall from profilehooks, as follows:

@timedcall(immediate=True)
def merge_earlier_created_values(self):

I get the following timing results.

The profiling of that method was done with:

@profile(immediate=True)
def merge_earlier_created_values(self):

The profiling of the function using merge is as follows:

*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times

71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
    14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
    14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
    19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
     7    0.283    0.040    0.283    0.040 {built-in method gc.collect}
    15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
    10    0.000    0.000    0.153    0.015 merge.py:26(merge)
    10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
   8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
     4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
     1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
     1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
     1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
    10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
    10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
   120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
   266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
   120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
   120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
     1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
    10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
    10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
    25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
    74    0.023    0.000    0.023    0.000 {pandas.algos.take_2d_axis1_object_object}
    50    0.022    0.000    0.022    0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
   120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
    88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
     1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
    39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
   537    0.017    0.000    0.017    0.000 {built-in method numpy.core.multiarray.empty}
    15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
    15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
    24    0.000    0.000    0.014    0.001 internals.py:3625(take)
    10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
    10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
    15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
   234    0.000    0.000    0.013    0.000 common.py:158(isnull)
   234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
    15    0.000    0.000    0.012    0.001 generic.py:1609(take)
    20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)

The profiling using join is as follows:

65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
    14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
    14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
    19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
     7    0.293    0.042    0.293    0.042 {built-in method gc.collect}
    10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
    10    0.000    0.000    0.139    0.014 merge.py:26(merge)
   8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
     4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
    10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
     5    0.000    0.000    0.122    0.024 frame.py:4324(join)
     5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
     1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
     1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
     1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
    10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
    10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
   100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
   205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
   100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
   100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
     1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
    10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
    52    0.031    0.001    0.031    0.001 {pandas.algos.take_2d_axis1_object_object}
     5    0.000    0.000    0.030    0.006 base.py:2329(join)
    37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
     6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
     7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
     2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
     7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
     7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
    84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
    19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
   100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
     1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
    15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
    34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
    83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
    68    0.015    0.000    0.015    0.000 {method 'copy' of 'numpy.ndarray' objects}
    15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)

As you can see, merge is faster than join here. The difference is small, but over 4000 iterations that small difference adds up to minutes.

Thanks

Asked Nov 29 '16 by Debasish Kanhar


People also ask

Is merge faster than join in Pandas?

As it turns out, join always tends to perform well, and merge will perform almost exactly the same, provided the syntax is optimal.

Can you merge more than two DataFrames in Pandas?

We can use either pandas.merge() or DataFrame.merge() to merge multiple DataFrames. Merging multiple DataFrames is similar to a SQL join and supports the different join types: inner, left, right, outer, and cross.
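
For illustration, a minimal, self-contained sketch (with hypothetical frames df_a, df_b, df_c) that merges several DataFrames on a shared key by folding pd.merge over a list:

import pandas as pd
from functools import reduce

# Small illustrative frames (hypothetical data).
df_a = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
df_b = pd.DataFrame({'key': [1, 2, 4], 'b': [10, 20, 40]})
df_c = pd.DataFrame({'key': [1, 3, 4], 'c': [0.1, 0.3, 0.4]})

# Chain pd.merge over the list; 'outer' keeps keys that appear in only some frames.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='key', how='outer'),
    [df_a, df_b, df_c],
)
print(merged)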

What is the difference between Pandas merge and join?

The main difference is that join() combines two DataFrames on the index rather than on columns, whereas merge() is primarily used to specify the columns you want to join on; merge() also supports joining on indexes and on a combination of index and columns.
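
A minimal sketch (with hypothetical frames left and right) showing the difference in practice:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

# merge() works on columns by default.
m = left.merge(right, on='key', how='left')

# join() works on the index, so the key has to be moved into the index first.
j = left.set_index('key').join(right.set_index('key'), how='left')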


1 Answer

set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.

import pandas as pd
import numpy as np

myids = np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0, 1000, (1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, (1000000))

%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
# 1 loop, best of 3: 664 ms per loop

%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
# 1 loop, best of 3: 354 ms per loop

%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
# Wall time: 16 ms

%%timeit
x = df1.join(df2, how='left')
# 10 loops, best of 3: 80.4 ms per loop

Even when the join column contains integers that are not in the same order in both tables, you can still expect a speed-up of around 8x.
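
If a one-off sorting cost is acceptable, it may also be worth sorting both indexes first, since pandas can take a faster alignment path when the indexes are monotonic. A sketch on the same df1/df2 as above (not benchmarked here; timings will vary):

# Sort once, outside any hot loop (assumption: the sort cost is paid only once).
df1s = df1.sort_index()
df2s = df2.sort_index()

%timeit x = df1s.join(df2s, how='left')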

Answered Sep 24 '22 by Reisepass