Efficient chain merge in pandas

Tags:

I found that straightforward chain merging with pandas library is quite inefficient when you merge a lot of datasets with a big number of columns by the same column.

The root of the problem is the same as when we join a lot of str's dumb way:

joined = reduce(lambda a + b, str_list)

Instead of:

joined = ''.join(str_list)

Doing chain merge we're copying the dataset many times (in my case almost 100 times) instead of just filling columns from several datasets at once or in order.

Is there is some efficient way (= with linear complexity by the number of sets) to chain merge by the same column a lot of datasets?

572

asked Nov 12 '16 19:11

morph

1 Answers

If you have a list of your dataframes dfs:

dfs = [df1, df2, df3, ... , dfn]

you can join them using panda's concat function which as far as I can tell is faster than chaining merge. concat only joins dataframes based on an index (not a column) but with a little pre-processing you can simulate a merge operation.

First replace the index of each of your dataframes in dfs with the column you want to merge on. Lets say you want to merge on column "A":

dfs = [df.set_index("A", drop=True) for df in dfs]

Note that this will overwrite the previous indices (merge would do this anyway) so you might want to save these indices somewhere (if you are going to need them later for some reason).

Now we can use concat which will essentially merge on the index (which is actually your column!!)

merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)

The join= argument can either be 'inner' or 'outer' (default). The copy= argument keeps concat from making unnecessary copies of your dataframes.

You can then either leave "A" as the index or you can make it back into a column by doing:

merged.reset_index(drop=False, inplace=True)

The keys= argument is optional and assigns a key value to each dataframe (in this case I gave it a range of integers but you could give them other labels if you want). This allows you to access columns from the original dataframes. So if you wanted to get the columns that correspond to the 20th dataframe in dfs you can call:

merged[20]

Without the keys= argument it can get confusing which rows are from which dataframes, especially if they have the same column names.

I'm still not entirely sure if concat runs in linear time but it is definitely faster than chaining merge:

using ipython's %timeit on lists of randomly generated dataframes (lists of 10, 100 and 1000 dataframes):

def merge_with_concat(dfs, col):                                             
    dfs = [df.set_index(col, drop=True) for df in dfs]
    merged = pd.concat(dfs, axis=1, keys=range(len(dfs)), join='outer', copy=False)
    return merged

dfs10 = [pd.util.testing.makeDataFrame() for i in range(10)] 
dfs100 = [pd.util.testing.makeDataFrame() for i in range(100)] 
dfs1000 = [pd.util.testing.makeDataFrame() for i in range(1000)] 

%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs10)
10 loops, best of 3: 45.8 ms per loop
%timeit merge_with_concat(dfs10,"A")
100 loops, best of 3: 11.7 ms per loop

%timeit merge_with_concat(dfs100,"A")
10 loops, best of 3: 139 ms per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs100)
1 loop, best of 3: 1.55 s per loop

%timeit merge_with_concat(dfs1000,"A")
1 loop, best of 3: 9.67 s per loop
%timeit reduce(lambda df1, df2: df1.merge(df2, on="A", how='outer'), dfs1000)
# I killed it after about 5 minutes so the other one is definitely faster

answered Sep 28 '22 12:09

bunji

Related questions
                            
                                Python numpy scientific notation limit decimals
                            
                                How do I execute multiple shell commands with a single python subprocess call?
                            
                                bokeh server get mouse position
                            
                                Use Flask-SQLAlchemy models in Jupyter notebook
                            
                                Replacing the existing MainWindow with a new window with Python, PyQt, Qt Designer
                            
                                Testing of existence of a repeated field in a protobuff message
                            
                                Fastest way to concatenate multiple files column wise - Python
                            
                                How to test if WindowStaysOnTopHint flag is set in windowFlags?
                            
                                Putting a vector as a Pandas data frame element
                            
                                How do I open a text file in Python?
                            
                                Get row with maximum value from groupby with several columns in PySpark
                            
                                Convert numpy array to list of datetimes
                            
                                Locking entities to perform get-update-set operations in Google Cloud datastore
                            
                                ruamel.yaml equivalent of sort_keys?
                            
                                Fill matrix diagonal with different values for each python numpy
                            
                                Python unit test for a function that has try/except
                            
                                List of dictionaries from numpy array without for loop
                            
                                Close a figure - PyCharm
                            
                                Markov Chain: Finding terminal state calculation
                            
                                Is there possibility to get releases with pyGithub

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficient chain merge in pandas

Tags:

python

merge

pandas

morph

People also ask

1 Answers

bunji

Recent Activity

Donate For Us