What is the running time (big "O" order) of pandas DataFrame.join?

This question is more conceptual/theoretical (it concerns run times for very large datasets), so I apologize for not having a minimal example to show.

I have a bunch of DataFrames from two different sensors that I need to eventually concatenate into two very large DataFrames (df_snsr1 and df_snsr2), and then left-join into a single DataFrame. My data is such that I can also join first and then concatenate, or some combination. I am trying to figure out the most efficient way to do this.

From reading this SO answer I know that pandas.concat allocates space for the concatenation of all of its dataframes, and if you do this in a loop it can lead to O(N**2) copying and some major slowdowns. Thus I am currently first building a big list of dataframes (loaded from files), concatenating them all at once, and then joining the two big dataframes:

df_list = []
for file in my_pickle_files_snsr1:  # O(M) loop over M files
    df_list.append(pd.read_pickle(file))  # O(1) append, M times
df_snsr1 = pd.concat(df_list)  # O(N) copies of N records
# repeat for sensor 2 (df_snsr2)
df_snsr1.join(df_snsr2, on=['some', 'columns'])  # O(dunno, maybe bears?)
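
For contrast, here is a sketch of the pattern I am avoiding: growing the result inside the loop re-copies everything accumulated so far on every iteration.

df_snsr1 = pd.DataFrame()
for file in my_pickle_files_snsr1:
    # each iteration copies all previously accumulated rows again -> roughly O(N**2) total copying
    df_snsr1 = pd.concat([df_snsr1, pd.read_pickle(file)])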

I am unable to find anything about execution speed in the documentation on pandas.DataFrame.join. Is it O(N)? O(N**2)? My thought is that if it is of similar order to pandas.concat, then it really doesn't matter which order I do the two operations in. If it is O(N**2), however, then it will likely be more efficient for me to join many small dataframes and then concatenate them (sketched below) rather than concat and then join. The overall operation takes long enough that it is worthwhile for me to ask the question here, so "run it and see" isn't going to work.
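
To make that alternative concrete, joining per-file and concatenating at the end would look roughly like this (a sketch; it assumes the files from the two sensors pair up one-to-one, which is close to my situation):

joined_list = []
for file1, file2 in zip(my_pickle_files_snsr1, my_pickle_files_snsr2):
    # M small joins instead of one big one
    joined_list.append(pd.read_pickle(file1).join(pd.read_pickle(file2), on=['some', 'columns']))
df_all = pd.concat(joined_list)  # single O(N) concat at the end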

Does anybody know what algorithm join is using and what its execution big-O order is? Or does anybody have any other suggestions on getting the most-efficient combination of join and concat?

asked Aug 06 '18 by Engineero

1 Answer

I think it depends on the options you pass to join (e.g. the type of join and whether to sort).

When using the default how='left', it appears that the result is sorted, at least for a single index (the docs only specify the order of the output for some of the how methods, and 'left' isn't one of them). In any case, sorting is O(n log n), each index lookup is O(1), and there are O(n) of them, so in that case O(n log n) dominates.

By contrast, in the how='inner' case, it is specified that the order of the calling DataFrame is kept. In that case, we would expect O(n) (both for a possible set intersection of the keys and for the index lookups and insertions).
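
For reference, the knobs in question look like this on toy frames (a minimal sketch; the frames and key values are made up purely for illustration):

import pandas as pd

a = pd.DataFrame({'v': [1, 2, 3]}, index=pd.Index(['c', 'a', 'b'], name='name'))
b = pd.DataFrame({'w': [10, 20]}, index=pd.Index(['b', 'c'], name='name'))

a.join(b, how='left')             # default: keep every row of a
a.join(b, how='inner')            # keep only keys present in both frames
a.join(b, how='left', sort=True)  # sort the result lexicographically by the join key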

In either case, as the size grows, cache locality (or the lack of it) starts creeping up on you, and the time spent randomly accessing a large area of memory will start to dominate. The above concerns only the operation complexity.

As mentioned elsewhere, for larger datasets, Dask (or Spark) is the way to go.
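
If you go the Dask route, the join itself reads almost the same; a minimal sketch, assuming the two big frames already exist in memory (the partition count is just a placeholder to tune):

import dask.dataframe as dd

# Wrap the already-built pandas frames in Dask collections.
ddf1 = dd.from_pandas(df_snsr1, npartitions=16)
ddf2 = dd.from_pandas(df_snsr2, npartitions=16)

# Lazy index join; the work happens in parallel (and possibly out of core) at compute().
joined = ddf1.join(ddf2, how='left', rsuffix='_r').compute()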


But what do you say we test it (at least the how='left' case)? The code below is a bit more verbose than I would have liked (and the name generation is just plain silly), but it does just that. Essentially, it makes two DFs with random names, unordered, and with a fraction 1 - replace_fraction of the keys in common; then it joins them while measuring the time used.

import time

import numpy as np
import pandas as pd
from IPython.core.magics.execution import _format_time as walltime

def make_names(n):
    # Generate n random string keys like 'foobar1234' (collisions are possible; that's fine here).
    names = [
        f'{x}{y}{z}' for (x, y), z in zip(
            np.random.choice(['foo', 'bar', 'hi'], (n, 2)),
            np.random.randint(0, n, size=n))
    ]
    return names

def work(n, replace_fraction=0.1):
    # Build two size-n DataFrames whose keys overlap by a fraction 1 - replace_fraction,
    # then time a left join on the index.
    a_names = make_names(n)
    replace_n = int(n * replace_fraction)
    b_names = make_names(replace_n) + list(np.random.choice(a_names, size=n - replace_n, replace=False))
    np.random.shuffle(b_names)
    a = pd.DataFrame({
        'name': a_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')
    b = pd.DataFrame({
        'name': b_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')

    t0 = time.time()
    df = a.join(b, rsuffix='_r')  # default how='left', joining on the index
    dt = time.time() - t0
    return a, b, df, dt

Example: try work(4, .5).

Now, get some time measurements for a geometric series of sizes:

sizes = (2**np.arange(10, 23, .5)).astype(int)  # 1024 up to ~5.9M rows, in steps of sqrt(2)
times = []
for n in sizes:
    a, b, df, dt = work(n)
    times.append(dt)
    print(f'{n}: {walltime(dt)}')

# out:
1024: 2.9 ms
1448: 4.78 ms
2048: 4.37 ms
...
2965820: 18.2 s
4194304: 30.2 s
5931641: 44.8 s

Fit for n log n:

import matplotlib.pyplot as plt
from numpy.polynomial.polynomial import polyfit

n = np.array(sizes)
t = np.array(times)
b, m = polyfit(n * np.log(n), t, 1)  # coefficients come back lowest degree first: intercept b, slope m

plt.plot(n/1e6, t, '.')
plt.plot(n/1e6, b + m * n * np.log(n), '-')
plt.xlabel('size [M]')
plt.ylabel('time [s]')
plt.show()

(Plot: measured join time [s] vs. size [millions of rows], with the fitted n log n curve overlaid.)

(side note: scipy.optimize.nnls with all terms n, log n, n log n, 1 finds all coefficients 0 except for n log n, so the above is fine).
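
For completeness, that side check can be done roughly like this (a sketch reusing the n and t arrays from above):

from scipy.optimize import nnls

# One column per candidate term: n, log n, n*log n, and a constant.
A = np.column_stack([n, np.log(n), n * np.log(n), np.ones_like(t)])
coefs, residual = nnls(A, t)  # non-negative least squares
print(dict(zip(['n', 'log n', 'n log n', '1'], coefs)))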

answered Nov 15 '22 by Pierre D