What is the running time (big "O" order) of pandas DataFrame.join?

This question is more conceptual/theoretical (it concerns run times for very large datasets), so I apologize for not having a minimal example to show.

I have a bunch of DataFrames from two different sensors that I need to eventually concatenate into two very large DataFrames (df_snsr1 and df_snsr2), and then left-join into a single DataFrame. My data is such that I can also join first and then concatenate, or some combination. I am trying to figure out the most efficient way to do this.

From reading this SO answer I know that pandas.concat allocates space for the concatenation of all of its dataframes, and if you do this in a loop it can lead to O(N**2) copying and some major slowdowns. Thus I am currently first building a big list of dataframes (loaded from files), concatenating them all at once, and then joining the two big dataframes:

df_list = []
for file in my_pickle_files_snsr1:  # O(M) loop over M files
    df_list.append(pd.read_pickle(file))  # O(1) append, M times
df_snsr1 = pd.concat(df_list)  # O(N) copies of N records
# repeat for sensor 2 (df_snsr2)
df_snsr1.join(df_snsr2, on=['some', 'columns'])  # O(dunno, maybe bears?)
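
For contrast, here is a sketch of the pattern I am avoiding: growing the result inside the loop re-copies everything accumulated so far on every iteration.

df_snsr1 = pd.DataFrame()
for file in my_pickle_files_snsr1:
    # each iteration copies all previously accumulated rows again -> roughly O(N**2) total copying
    df_snsr1 = pd.concat([df_snsr1, pd.read_pickle(file)])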

I am unable to find anything about execution speed in the documentation on pandas.DataFrame.join. Is it O(N)? O(N**2)? My thought is that if it is of similar order to pandas.concat, then it really doesn't matter which order I do the two operations in. If it is O(N**2), however, then it will likely be more efficient for me to join many small dataframes and then concatenate them (sketched below) rather than concat and then join. The overall operation takes long enough that it is worthwhile for me to ask the question here, so "run it and see" isn't going to work.
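
To make that alternative concrete, joining per-file and concatenating at the end would look roughly like this (a sketch; it assumes the files from the two sensors pair up one-to-one, which is close to my situation):

joined_list = []
for file1, file2 in zip(my_pickle_files_snsr1, my_pickle_files_snsr2):
    # M small joins instead of one big one
    joined_list.append(pd.read_pickle(file1).join(pd.read_pickle(file2), on=['some', 'columns']))
df_all = pd.concat(joined_list)  # single O(N) concat at the end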

Does anybody know what algorithm join is using and what its execution big-O order is? Or does anybody have any other suggestions on getting the most-efficient combination of join and concat?

asked Aug 06 '18 by Engineero

1 Answer

I think it depends on the options you pass to join (e.g. the type of join and whether to sort).

When using the default how='left', it appears that the result is sorted, at least for a single index (the docs only specify the order of the output for some of the how methods, and 'left' isn't one of them). In any case, sorting is O(n log n), each index lookup is O(1), and there are O(n) of them, so in that case O(n log n) dominates.

By contrast, in the how='inner' case, it is specified that the order of the calling DataFrame is kept. In that case, we would expect O(n) (both for a possible set intersection of the keys and for the index lookups and insertions).
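
For reference, the knobs in question look like this on toy frames (a minimal sketch; the frames and key values are made up purely for illustration):

import pandas as pd

a = pd.DataFrame({'v': [1, 2, 3]}, index=pd.Index(['c', 'a', 'b'], name='name'))
b = pd.DataFrame({'w': [10, 20]}, index=pd.Index(['b', 'c'], name='name'))

a.join(b, how='left')             # default: keep every row of a
a.join(b, how='inner')            # keep only keys present in both frames
a.join(b, how='left', sort=True)  # sort the result lexicographically by the join key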

In either case, as the size grows, cache locality (or the lack of it) starts creeping up on you, and the time spent randomly accessing a large area of memory will start to dominate. The above concerns only the operation complexity.

As mentioned elsewhere, for larger datasets, Dask (or Spark) is the way to go.
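
If you go the Dask route, the join itself reads almost the same; a minimal sketch, assuming the two big frames already exist in memory (the partition count is just a placeholder to tune):

import dask.dataframe as dd

# Wrap the already-built pandas frames in Dask collections.
ddf1 = dd.from_pandas(df_snsr1, npartitions=16)
ddf2 = dd.from_pandas(df_snsr2, npartitions=16)

# Lazy index join; the work happens in parallel (and possibly out of core) at compute().
joined = ddf1.join(ddf2, how='left', rsuffix='_r').compute()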


But what do you say we test it (at least the how='left' case)? The code below is a bit more verbose than I would have liked (and the name generation is just plain silly), but it does just that. Essentially, it makes two DFs with random names, unordered, and with a fraction 1 - replace_fraction of the keys in common; then it joins them while measuring the time used.

import time

import numpy as np
import pandas as pd
from IPython.core.magics.execution import _format_time as walltime

def make_names(n):
    # Generate n random string keys like 'foobar1234' (collisions are possible; that's fine here).
    names = [
        f'{x}{y}{z}' for (x, y), z in zip(
            np.random.choice(['foo', 'bar', 'hi'], (n, 2)),
            np.random.randint(0, n, size=n))
    ]
    return names

def work(n, replace_fraction=0.1):
    # Build two size-n DataFrames whose keys overlap by a fraction 1 - replace_fraction,
    # then time a left join on the index.
    a_names = make_names(n)
    replace_n = int(n * replace_fraction)
    b_names = make_names(replace_n) + list(np.random.choice(a_names, size=n - replace_n, replace=False))
    np.random.shuffle(b_names)
    a = pd.DataFrame({
        'name': a_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')
    b = pd.DataFrame({
        'name': b_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')

    t0 = time.time()
    df = a.join(b, rsuffix='_r')  # default how='left', joining on the index
    dt = time.time() - t0
    return a, b, df, dt

Example: try work(4, .5).

Now, get some time measurements for a geometric series of sizes:

sizes = (2**np.arange(10, 23, .5)).astype(int)  # 1024 up to ~5.9M rows, in steps of sqrt(2)
times = []
for n in sizes:
    a, b, df, dt = work(n)
    times.append(dt)
    print(f'{n}: {walltime(dt)}')

# out:
1024: 2.9 ms
1448: 4.78 ms
2048: 4.37 ms
...
2965820: 18.2 s
4194304: 30.2 s
5931641: 44.8 s

Fit for n log n:

import matplotlib.pyplot as plt
from numpy.polynomial.polynomial import polyfit

n = np.array(sizes)
t = np.array(times)
b, m = polyfit(n * np.log(n), t, 1)  # coefficients come back lowest degree first: intercept b, slope m

plt.plot(n/1e6, t, '.')
plt.plot(n/1e6, b + m * n * np.log(n), '-')
plt.xlabel('size [M]')
plt.ylabel('time [s]')
plt.show()

(Plot: measured join time [s] vs. size [millions of rows], with the fitted n log n curve overlaid.)

(side note: scipy.optimize.nnls with all terms n, log n, n log n, 1 finds all coefficients 0 except for n log n, so the above is fine).
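
For completeness, that side check can be done roughly like this (a sketch reusing the n and t arrays from above):

from scipy.optimize import nnls

# One column per candidate term: n, log n, n*log n, and a constant.
A = np.column_stack([n, np.log(n), n * np.log(n), np.ones_like(t)])
coefs, residual = nnls(A, t)  # non-negative least squares
print(dict(zip(['n', 'log n', 'n log n', '1'], coefs)))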

answered Nov 15 '22 by Pierre D