Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is pd.concat({}, axis=1) slower than pd.concat({}, axis=0).unstack(0)?

Consider this dictionary of pandas series. The index on all series are integers and have some potential overlap, but certainly do not coincide. I made an observation that pd.concat seems slow when combining things along axis=1 when I have large indices, lots of non-overlap, and many items to concatenate. It prompted me to leave axis=0 and subsequently unstack(). I end up with the same exact result. But unstacking is quicker.

Does anyone have a handle on why this would be the case?

I get that concatenating series on top of one another should be quick, but I would have guessed that the unstacking processes would have been near identical as pd.concat(axis=1).

dict_of_series = {
    's%s' % i: pd.Series(
        1, np.unique(np.random.randint(1000, 10000, size=1000))
    ) for i in range(100)
}

%%timeit
pd.concat(dict_of_series, axis=0).unstack(0)

10 loops, best of 3: 29.6 ms per loop

%%timeit
pd.concat(dict_of_series, axis=1)

10 loops, best of 3: 43.1 ms per loop

like image 300
piRSquared Avatar asked Dec 22 '16 23:12

piRSquared


1 Answers

A quick profiling shows that a lot of time in the call pd.concat(dict_of_series, axis=1) is consumed by pd.Index.union. It is called on each index one by one inside an internal function _union_indexes - similarly as in my function union below.

On the other hand, pd.concat(dict_of_series, axis=0).unstack(0) makes use of some clever optimization, which I didn't try to analyze. Anyway, there is no call to _union_indexes or even pd.Index.union there.

Can '_union_indexes' be a culprit? Very likely. I took just the indexes:

idxs = [d.index for d in dict_of_series.values()]

and compared the following two functions:

def union(idxs):
    ans = pd.Index([])
    for idx in idxs:
        ans = ans.union(idx)
    return ans

def union_multi(idxs):
    arr0 = np.concatenate(idxs)
    arr1 = np.zeros(arr0.size)
    ans = pd.MultiIndex.from_arrays([arr0, arr1])
    return ans.levels[0]

union is an equivalent of _union_indexes, while union_multi has some multi-index overhead but avoids a call to pd.Index.union.

My rough timings:

>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=1)
1 loop, best of 1: 82.9 ms per loop
>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=0).unstack(0)
1 loop, best of 1: 57.9 ms per loop

>>> %timeit -n1 -r1 union(idxs)
1 loop, best of 1: 32.8 ms per loop
>>> %timeit -n1 -r1 union_multi(idxs)
1 loop, best of 1: 12.5 ms per loop

A simple pd.Index(np.unique(np.concatenate(idxs))) is slightly faster but it doesn't work for mixed types.

(BTW, pd.concat works only if all indexes are unique.)

like image 140
ptrj Avatar answered Oct 05 '22 23:10

ptrj