Consider this dictionary of pandas Series. The indexes on all series are integers and have some potential overlap, but they certainly do not coincide. I noticed that pd.concat seems slow when combining things along axis=1 when I have large indexes, lots of non-overlap, and many items to concatenate. That prompted me to concatenate along axis=0 instead and subsequently unstack(). I end up with the exact same result, but the unstacking route is quicker.
Does anyone have a handle on why this would be the case? I get that concatenating series on top of one another should be quick, but I would have guessed that the unstacking process would be nearly identical to pd.concat(axis=1).
import numpy as np
import pandas as pd

# 100 series of ones, each indexed by ~1000 unique random integers in [1000, 10000)
dict_of_series = {
    's%s' % i: pd.Series(
        1, np.unique(np.random.randint(1000, 10000, size=1000))
    ) for i in range(100)
}
%%timeit
pd.concat(dict_of_series, axis=0).unstack(0)
10 loops, best of 3: 29.6 ms per loop
%%timeit
pd.concat(dict_of_series, axis=1)
10 loops, best of 3: 43.1 ms per loop
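As a sanity check that the two routes really produce the same frame (a minimal sketch; column order can in principle differ, so both axes are sorted before comparing):

# Both approaches should yield the same values once rows and columns are aligned.
df_unstack = pd.concat(dict_of_series, axis=0).unstack(0)
df_axis1 = pd.concat(dict_of_series, axis=1)
assert df_unstack.sort_index(axis=1).equals(df_axis1.sort_index(axis=1))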
A quick profiling shows that a lot of the time in the call pd.concat(dict_of_series, axis=1) is consumed by pd.Index.union. It is called on each index, one by one, inside an internal function _union_indexes, much like my function union below. On the other hand, pd.concat(dict_of_series, axis=0).unstack(0) makes use of some clever optimization which I didn't try to analyze; in any case, there is no call to _union_indexes or even pd.Index.union there.
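For reference, one way to reproduce this kind of profile (a minimal sketch using the standard cProfile module; the exact names shown in the output depend on the pandas version):

import cProfile, pstats

# Profile the axis=1 concatenation; in the profile described above, a large share
# of the cumulative time is attributed to pd.Index.union via _union_indexes.
profiler = cProfile.Profile()
profiler.runcall(pd.concat, dict_of_series, axis=1)
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)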
Can _union_indexes be the culprit? Very likely. I took just the indexes:
idxs = [d.index for d in dict_of_series.values()]
and compared the following two functions:
def union(idxs):
    # Mimics _union_indexes: fold pd.Index.union over the list, one index at a time.
    ans = pd.Index([])
    for idx in idxs:
        ans = ans.union(idx)
    return ans

def union_multi(idxs):
    # Build a MultiIndex from all the values plus a dummy second level;
    # its first level is the sorted set of unique values, i.e. the union.
    arr0 = np.concatenate(idxs)
    arr1 = np.zeros(arr0.size)
    ans = pd.MultiIndex.from_arrays([arr0, arr1])
    return ans.levels[0]
union is an equivalent of _union_indexes, while union_multi carries some MultiIndex overhead but avoids calling pd.Index.union.
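A quick sanity check (a sketch; for the all-integer indexes built above, both functions should return the same sorted union):

# Both functions should agree element for element on these integer indexes.
assert np.array_equal(union(idxs).values, union_multi(idxs).values)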
My rough timings:
>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=1)
1 loop, best of 1: 82.9 ms per loop
>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=0).unstack(0)
1 loop, best of 1: 57.9 ms per loop
>>> %timeit -n1 -r1 union(idxs)
1 loop, best of 1: 32.8 ms per loop
>>> %timeit -n1 -r1 union_multi(idxs)
1 loop, best of 1: 12.5 ms per loop
A simple pd.Index(np.unique(np.concatenate(idxs))) is slightly faster, but it doesn't work for mixed types. (By the way, pd.concat works only if all the indexes are unique.)
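For completeness, a minimal sketch of that alternative as a helper (hypothetical name; np.unique sorts the concatenated values, which is why mixed, non-comparable types fail):

def union_numpy(idxs):
    # Hypothetical helper: concatenate all index values and deduplicate with np.unique.
    # np.unique sorts, so indexes mixing non-comparable types (e.g. ints and strings)
    # will raise here, as noted above.
    return pd.Index(np.unique(np.concatenate(idxs)))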