I ran performance tests that build the same pd.MultiIndex using different class methods:
import pandas as pd
size_mult = 8
d1 = [1]*10**size_mult
d2 = [2]*10**size_mult
pd.__version__
'0.24.2'
Namely from_arrays, from_tuples, and from_frame:
# Cell from_arrays
%%time
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
# Cell from_tuples
%%time
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
# Cell from_frame
%%time
df = pd.DataFrame({'a':d1, 'b':d2})
index_frm = pd.MultiIndex.from_frame(df)
Corresponding outputs for cells:
# from_arrays
CPU times: user 1min 15s, sys: 6.58 s, total: 1min 21s
Wall time: 1min 21s
# from_tuples
CPU times: user 26.4 s, sys: 4.99 s, total: 31.4 s
Wall time: 31.3 s
# from_frame
CPU times: user 47.9 s, sys: 5.65 s, total: 53.6 s
Wall time: 53.7 s
And let's check that all three results are the same:
index_arr.difference(index_tup)
index_arr.difference(index_frm)
Both lines produce:
MultiIndex(levels=[[1], [2]],
codes=[[], []],
names=['a', 'b'])
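An empty difference only shows that every entry of index_arr also appears in the other index. A stricter check, sketched here on a much smaller stand-in for the data (the size of 1000 is an assumption chosen to keep it quick), is MultiIndex.equals:

```python
import pandas as pd

# Small stand-in data; the question uses 10**8 elements per list.
d1 = [1] * 1000
d2 = [2] * 1000

index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
index_tup = pd.MultiIndex.from_tuples(list(zip(d1, d2)), names=['a', 'b'])
index_frm = pd.MultiIndex.from_frame(pd.DataFrame({'a': d1, 'b': d2}))

# equals() compares the full sequence of entries, in order,
# rather than the set of distinct entries.
same_tup = index_arr.equals(index_tup)
same_frm = index_arr.equals(index_frm)
print(same_tup, same_frm)
```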
So why is there such a big difference? from_arrays is almost 3 times slower than from_tuples. It is even slower than creating a DataFrame and building the index on top of it.
EDIT:
I ran another, more general test, and surprisingly the result was the opposite:
import numpy as np
np.random.seed(232)
size_mult = 7
d1 = np.random.randint(0, 10**size_mult, 10**size_mult)
d2 = np.random.randint(0, 10**size_mult, 10**size_mult)
start = pd.Timestamp.now()
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print('ARR done in %f' % (pd.Timestamp.now()-start).total_seconds())
start = pd.Timestamp.now()
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print('TUP done in %f' % (pd.Timestamp.now()-start).total_seconds())
ARR done in 9.559764
TUP done in 70.457208
So now from_tuples is significantly slower, although the source data are the same for both methods.
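To compare both kinds of input side by side, a small harness along these lines may help. It is only a sketch: time_constructors is a hypothetical helper, the size n is an assumption chosen so it finishes quickly, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_constructors(d1, d2, names=('a', 'b')):
    """Return wall-clock seconds for from_arrays and from_tuples on the same data."""
    timings = {}

    start = time.perf_counter()
    pd.MultiIndex.from_arrays([d1, d2], names=list(names))
    timings['from_arrays'] = time.perf_counter() - start

    # The zip is materialized inside the timed region, since from_tuples
    # has to consume the iterator anyway.
    start = time.perf_counter()
    pd.MultiIndex.from_tuples(list(zip(d1, d2)), names=list(names))
    timings['from_tuples'] = time.perf_counter() - start

    return timings


n = 10**5  # assumed size, far smaller than in the question
rng = np.random.RandomState(232)

# Case 1: constant Python lists, as in the first test.
const_lists = time_constructors([1] * n, [2] * n)
# Case 2: random NumPy integer arrays, as in the second test.
random_arrays = time_constructors(rng.randint(0, n, n), rng.randint(0, n, n))

print(const_lists)
print(random_arrays)
```

Running both cases through the same function keeps the input type (Python lists versus NumPy arrays) as the only thing that changes between measurements.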
You can think of MultiIndex as an array of tuples where each tuple is unique.
A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()).
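As a minimal illustration of those four constructors, the following sketch builds the same small index each way. The sample values and the names 'a' and 'b' are made up for the example.

```python
import pandas as pd

# Illustrative data: the same four entries expressed in different shapes.
arrays = [[1, 1, 2, 2], ['x', 'y', 'x', 'y']]
tuples = [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]
frame = pd.DataFrame({'a': arrays[0], 'b': arrays[1]})

from_arrays = pd.MultiIndex.from_arrays(arrays, names=['a', 'b'])
from_tuples = pd.MultiIndex.from_tuples(tuples, names=['a', 'b'])
from_product = pd.MultiIndex.from_product([[1, 2], ['x', 'y']], names=['a', 'b'])
from_frame = pd.MultiIndex.from_frame(frame)

# All four should describe the same entries in the same order.
all_equal = all(
    from_arrays.equals(other)
    for other in (from_tuples, from_product, from_frame)
)
print(all_equal)
```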
Your second example makes more sense to me. Looking at the source code for Pandas, from_tuples actually calls from_arrays, so it makes sense to me that from_arrays will be faster.
from_tuples is also doing some extra steps here that cost more time:
You passed in zip(d1, d2), which is actually an iterator. from_tuples converts this into a list.
This extra work makes from_tuples significantly slower than from_arrays, right off the bat.
So overall, I'm not surprised from_tuples is slower, since it has to iterate through your list of tuples an extra two times (and do some extra stuff) before even making it to the from_arrays function (which iterates a couple more times, by the way) that it uses anyway.
from_tuples converts iterators to lists, then lists to arrays, then arrays into lists of arrays, then ultimately calls from_arrays on that.
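The chain of conversions described above can be sketched roughly as follows. This is not the pandas source: from_tuples_sketch is a hypothetical helper that only mirrors the described steps (materialize the iterator, transpose into per-level sequences, hand off to from_arrays), and the real implementation may differ in detail.

```python
import pandas as pd


def from_tuples_sketch(tuples, names=None):
    """Hypothetical helper mirroring the described steps; not the pandas source."""
    # Step 1: an iterator such as zip(...) is materialized into a list.
    tuples = list(tuples)
    # Step 2: the rows are transposed into one sequence per level.
    arrays = list(zip(*tuples))
    # Step 3: the per-level sequences go to from_arrays.
    return pd.MultiIndex.from_arrays(arrays, names=names)


d1 = [1, 1, 2]
d2 = [3, 4, 3]

sketch = from_tuples_sketch(zip(d1, d2), names=['a', 'b'])
direct = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])

matches = sketch.equals(direct)
print(matches)
```

Each step before the final call is a full pass over the data, which is the overhead that calling from_arrays directly avoids when the per-level arrays already exist.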