
Pandas multiindex creation performance

I ran performance tests for creating identical pd.MultiIndex objects with different class methods:

import pandas as pd

size_mult = 8
d1 = [1]*10**size_mult
d2 = [2]*10**size_mult

pd.__version__
'0.24.2'

Namely from_arrays, from_tuples and from_frame, each timed in its own notebook cell:

%%time
# Cell from_arrays (%%time has to be the first line of a cell)
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])

%%time
# Cell from_tuples
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])

%%time
# Cell from_frame (the timing includes building the DataFrame)
df = pd.DataFrame({'a':d1, 'b':d2})
index_frm = pd.MultiIndex.from_frame(df)

Corresponding outputs for cells:

# from_arrays
CPU times: user 1min 15s, sys: 6.58 s, total: 1min 21s
Wall time: 1min 21s
# from_tuples
CPU times: user 26.4 s, sys: 4.99 s, total: 31.4 s
Wall time: 31.3 s
# from_frame
CPU times: user 47.9 s, sys: 5.65 s, total: 53.6 s
Wall time: 53.7 s

Let's check that all three results are the same in this case:

index_arr.difference(index_tup)
index_arr.difference(index_frm)

Both lines produce an empty MultiIndex:

MultiIndex(levels=[[1], [2]],
           codes=[[], []],
           names=['a', 'b'])
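
Note that an empty difference only shows that every entry of the left-hand index also occurs in the right-hand one. A more direct comparison is MultiIndex.equals. Below is a minimal sketch on small stand-in data; the list length of 1000 is an assumption, chosen only to keep the example quick:

```python
import pandas as pd

# Small stand-in data (assumption); the question uses 10**8 elements per list.
d1 = [1] * 1000
d2 = [2] * 1000

index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
index_frm = pd.MultiIndex.from_frame(pd.DataFrame({'a': d1, 'b': d2}))

# equals() checks that both indexes hold the same entries in the same order.
same_as_tuples = index_arr.equals(index_tup)
same_as_frame = index_arr.equals(index_frm)
print(same_as_tuples, same_as_frame)
```

Unlike difference, this check is symmetric and also sensitive to ordering and duplicates.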

So why is there such a big difference? from_arrays is almost 3 times slower than from_tuples. It is even slower than creating a DataFrame and building the index on top of it.

EDIT:

I ran another, more general test, and surprisingly the result was the opposite:

import numpy as np
import pandas as pd

np.random.seed(232)

size_mult = 7
d1 = np.random.randint(0, 10**size_mult, 10**size_mult)
d2 = np.random.randint(0, 10**size_mult, 10**size_mult)

start = pd.Timestamp.now()
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print('ARR done in %f' % (pd.Timestamp.now()-start).total_seconds())

start = pd.Timestamp.now()
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print('TUP done in %f' % (pd.Timestamp.now()-start).total_seconds())

Output:

ARR done in 9.559764
TUP done in 70.457208

So now from_tuples is significantly slower, even though both methods are given the same source data.
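
For anyone who wants to reproduce the comparison without waiting minutes, here is a scaled-down sketch of the same test. The helper time_call and the size of 10**5 are assumptions introduced for illustration, and the absolute timings will differ by machine and pandas version:

```python
import time

import numpy as np
import pandas as pd


def time_call(func):
    """Hypothetical helper: return (result, elapsed seconds) for a callable."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


rng = np.random.RandomState(232)
n = 10**5  # assumed size, far smaller than the 10**7 used in the test above
d1 = rng.randint(0, n, n)
d2 = rng.randint(0, n, n)

index_arr, t_arr = time_call(
    lambda: pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b']))
index_tup, t_tup = time_call(
    lambda: pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b']))

print('ARR done in %f' % t_arr)
print('TUP done in %f' % t_tup)
```

time.perf_counter is used here instead of pd.Timestamp.now() because it is a monotonic clock intended for measuring short intervals.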

BeforeFlight asked Jun 13 '19 19:06



2 Answers

Your second example makes more sense to me. Looking at the pandas source code, from_tuples actually calls from_arrays, so it is reasonable for from_arrays to be faster.

from_tuples is also doing some extra steps here that cost more time:

  1. You passed in zip(d1, d2), which is an iterator. from_tuples first converts it into a list.
  2. Once it has a list of tuples, it takes an extra step to convert that list into a list of NumPy arrays.
  3. The previous step iterates through the list of tuples twice, which makes from_tuples significantly slower than from_arrays right off the bat.

So overall, I'm not surprised that from_tuples is slower: it has to iterate through your list of tuples two extra times (and do some additional work) before it even reaches from_arrays, which it calls anyway and which itself iterates over the data a couple more times.
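
The steps described above can be sketched roughly as follows. This is a simplified stand-in written with public APIs only, not the actual pandas implementation; the function name from_tuples_sketch is hypothetical and the input is assumed to be non-empty:

```python
import numpy as np
import pandas as pd


def from_tuples_sketch(tuples, names=None):
    """Rough stand-in for the path described above; not the pandas source."""
    # Step 1: an iterator such as zip(...) is materialised into a list.
    tuples = list(tuples)
    # Step 2: the list of row tuples is transposed into one array per level.
    arrays = [np.asarray(column) for column in zip(*tuples)]
    # Step 3: the actual construction is delegated to from_arrays.
    return pd.MultiIndex.from_arrays(arrays, names=names)


d1 = [1, 1, 2]
d2 = ['x', 'y', 'x']
sketch = from_tuples_sketch(zip(d1, d2), names=['a', 'b'])
reference = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print(sketch.equals(reference))
```

Every pass over the tuples in steps 1 and 2 is pure overhead compared with handing the two arrays to from_arrays directly.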

Caleb Courtney answered Oct 20 '22 05:10


from_tuples converts iterators to lists, then lists to arrays, then arrays into lists of arrays, then ultimately calls from_arrays on that.

Zach Langer answered Oct 20 '22 06:10