Pandas multiindex creation performance

Performance tests for creating identical pd.MultiIndex objects using different class methods:

import pandas as pd

size_mult = 8
d1 = [1]*10**size_mult
d2 = [2]*10**size_mult

pd.__version__

'0.24.2'

Namely from_arrays, from_tuples, and from_frame:

# Cell from_arrays
%%time
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])

# Cell from_tuples
%%time
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])

# Cell from_frame
%%time
df = pd.DataFrame({'a': d1, 'b': d2})
index_frm = pd.MultiIndex.from_frame(df)

Corresponding outputs for cells:

# from_arrays
CPU times: user 1min 15s, sys: 6.58 s, total: 1min 21s
Wall time: 1min 21s
# from_tuples
CPU times: user 26.4 s, sys: 4.99 s, total: 31.4 s
Wall time: 31.3 s
# from_frame
CPU times: user 47.9 s, sys: 5.65 s, total: 53.6 s
Wall time: 53.7 s

Let's check that all three results are the same:

index_arr.difference(index_tup)
index_arr.difference(index_frm)

Both lines produce an empty MultiIndex:

MultiIndex(levels=[[1], [2]],
           codes=[[], []],
           names=['a', 'b'])
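An empty difference only shows that no label in index_arr is missing from the other index; it ignores order and duplicates. A stricter check is a position-by-position comparison with equals(). Here is a minimal sketch, assuming small stand-in lists in place of the real d1 and d2 so that it runs instantly:

```python
import pandas as pd

# Small stand-ins for d1 and d2 (assumption: the real lists are 10**8 long).
d1 = [1] * 5
d2 = [2] * 5

index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
index_frm = pd.MultiIndex.from_frame(pd.DataFrame({'a': d1, 'b': d2}))

# equals() compares the labels element by element, in order.
same_as_tup = index_arr.equals(index_tup)
same_as_frm = index_arr.equals(index_frm)
print(same_as_tup, same_as_frm)
```

If both values are True, the three constructors produced the same labels in the same order.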

So why is there such a big difference? from_arrays is almost 3 times slower than from_tuples. It is even slower than creating a DataFrame and building the index on top of it.

EDIT:

I ran another, more general test, and the result was surprisingly the opposite:

import numpy as np

np.random.seed(232)

size_mult = 7
d1 = np.random.randint(0, 10**size_mult, 10**size_mult)
d2 = np.random.randint(0, 10**size_mult, 10**size_mult)

start = pd.Timestamp.now()
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print('ARR done in %f' % (pd.Timestamp.now()-start).total_seconds())

start = pd.Timestamp.now()
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print('TUP done in %f' % (pd.Timestamp.now()-start).total_seconds())

ARR done in 9.559764
TUP done in 70.457208

So now from_tuples is significantly slower, even though both methods receive the same source data.

Your second example makes more sense to me. Looking at the pandas source code, from_tuples actually calls from_arrays, so I would expect from_arrays to be faster.

from_tuples also does some extra work here that costs time:

1. You passed in zip(d1, d2), which is an iterator. from_tuples converts it into a list.
2. Once it is a list of tuples, an extra step converts it into a list of NumPy arrays.
3. That step iterates through the list of tuples twice, which makes from_tuples significantly slower than from_arrays right off the bat.

So overall, I'm not surprised that from_tuples is slower: it has to iterate through your list of tuples two extra times (and do some additional work) before it even reaches from_arrays, which it calls anyway (and which itself iterates a couple more times).
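The steps above can be sketched roughly in plain Python. This is only an illustration of the shape of the work, not the actual pandas internals: zip(*tuples) is an assumed stand-in for the internal conversion, and the tiny arrays are made up for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up inputs standing in for the d1 and d2 of the question.
d1 = np.array([3, 1, 2])
d2 = np.array([6, 5, 4])

# Step 1: zip() is a lazy iterator, so it has to be materialized first.
tuples = list(zip(d1, d2))

# Step 2: transpose the list of tuples back into one sequence per level.
# zip(*tuples) is a plain-Python stand-in for the internal conversion.
arrays = [list(level) for level in zip(*tuples)]

# Step 3: hand the per-level sequences to from_arrays.
via_tuples = pd.MultiIndex.from_arrays(arrays, names=['a', 'b'])

# Going straight to from_arrays skips steps 1 and 2 entirely.
direct = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
round_trip_matches = via_tuples.equals(direct)
print(round_trip_matches)
```

The point of the sketch is that the data starts as per-level arrays, gets zipped into row tuples, and then has to be unzipped again before from_arrays can use it.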

from_tuples converts iterators to lists, then lists to arrays, then arrays into lists of arrays, then ultimately calls from_arrays on that.
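To see how much of the gap comes from these conversions on your own machine and pandas version, something like the following could be used. It is a hedged sketch: the size n, the function names, and the use of timeit are my choices, not part of the question, and the numbers will vary by environment.

```python
import timeit

import numpy as np
import pandas as pd

n = 10**4  # assumption: far smaller than in the question, so this runs quickly

rng = np.random.RandomState(232)
a1 = rng.randint(0, n, n)
a2 = rng.randint(0, n, n)

# Materialized once, outside any timed call.
tuples_list = list(zip(a1, a2))

def build_from_arrays():
    return pd.MultiIndex.from_arrays([a1, a2], names=['a', 'b'])

def build_from_tuples_iterator():
    # zip() is consumed inside from_tuples, so create a fresh one each run.
    return pd.MultiIndex.from_tuples(zip(a1, a2), names=['a', 'b'])

def build_from_tuples_list():
    return pd.MultiIndex.from_tuples(tuples_list, names=['a', 'b'])

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
timings = {
    f.__name__: min(timeit.repeat(f, number=3, repeat=3)) / 3
    for f in (build_from_arrays, build_from_tuples_iterator, build_from_tuples_list)
}
for name, seconds in timings.items():
    print('%-28s %.4f s per call' % (name, seconds))
```

Comparing the iterator and list variants of from_tuples shows the cost of materializing the zip, and comparing either of them with from_arrays shows the cost of the remaining conversions.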

Tags: performance, python, pandas, multi-index

BeforeFlight

2 Answers

Caleb Courtney

Zach Langer
