
Why does pandas .loc speed depend on DataFrame initialization? How to make MultiIndex .loc as fast as possible?

I am trying to improve the performance of some code. I use Pandas 0.19.2 and Python 3.5.

I just realized that writing a whole batch of values at once with .loc runs at very different speeds depending on how the dataframe was initialized.

Can someone explain why, and tell me what the best initialization is? It would allow me to speed up my code.

Here is a toy example. I create 'similar' dataframes.

import pandas as pd
import numpy as np


ncols = 1000
nlines = 1000

columns = pd.MultiIndex.from_product([[0], [0], np.arange(ncols)])
lines = pd.MultiIndex.from_product([[0], [0], np.arange(nlines)])

#df has multiindex
df = pd.DataFrame(columns = columns, index = lines)

#df2 has mono-index, and is initialized a certain way
df2 = pd.DataFrame(columns = np.arange(ncols), index = np.arange(nlines))
for i in range(ncols):
    df2[i] = i*np.arange(nlines)

#df3 is mono-index and not initialized
df3 = pd.DataFrame(columns = np.arange(ncols), index = np.arange(nlines))

#df4 is mono-index and initialized another way compared to df2
df4 = pd.DataFrame(columns = np.arange(ncols), index = np.arange(nlines))
for i in range(ncols):
    df4[i] = i

Then I time them:

%timeit df.loc[(0, 0, 0), (0, 0)] = 2*np.arange(ncols)
1 loop, best of 3: 786 ms per loop
The slowest run took 69.10 times longer than the fastest. This could mean that an intermediate result is being cached.

%timeit df2.loc[0] = 2*np.arange(ncols)
1000 loops, best of 3: 275 µs per loop

%timeit df3.loc[0] = 2*np.arange(ncols)
10 loops, best of 3: 31.4 ms per loop

%timeit df4.loc[0] = 2*np.arange(ncols)
10 loops, best of 3: 63.9 ms per loop

Have I done anything wrong? Why is df2 performing so much faster than the others? In the multi-index case it is actually much faster to set the elements one by one using .at. I implemented this in my code, but I'm not happy with it; I think there must be a better solution. I would prefer to keep my nice multi-index dataframes, but if I really need to go mono-index I'll do it.

def mod(df, arr, ncols):
    for j in range(ncols):
        df.at[(0, 0, 0),(0, 0, j)] = arr[j]
    return df

%timeit mod(df, np.arange(ncols), ncols)
The slowest run took 10.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 14.6 ms per loop
tk. asked Jan 20 '17 22:01


1 Answer

One difference I see here is that you have (effectively) initialized df2 & df4 with dtype=int64 but df & df3 with dtype=object. You could initialize df & df3 with empty real values like this:

#df has multiindex; the data shape is (rows, columns) = (nlines, ncols)
df = pd.DataFrame(np.empty([nlines, ncols]),
                  columns = columns, index = lines)

#df3 is mono-index, preallocated as float64 instead of object
df3 = pd.DataFrame(np.empty([nlines, ncols]),
                   columns = np.arange(ncols), index = np.arange(nlines))

You could also add dtype=int to initialize as integers rather than reals, but that didn't seem to matter as far as speed goes.
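To make the dtype difference concrete, here is a minimal sketch that inspects the column dtypes of a frame built without data versus one built from a float array. The small sizes and the names uninitialized and preallocated are assumptions for illustration, and np.zeros stands in for np.empty only so the contents are deterministic.

```python
import numpy as np
import pandas as pd

ncols, nlines = 5, 4  # small illustrative sizes, not the 1000 x 1000 case above

# No data supplied: pandas fills the frame with NaN stored as object dtype.
uninitialized = pd.DataFrame(columns=np.arange(ncols), index=np.arange(nlines))

# Data supplied as a float array of shape (rows, columns): float64 columns.
preallocated = pd.DataFrame(np.zeros((nlines, ncols)),
                            columns=np.arange(ncols), index=np.arange(nlines))

# Count how many columns have each dtype.
object_cols = int((uninitialized.dtypes == object).sum())
float_cols = int((preallocated.dtypes == np.float64).sum())
print(object_cols, float_cols)
```

Checking df.dtypes this way before timing is a quick way to see whether a frame fell back to object columns.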

I get a much faster timing than you did for df4 (with no difference in code), so that's a mystery to me. Anyway, with the above changes to df & df3 the timings are close for df2 to df4, but unfortunately df is still quite slow.

%timeit df.loc[(0, 0, 0), (0, 0)] = 2*np.arange(ncols)
1 loop, best of 3: 418 ms per loop

%timeit df2.loc[:,0] = 2*np.arange(ncols)
10000 loops, best of 3: 185 µs per loop

%timeit df3.loc[0] = 2*np.arange(ncols)
10000 loops, best of 3: 116 µs per loop

%timeit df4.loc[:,0] = 2*np.arange(ncols)
10000 loops, best of 3: 196 µs per loop

Edit to add:

As for your larger problem with the multi-index, I dunno, but 2 thoughts:

1) Expanding on @ptrj's comment, I get a very fast timing for his suggestion (about the same as the simple-index methods):

%timeit df.loc[(0, 0, 0) ] = 2*np.arange(ncols)
10000 loops, best of 3: 133 µs per loop

So I again get a very different timing from you (?). And FWIW, when you want the whole row with loc/iloc it is recommended to use : rather than leaving the column reference blank:

%timeit df.loc[(0, 0, 0), : ] = 2*np.arange(ncols)
1000 loops, best of 3: 223 µs per loop

But as you can see it's a bit slower, so I dunno which way to suggest here. I guess you should generally do it as recommended by the documentation, but on the other hand this may be an important difference in speed for you.
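As a sanity check that the two spellings differ only in speed, the following sketch writes the same row both ways on small float frames and compares the results. The frame names a and b and the small sizes are assumptions for illustration; this says nothing about timings on any particular pandas version.

```python
import numpy as np
import pandas as pd

ncols, nlines = 6, 4  # small illustrative sizes
columns = pd.MultiIndex.from_product([[0], [0], np.arange(ncols)])
lines = pd.MultiIndex.from_product([[0], [0], np.arange(nlines)])

a = pd.DataFrame(np.zeros((nlines, ncols)), columns=columns, index=lines)
b = a.copy()

new_row = 2.0 * np.arange(ncols)

# Row label only, column reference left blank.
a.loc[(0, 0, 0)] = new_row

# Row label plus an explicit ":" for all columns.
b.loc[(0, 0, 0), :] = new_row

same = a.equals(b)
print(same)
```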

2) Alternatively, this is rather brute force-ish, but you could just save your index/columns, reset the index/columns to be simple, then set index/columns back to multi. Although, that's not really any different from just taking df.values and I suspect not that convenient for you.
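A rough sketch of that second idea, assuming a float-initialized frame and small illustrative sizes: look up the integer position of the MultiIndex label once, swap in simple indexes, write the row, then restore the originals. The names row_pos, saved_index and saved_columns are made up for this example.

```python
import numpy as np
import pandas as pd

ncols, nlines = 5, 4  # small illustrative sizes
columns = pd.MultiIndex.from_product([[0], [0], np.arange(ncols)])
lines = pd.MultiIndex.from_product([[0], [0], np.arange(nlines)])
df = pd.DataFrame(np.zeros((nlines, ncols)), columns=columns, index=lines)

# Translate the MultiIndex row label into an integer position once.
row_pos = df.index.get_loc((0, 0, 0))

# Save the MultiIndexes and temporarily replace them with simple ones.
saved_index, saved_columns = df.index, df.columns
df.index = pd.RangeIndex(nlines)
df.columns = pd.RangeIndex(ncols)

# With simple indexes, the row label equals its position.
df.loc[row_pos] = 2.0 * np.arange(ncols)

# Put the MultiIndexes back.
df.index = saved_index
df.columns = saved_columns

written = df.loc[(0, 0, 0)].values
print(written)
```

Whether the swap pays off depends on how many writes happen between saving and restoring the indexes; for a single write the bookkeeping may cost more than it saves.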

JohnE answered Sep 28 '22 06:09