
Why not use .values rather than .iat for 6x performance improvement?

Tags: python, pandas

I was surprised at the 6x performance improvement I got by accessing series elements using my_series.values[0] rather than my_series.iat[0].

According to the documentation, .iat is the recommended way for fast access to scalars. Am I missing anything by using .values?

import numpy as np
import pandas as pd

n = 1000
dct = {'A': np.random.rand(n)}
df = pd.DataFrame(dct)
s = df['A']
vals = s.values

%timeit -n 10000 val = s.iloc[0]
%timeit -n 10000 val = s.iat[0]
%timeit -n 10000 val = s.values[0]
%timeit -n 10000 vals[0]

**Output**
10000 loops, best of 3: 24.3 µs per loop
10000 loops, best of 3: 13.4 µs per loop
10000 loops, best of 3: 2.06 µs per loop
10000 loops, best of 3: 337 ns per loop
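The `%timeit` lines above only run inside IPython. A rough equivalent for a plain Python script can be sketched with the standard `timeit` module. The `candidates` and `timings` names are illustrative, not part of the original post, and the lambda wrappers add a small constant overhead to every measurement, so absolute numbers will differ from the output above.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000
s = pd.Series(np.random.rand(n), name="A")
vals = s.values  # extract the underlying NumPy array once

# Each entry is a callable that performs exactly one scalar access.
candidates = {
    "s.iloc[0]": lambda: s.iloc[0],
    "s.iat[0]": lambda: s.iat[0],
    "s.values[0]": lambda: s.values[0],
    "vals[0]": lambda: vals[0],
}

loops = 10000
timings = {}
for label, fn in candidates.items():
    # Best of 3 repeats, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=loops, repeat=3))
    timings[label] = best / loops * 1e6
    print(f"{label:<12} {timings[label]:.3f} µs per loop")
```

Results depend heavily on the pandas and NumPy versions installed, so the ratios are worth re-measuring rather than assuming.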
jerenrich asked Oct 04 '16 15:10



1 Answer

Based on some experimentation, it appears that the speed difference between iat and values narrows considerably if you have multiple columns (which is usually the case).

n = 1000
dct = {'A': np.random.rand(n), 'B': np.random.rand(n) }
df = pd.DataFrame(dct)

%timeit df.iat[n-5,1]
100000 loops, best of 3: 9.72 µs per loop

%timeit df.B.values[n-5]
100000 loops, best of 3: 7.3 µs per loop

It can also matter whether you access the cell directly or first select the column and then the row.

In the case of iat, it's better to use it on the full dataframe:

%timeit df.iat[n-5,1]
100000 loops, best of 3: 9.72 µs per loop

%timeit df.B.iat[n-5]
100000 loops, best of 3: 15.4 µs per loop

But in the case of values, it's better to select the column, and then use values:

%timeit df.values[n-5,1]
100000 loops, best of 3: 9.42 µs per loop

%timeit df.B.values[n-5]
100000 loops, best of 3: 7.3 µs per loop

In any event, values was at worst comparable in speed to iat in these tests, so iat appears to add little over values for position-based indexing (unless you prefer the syntax).
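As a quick sanity check, the four positional access paths timed above should all return the same scalar. This is a minimal sketch assuming the same two-column, all-float frame; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})

row, col = n - 5, 1  # column position 1 is 'B'

via_df_iat = df.iat[row, col]         # iat on the full DataFrame
via_col_iat = df.B.iat[row]           # select the column, then iat
via_df_values = df.values[row, col]   # 2-D array of the whole frame
via_col_values = df.B.values[row]     # 1-D array of a single column

# All four routes address the same cell.
assert via_df_iat == via_col_iat == via_df_values == via_col_values
```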

Conversely, label-based indexing is not possible with .values, and in that case .at will be much faster than using .loc in combination with .values.
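To illustrate the label-based case, here is a small sketch. The string index labels and the `target` name are hypothetical, and the `.loc` plus `.values` line is only one possible reading of "loc in combination with values".

```python
import numpy as np
import pandas as pd

n = 1000
labels = [f"row{i}" for i in range(n)]  # hypothetical string labels
df = pd.DataFrame(
    {"A": np.random.rand(n), "B": np.random.rand(n)}, index=labels
)

target = "row995"

# Label-based scalar access in a single call.
via_at = df.at[target, "B"]

# One reading of "loc in combination with values":
# select by label with .loc, then pull the scalar out of the array.
via_loc_values = df.loc[[target], "B"].values[0]

# .values only understands positions, so the label has to be
# translated to an integer position first (unique index assumed).
pos = df.index.get_loc(target)
via_pos_values = df.B.values[pos]

assert via_at == via_loc_values == via_pos_values
```

All three routes reach the same cell; only `.at` does so from the label alone, without an intermediate selection or a position lookup.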

(timings above using pandas version 0.18.0)

JohnE answered Nov 15 '22 19:11
