I ran a comparison of several ways to access data in a DataFrame. See results below. The quickest access was from using the get_value method on a DataFrame. I was referred to this on this post.
What I was surprised by is that the access via get_value is quicker than accessing via the underlying numpy object df.values.
My question is, is there a way to access elements of a numpy array as quickly as I can access a pandas dataframe via get_value?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4, 4))
%%timeit
df.iloc[2, 2]
10000 loops, best of 3: 108 µs per loop
%%timeit
df.values[2, 2]
The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop
%%timeit
df.iat[2, 2]
The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop
%%timeit
df.get_value(2, 2)
The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop
iloc is pretty general, accepting slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first determines that it is a valid integer, then converts the request to an iat index, so clearly it will be much slower. iat eventually resolves down to a call to get_value, so naturally a direct call to get_value is going to be fast. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.
df.values does return an ndarray, but only after checking that it is a single contiguous block. This requires a few lookups and tests, so it is a little slower than retrieving the value from the cache.
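If the cost really is in the checks that df.values performs on each call, one way to sidestep it is to fetch the ndarray once and index that directly. The following is only a minimal sketch under that assumption; the names arr, total and value are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Fetch the underlying ndarray once, outside any hot loop, so the
# per-call work done by the .values accessor is paid a single time.
arr = df.values

# Inside the loop, index the plain ndarray instead of the DataFrame.
total = 0
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        total += arr[i, j]

# A single scalar lookup on the cached array.
value = arr[2, 2]
print(value, total)
```

This only helps when many lookups are made against the same frame and the data is of a uniform type; for a one-off lookup the extraction cost is still paid once.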
We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:
In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop
In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop
In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop
In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop
In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop
In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop
The code claims that ix is the most general, and so should in theory be slower than iloc; it may be that your particular test favours ix but other tests may favour iloc, just because of the order of the tests needed to identify the index as a scalar index.
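For anyone wanting to repeat the fresh-frame comparison outside IPython, here is a rough sketch using the standard timeit module. The helpers make_df and time_fresh are hypothetical names introduced for illustration. get_value and ix are left out because they were removed in pandas 1.0, so the sketch only covers accessors that still exist. Each timing includes the cost of constructing the frame, as in the In [112] to In [116] runs above.

```python
import timeit

import numpy as np
import pandas as pd


def make_df():
    # Same 4x4 integer frame as in the question.
    return pd.DataFrame(np.arange(16).reshape(4, 4))


# Scalar accessors to compare; each takes a frame and returns element (2, 2).
accessors = {
    "values": lambda df: df.values[2, 2],
    "iat": lambda df: df.iat[2, 2],
    "iloc": lambda df: df.iloc[2, 2],
}


def time_fresh(accessor, number=1000):
    # Build a new frame on every call so nothing cached on a previous
    # frame can be reused; returns the mean seconds per call.
    return timeit.timeit(lambda: accessor(make_df()), number=number) / number


results = {name: time_fresh(fn) for name, fn in accessors.items()}
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {seconds * 1e6:.1f} us per call (including construction)")
```

The absolute numbers will depend on the pandas version and the machine, so treat the ordering as indicative only.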