
How do I access a numpy array as quickly as a pandas DataFrame?

I compared several ways of accessing a single value in a DataFrame; the results are below. The quickest was the get_value method on the DataFrame, which I was pointed to in another post.

What surprised me is that access via get_value is quicker than access via the underlying numpy array, df.values.

Question

Is there a way to access elements of a numpy array as quickly as I can access a pandas DataFrame via get_value?

Setup

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4))

Testing

%%timeit
df.iloc[2, 2]

10000 loops, best of 3: 108 µs per loop

%%timeit
df.values[2, 2]

The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 8.02 µs per loop

%%timeit
df.iat[2, 2]

The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 9.85 µs per loop

%%timeit
df.get_value(2, 2)

The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 3.57 µs per loop

piRSquared asked Oct 30 '22 03:10



1 Answer

iloc is quite general: it accepts slices and lists as well as simple integers. In the case above, with simple integer indexing, pandas first checks that the key is a valid integer and then converts the request to an iat lookup, so it is bound to be much slower. iat eventually resolves to a call to get_value, so a direct call to get_value is naturally the fastest of the three. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.
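To make the difference in generality concrete, here is a minimal sketch. It assumes a recent pandas in which get_value is no longer available (it was deprecated in 0.21 and removed in 1.0), so iat stands in for the scalar-only path. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# iloc accepts scalars, slices and lists, so it has to inspect the key
# before it knows which kind of lookup is being requested.
scalar_via_iloc = df.iloc[2, 2]
column_slice = df.iloc[0:2, 2]          # Series holding rows 0 and 1 of column 2
picked_block = df.iloc[[0, 3], [1, 2]]  # 2x2 DataFrame

# iat accepts only a pair of integer positions.
scalar_via_iat = df.iat[2, 2]

# Anything other than scalar positions is rejected by iat.
try:
    df.iat[0:2, 2]
    iat_accepts_slice = True
except (ValueError, TypeError):
    iat_accepts_slice = False
```

Both scalar lookups return the same element; only iloc also handles the slice and list forms.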

df.values does return an ndarray, but only after checking that the data is a single contiguous block. This requires a few lookups and tests, so it is a little slower than retrieving the value from the cache.

We can defeat the caching by creating a new DataFrame every time. This shows that the values accessor is fastest, at least for data of a uniform type:
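If the overhead sits in the df.values lookup rather than in the ndarray indexing itself, one way to approach the original question is to fetch the array once and index that object directly. The following is a sketch under that assumption, not a measured result: it uses to_numpy() (available from pandas 0.24; df.values on older versions), and the function names are made up for the comparison. Timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Pay the DataFrame-to-ndarray cost once, outside the hot loop.
arr = df.to_numpy()  # use df.values on pandas older than 0.24


def via_values_each_time():
    # Re-fetches the ndarray from the DataFrame on every call.
    return df.values[2, 2]


def via_hoisted_array():
    # Indexes the ndarray that was fetched once above.
    return arr[2, 2]


def via_item():
    # ndarray.item returns a plain Python scalar instead of a numpy scalar.
    return arr.item(2, 2)


n = 100_000
for fn in (via_values_each_time, via_hoisted_array, via_item):
    seconds = timeit.timeit(fn, number=n)
    print(f"{fn.__name__}: {seconds / n * 1e6:.3f} us per call")
```

This only helps when the same frame is read many times and its contents do not change between reads.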

In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop

In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop

In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop

In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop

In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop

In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop

The code claims that ix is the most general indexer, so in theory it should be slower than iloc. It may be that this particular test favours ix while other tests favour iloc, simply because of the order of the checks needed to identify the index as a scalar index.
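The cache-defeating session above can also be written as a standalone script. This is a sketch assuming pandas 1.0 or later, where get_value and ix have been removed, so only values, iat and iloc are compared. The helper names build and benchmark are hypothetical, and no particular ordering of the results is implied.

```python
import timeit

import numpy as np
import pandas as pd


def build():
    # A fresh DataFrame on every call, so nothing cached on a previous
    # frame can be reused.
    return pd.DataFrame(np.arange(16).reshape(4, 4))


candidates = {
    "construct only": lambda: build(),
    "values": lambda: build().values[2, 2],
    "iat": lambda: build().iat[2, 2],
    "iloc": lambda: build().iloc[2, 2],
}


def benchmark(number=1000, repeat=3):
    # Best-of-`repeat` time per loop, in microseconds, for each candidate.
    results = {}
    for name, fn in candidates.items():
        best = min(timeit.repeat(fn, number=number, repeat=repeat))
        results[name] = best / number * 1e6
    return results


results = benchmark()
baseline = results["construct only"]
for name, micros in results.items():
    print(f"{name:>14}: {micros:8.1f} us per loop "
          f"({micros - baseline:+.1f} vs construct only)")
```

Subtracting the "construct only" figure gives a rough estimate of the access cost alone, although for a frame this small the construction time dominates and the difference is noisy.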

Neapolitan answered Nov 14 '22 03:11
