I'm trying to figure out when to use different selecting methods in pandas DataFrame. In particular, I'm looking for accessing scalar values. I often hear ix
being generally recommended. But in pandas documentation
it's recommended to use at
and iat
for fast scalar value accessing:
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the
atand iat methods, which are implemented on all of the data structures.
So, I would assume iat
should be faster for getting and setting individual cells. However, after some tests, we found that ix
would be comparable or faster for reading cells, while iat
much faster for assigning values to cells.
Is this behavior documented anywhere? Is it always the case and why does this happen? Does it have to do something with returning view or copy? I would appreciate if someone could put any light on this question and explain what is recommended for getting and setting cell values and why.
Here are some tests using pandas (version 0.15.2).
Just to make sure that this behavior is not a bug of this version, I also tested it on 0.11.0. I do not provide the results, but the trend is exactly the same - ix being much faster for getting, and iat for setting individual cells
.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
timeit for i in range(1000): df.ix[i,'A'] = 1
timeit for i in range(1000): df.iat[i,idx] = 2
>> 10 loops, best of 3: 92.6 ms per loop
>> 10 loops, best of 3: 21.7 ms per loop
timeit for i in range(1000): tmp = df.ix[i,'A']
timeit for i in range(1000): tmp = df.iat[i,idx]
>> 100 loops, best of 3: 5.31 ms per loop
>> 10 loops, best of 3: 19.4 ms per loop
Pandas does some pretty interesting things with the indexing classes. I don't think I am capable of describing a simple way to know which to use but I can give some insight on the implementation.
DataFrame#ix
is an _IXIndexer
which does not declare its own __getitem__
or __setitem__
. These two methods are important because they control how values are accessed with Pandas. Since _IXIndexer
does not declare these methods the super class _NDFrameIndexer
's are used instead.
Further digging on the _NDFrameIndexer
's __getitem__
shows that it is relatively simple and in some cases wraps the logic found in get_value
. Then __getitem__
is close to as fast as get_value
for some scenarios.
_NDFrameIndexer
's __setitem__
is a different story. At first it looks simple but the second method it calls is _setitem_with_indexer
which does a considerable amount of work for most scenarios.
This information suggests that calls to get values using ix
are limited by get_value
in the best case and calls to set values using ix
would take a core committer to explain.
Now for DataFrame#iat
which is an _iAtIndexer
which also doesn't declare its own __getitem__
or __setitem__
therefor falling back to its super class _ScalarAccessIndexer
's implementation.
_ScalarAccessIndexer
has a simple __getitem__
implementation but it requires a loop in order to convert the key into the proper format. The additional loop adds some extra processing time before calling get_value
.
_ScalarAccessIndexer
also has a fairly simple __setitem__
implementation which converts the key the parameters set_value
requires before setting the value.
This information suggests that calls to get values using iat
are limited by get_value
as well as a for loop. Setting values with iat
are primarily limited by calls to set_value
. So getting values with iat
has a bit of an overhead, while setting them has a smaller overhead.
TL;DR
I believe you are using the correct accessor for an Int64Index
index based on the documentation but I don't think that means it is the fastest. The best performance can be found using get_value
and set_value
directly but they require an extra depth of knowledge in how Pandas DataFrames are implemented.
Notes
It is worth noting that the documentation on Pandas mentions that get_value
and set_value
are deprecated which I believe was meant to be iget_value
instead.
Examples
In order to show the difference in performance using a few indexers (including directly calling get_value
and set_value
) I made this script:
example.py
:
import timeit
def print_index_speed(stmnt_name, stmnt):
"""
Repeatedly run the statement provided then repeat the process and take the
minimum execution time.
"""
setup = """
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
"""
minimum_execution_time = min(
timeit.Timer(stmnt, setup=setup).repeat(5, 10))
print("{stmnt_name}: {time}".format(
stmnt_name=stmnt_name,
time=round(minimum_execution_time, 5)))
print_index_speed("set ix", "for i in range(1000): df.ix[i, 'A'] = 1")
print_index_speed("set at", "for i in range(1000): df.at[i, 'A'] = 2")
print_index_speed("set iat", "for i in range(1000): df.iat[i, idx] = 3")
print_index_speed("set loc", "for i in range(1000): df.loc[i, 'A'] = 4")
print_index_speed("set iloc", "for i in range(1000): df.iloc[i, idx] = 5")
print_index_speed(
"set_value scalar",
"for i in range(1000): df.set_value(i, idx, 6, True)")
print_index_speed(
"set_value label",
"for i in range(1000): df.set_value(i, 'A', 7, False)")
print_index_speed("get ix", "for i in range(1000): tmp = df.ix[i, 'A']")
print_index_speed("get at", "for i in range(1000): tmp = df.at[i, 'A']")
print_index_speed("get iat", "for i in range(1000): tmp = df.iat[i, idx]")
print_index_speed("get loc", "for i in range(1000): tmp = df.loc[i, 'A']")
print_index_speed("get iloc", "for i in range(1000): tmp = df.iloc[i, idx]")
print_index_speed(
"get_value scalar",
"for i in range(1000): tmp = df.get_value(i, idx, True)")
print_index_speed(
"get_value label",
"for i in range(1000): tmp = df.get_value(i, 'A', False)")
Output:
set ix: 0.9918
set at: 0.06801
set iat: 0.08606
set loc: 1.04173
set iloc: 1.0021
set_value: 0.0452
**set_value**: 0.03516
get ix: 0.04827
get at: 0.06889
get iat: 0.07813
get loc: 0.8966
get iloc: 0.87484
get_value: 0.04994
**get_value**: 0.03111
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With