Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas scalar value getting and setting: ix or iat?

I'm trying to figure out when to use different selecting methods in pandas DataFrame. In particular, I'm looking for accessing scalar values. I often hear ix being generally recommended. But in pandas documentation it's recommended to use at and iat for fast scalar value accessing:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use theatand iat methods, which are implemented on all of the data structures.

So, I would assume iat should be faster for getting and setting individual cells. However, after some tests, we found that ix would be comparable or faster for reading cells, while iat much faster for assigning values to cells.

Is this behavior documented anywhere? Is it always the case and why does this happen? Does it have to do something with returning view or copy? I would appreciate if someone could put any light on this question and explain what is recommended for getting and setting cell values and why.

Here are some tests using pandas (version 0.15.2).

Just to make sure that this behavior is not a bug of this version, I also tested it on 0.11.0. I do not provide the results, but the trend is exactly the same - ix being much faster for getting, and iat for setting individual cells.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0

timeit for i in range(1000): df.ix[i,'A'] = 1
timeit for i in range(1000): df.iat[i,idx] = 2

>> 10 loops, best of 3: 92.6 ms per loop
>> 10 loops, best of 3: 21.7 ms per loop

timeit for i in range(1000): tmp = df.ix[i,'A'] 
timeit for i in range(1000): tmp = df.iat[i,idx] 

>> 100 loops, best of 3: 5.31 ms per loop
>> 10 loops, best of 3: 19.4 ms per loop
like image 254
ojy Avatar asked Mar 06 '15 22:03

ojy


1 Answers

Pandas does some pretty interesting things with the indexing classes. I don't think I am capable of describing a simple way to know which to use but I can give some insight on the implementation.

DataFrame#ix is an _IXIndexer which does not declare its own __getitem__ or __setitem__. These two methods are important because they control how values are accessed with Pandas. Since _IXIndexer does not declare these methods the super class _NDFrameIndexer's are used instead.

Further digging on the _NDFrameIndexer's __getitem__ shows that it is relatively simple and in some cases wraps the logic found in get_value. Then __getitem__ is close to as fast as get_value for some scenarios.

_NDFrameIndexer's __setitem__ is a different story. At first it looks simple but the second method it calls is _setitem_with_indexer which does a considerable amount of work for most scenarios.

This information suggests that calls to get values using ix are limited by get_value in the best case and calls to set values using ix would take a core committer to explain.

Now for DataFrame#iat which is an _iAtIndexer which also doesn't declare its own __getitem__ or __setitem__ therefor falling back to its super class _ScalarAccessIndexer's implementation.

_ScalarAccessIndexer has a simple __getitem__ implementation but it requires a loop in order to convert the key into the proper format. The additional loop adds some extra processing time before calling get_value.

_ScalarAccessIndexer also has a fairly simple __setitem__ implementation which converts the key the parameters set_value requires before setting the value.

This information suggests that calls to get values using iat are limited by get_value as well as a for loop. Setting values with iat are primarily limited by calls to set_value. So getting values with iat has a bit of an overhead, while setting them has a smaller overhead.

TL;DR

I believe you are using the correct accessor for an Int64Index index based on the documentation but I don't think that means it is the fastest. The best performance can be found using get_value and set_value directly but they require an extra depth of knowledge in how Pandas DataFrames are implemented.

Notes

It is worth noting that the documentation on Pandas mentions that get_value and set_value are deprecated which I believe was meant to be iget_value instead.

Examples

In order to show the difference in performance using a few indexers (including directly calling get_value and set_value) I made this script:

example.py:

import timeit


def print_index_speed(stmnt_name, stmnt):
    """
    Repeatedly run the statement provided then repeat the process and take the
    minimum execution time.
    """
    setup = """
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000,2),columns = ['A','B'])
idx = 0
    """

    minimum_execution_time = min(
        timeit.Timer(stmnt, setup=setup).repeat(5, 10))

    print("{stmnt_name}: {time}".format(
        stmnt_name=stmnt_name,
        time=round(minimum_execution_time, 5)))

print_index_speed("set ix", "for i in range(1000): df.ix[i, 'A'] = 1")
print_index_speed("set at", "for i in range(1000): df.at[i, 'A'] = 2")
print_index_speed("set iat", "for i in range(1000): df.iat[i, idx] = 3")
print_index_speed("set loc", "for i in range(1000): df.loc[i, 'A'] = 4")
print_index_speed("set iloc", "for i in range(1000): df.iloc[i, idx] = 5")
print_index_speed(
    "set_value scalar",
    "for i in range(1000): df.set_value(i, idx, 6, True)")
print_index_speed(
    "set_value label",
    "for i in range(1000): df.set_value(i, 'A', 7, False)")

print_index_speed("get ix", "for i in range(1000): tmp = df.ix[i, 'A']")
print_index_speed("get at", "for i in range(1000): tmp = df.at[i, 'A']")
print_index_speed("get iat", "for i in range(1000): tmp = df.iat[i, idx]")
print_index_speed("get loc", "for i in range(1000): tmp = df.loc[i, 'A']")
print_index_speed("get iloc", "for i in range(1000): tmp = df.iloc[i, idx]")
print_index_speed(
    "get_value scalar",
    "for i in range(1000): tmp = df.get_value(i, idx, True)")
print_index_speed(
    "get_value label",
    "for i in range(1000): tmp = df.get_value(i, 'A', False)")

Output:

set ix: 0.9918
set at: 0.06801
set iat: 0.08606
set loc: 1.04173
set iloc: 1.0021
set_value: 0.0452
**set_value**: 0.03516
get ix: 0.04827
get at: 0.06889
get iat: 0.07813
get loc: 0.8966
get iloc: 0.87484
get_value: 0.04994
**get_value**: 0.03111
like image 65
erik-e Avatar answered Oct 19 '22 22:10

erik-e