
Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?

Try this for yourself:

    import pandas as pd
    s = pd.Series(xrange(5000000))
    %timeit s.loc[[0]]  # You need pandas 0.15.1 or newer for it to be that slow
    1 loops, best of 3: 445 ms per loop

Update: this is a legitimate bug in pandas that was probably introduced in 0.15.1, in August 2014 or so. Workarounds: keep using an older version of pandas until a new release comes out; get a cutting-edge development version from GitHub; manually apply a one-line modification to your installed release of pandas; or temporarily use .ix instead of .loc.

I have a DataFrame with 4.8 million rows, and selecting a single row using .loc[[id]] (with a single-element list) takes 489 ms, almost half a second. That is 1,800x slower than the identical .ix[[id]], and 3,500x slower than .loc[id] (passing the id as a value, not as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is a thousand times faster and produces an identical result. It was my understanding that .ix was supposed to be slower, wasn't it?

I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.

Here is an IPython session with the benchmarks:

    print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
    print 'df.index begins with ', df.index[:20]
    print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

    # First extract one element directly. Expected result, no issues here.
    id=5965356
    print 'Extract one element with id %d' % id
    %timeit df.loc[id]
    %timeit df.ix[id]
    print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

    # Now extract this one element as a list.
    %timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
    %timeit df.ix[[id]] 
    print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]]))  # this one should be True
    # Let's double-check that in this case .ix is the same as .loc, not .iloc, 
    # as this would explain the difference.
    try:
        print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
    except IndexError:
        print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

    # Finally, for the sake of completeness, let's take a look at iloc
    %timeit df.iloc[3456789]    # this is still 100+ times faster than the next version
    %timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with  Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop
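
For reference, the ratios quoted in the question follow directly from the timings above. Here is a small sketch of the arithmetic, written for Python 3 rather than the Python 2 used in the session; the timings dict and the slowdown helper are just illustrative names, not part of pandas:

```python
# Timings copied from the benchmark output above, converted to seconds.
timings = {
    "df.loc[[id]]": 489e-3,
    "df.ix[[id]]": 270e-6,
    "df.loc[id]": 139e-6,
    "df.iloc[[pos]]": 12e-3,
    "df.iloc[pos]": 98.9e-6,
}


def slowdown(slow, fast):
    """Return how many times slower `slow` is than `fast`."""
    return timings[slow] / timings[fast]


for slow, fast in [
    ("df.loc[[id]]", "df.ix[[id]]"),
    ("df.loc[[id]]", "df.loc[id]"),
    ("df.iloc[[pos]]", "df.iloc[pos]"),
]:
    print("%s is about %.0fx slower than %s" % (slow, slowdown(slow, fast), fast))
```

This gives roughly 1,800x and 3,500x for the two .loc comparisons, and a bit over 100x for the .iloc pair.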
Sergey Orshanskiy asked Dec 22 '14 05:12


1 Answer

It looks like the issue was not present in pandas 0.14. I profiled it with line_profiler, and I think I know what has happened. Since pandas 0.15.1, a KeyError is raised if none of the requested labels are present in the index. It looks like when you use the .loc[list] syntax, that check does an exhaustive search along the entire axis, even after the label has been found. That is, first, there is no early termination once an element is found and, second, the search in this case is brute-force. In the profile below, 99.9% of the time is spent on that single check.

File: .../anaconda/lib/python2.7/site-packages/pandas/core/indexing.py

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1278                                                       # require at least 1 element in the index
  1279         1          241    241.0      0.1              idx = _ensure_index(key)
  1280         1       391040 391040.0     99.9              if len(idx) and not idx.isin(ax).any():
  1281
  1282                                                           raise KeyError("None of [%s] are in the [%s]" %
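
To illustrate why a check like this can scale with the size of the axis rather than with the number of requested labels, here is a minimal pure-Python sketch. It is not the pandas implementation: any_present_full_pass, any_present_cached and the label_to_pos dict are hypothetical names made up for the illustration, and the assumption is simply that the slow path processes every axis label on each call. It is written for Python 3 rather than the Python 2 used above.

```python
import timeit


def any_present_full_pass(keys, axis_labels):
    """Membership check that hashes every axis label on each call.

    The cost grows with len(axis_labels), even for a single key.
    """
    axis_set = set(axis_labels)  # touches all labels every time
    return any(k in axis_set for k in keys)


def any_present_cached(keys, label_to_pos):
    """Membership check against a lookup table that was built once.

    The cost grows with len(keys) only.
    """
    return any(k in label_to_pos for k in keys)


if __name__ == "__main__":
    # Stand-in for a large, sorted integer index (an assumption, not real data).
    axis_labels = list(range(1, 1000001))
    label_to_pos = {label: pos for pos, label in enumerate(axis_labels)}
    keys = [500000]

    slow = timeit.timeit(lambda: any_present_full_pass(keys, axis_labels), number=5) / 5
    fast = timeit.timeit(lambda: any_present_cached(keys, label_to_pos), number=5) / 5
    print("full pass per call:     %.6f s" % slow)
    print("cached lookup per call: %.6f s" % fast)
```

Both functions return the same answer; only the amount of work per call differs, which is the kind of gap the profile above suggests for .loc[list] versus .loc[value].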
Sergey Orshanskiy answered Sep 20 '22 21:09