Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the computational complexity of .iloc() in pandas dataframes?

I'm trying to understand what's the execution complexity of the iloc function in pandas. I read the following Stack Exchange thread (Pandas DataFrame search is linear time or constant time?) that:

"accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows"

mentioning that iloc runs on O(m) time. What is m (linear, log, constant,...)?

Some experiments I ran:

import pandas as pd
>>> a = pd.DataFrame([[1,2,3],[1,3,4],[2,3,4],[2,4,5]], columns=['a','b','c'])
>>> a = a.set_index('a').sort_index()
>>> a
  b c
a
1 3 4
1 4 5
2 2 3
2 3 4

>>> a.iloc[[0,1,2,3]]
  b c
a
1 3 4
1 4 5
2 2 3
2 3 4

So iloc clearly works with offsets and not on the integer-based index (column a). Even if we delete few rows at the top, the iloc offset-based lookup works correctly:

>>> a.drop([1]).iloc[[0,1]]
  b c
a
2 2 3
2 3 4

So why isn't iloc offset-lookup running on a comparable time to numpy arrays when each column is simply a numpy array that can be accessed in constant time (few operations)? And what's its complexity?

UPDATE:

I tried to compare the efficiency of pandas vs numpy on a 10000000x2 matrix. Comparing the efficiency of a value increment per row in a DataFrame df and an array arr, with and without a for loop:

# Initialization
SIZE = 10000000
arr = np.ones((SIZE,2), dtype=np.uint32)
df = pd.DataFrame(arr)

# numpy, no for-loop
arr[range(SIZE),1] += 1

# pandas, no for-loop
df.iloc[range(SIZE),1] += 1

# numpy, for-loop
for i in range(SIZE):
  arr[i,1] += 1

# pandas, for-loop
for i in range(SIZE):
  df.iloc[i,1] += 1
Method Execution time
numpy, no for-loop 7 seconds
pandas, no for-loop 24 seconds
numpy, with for-loop 27 seconds
pandas, with for-loop > 2 hours
like image 275
Bruno Maga Avatar asked Aug 19 '20 10:08

Bruno Maga


1 Answers

There likely isn't one answer for the runtime complexity of iloc. The method accepts a huge range of input types, and that flexibility necessarily comes with costs. These costs are likely to include both large constant factors and non-constant costs that are almost certainly dependent on the way in which it is used.

One way to sort of answer your question is to step through the code in the two cases.

Indexing with range

First, indexing with range(SIZE). Assuming df is defined as you did, you can run:

import pdb
pdb.run('df.iloc[range(SIZE), 1]')

and then step through the code to follow the path. Ultimately, this arrives at this line:

self._values.take(indices)

where indices is an ndarray of integers constructed from the original range, and self._values is the source ndarray of the data frame.

There are two things to note about this. First, the range is materialized into an ndarray, which means you have a memory allocation of at least SIZE elements. So...that's going to cost you some time :). I don't know how the indexing happens in NumPy itself, but given the time measurements you've produced, it's possible that there is no (or a much smaller) allocation happening.

The second thing to note is that numpy.take makes a copy. You can verify this by looking at the .flags attribute of the object returned from calling this method, which indicates that it owns its data and is not a view onto the original. (Also note that np.may_share_memory returns False.) So...there's another allocation there :).

Take home: It's not obvious that there's any non-linear runtime here, but there are clearly large constant factors. Multiple allocations are probably the big killer, but the complex branching logic in the call tree under the .iloc property surely doesn't help.

Indexing in a loop

The code taken in this path is much shorter. It pretty quickly arrives here:

return self.obj._get_value(*key, takeable=self._takeable)

The really crappy runtime here is probably due to that tuple-unpacking. That repeatedly unpacks and repacks key as a tuple on each loop iteration. (Note that key is already the tuple (i, 1), so that sucks. The cost of accepting a generic iterable.)

Runtime analysis

In any case, we can get an estimate of the actual runtime of your particular use case by profiling. The following script will generate a log-spaced list of array sizes from 10 to 10e9, index with a range, and print out the time it takes to run the __getitem__ method. (There are only two such calls in the tree, so it's easy to see which is the one we care about.)

import pandas as pd
import numpy as np

import cProfile
import pstats

sizes = [10 ** i for i in range(1, 9)]
for size in sizes:
    df = pd.DataFrame(data=np.zeros((size, 2)))
    with cProfile.Profile() as pr:
        pr.enable()
        df.iloc[range(size), 1]
        pr.disable()
        stats = pstats.Stats(pr)
        print(size)
        stats.print_stats("__getitem__")

Once the output gets above the minimum resolution, you can see pretty clear linear behavior here:

Size     | Runtime
------------------
10000    | 0.002
100000   | 0.021
1000000  | 0.206
10000000 | 2.145
100000000| 24.843

So I'm not sure what sources you're referring to that talk about nonlinear runtime of indexing. They could be out of date, or considering a different code path than the one using range.

like image 68
bnaecker Avatar answered Sep 28 '22 21:09

bnaecker