Pandas DataFrame performance

Pandas is great, but I am surprised by how slow it is to retrieve a single value from a pandas.DataFrame. In the following toy example, even DataFrame.iloc is more than 100 times slower than a dictionary lookup.

The question: Is the lesson here just that dictionaries are the better way to look up values? Yes, I get that that is precisely what they were made for. But I just wonder if there is something I am missing about DataFrame lookup performance.

I realize this question is more "musing" than "asking" but I will accept an answer that provides insight or perspective on this. Thanks.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']

for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

value = dictionary[5][5]

0.130625009537

value = df.loc[5, 5]

19.4681699276

value = df.iloc[5, 5]

17.2575249672

asked Feb 28 '14 01:02

Owen

3 Answers

A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, etc, etc. But if you need to go a mile, the car wins.

For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.

Now for example -- to be more concrete -- a dict is good for accessing columns, but it is not so convenient for accessing rows.

import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''

# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']

for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

yields

value = [val[5] for col,val in dictionary.iteritems()]
25.5416321754

value = df.loc[5]
5.68071913719

value = df.iloc[5]
4.56006002426

So the dict (here a dict of dicts, since that is what df.to_dict() returns by default) is more than 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)

This is just one example of when a dict of lists would be less convenient/slower than a DataFrame.
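To make the row-versus-column asymmetry concrete, here is a minimal sketch. It assumes the default behaviour of df.to_dict(), which returns a dict keyed by column whose values are dicts keyed by row label. The small 3x4 frame and the variable names are illustrative only and do not come from the benchmark above.

```python
import numpy as np
import pandas as pd

# Small frame with distinct values so rows and columns are easy to tell apart.
df = pd.DataFrame(np.arange(12).reshape(3, 4))
dictionary = df.to_dict()  # default orient: {column: {row_label: value}}

# Column access is a single dict lookup.
column_2 = dictionary[2]

# Row access has to visit every column in a Python-level loop.
row_1_from_dict = [col_values[1] for col_values in dictionary.values()]

# The DataFrame returns the same row in one call.
row_1_from_df = df.iloc[1].tolist()

print(column_2)
print(row_1_from_dict)
print(row_1_from_df)
```

The loop over dictionary.values() is the part whose cost grows with the number of columns.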

Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use

df.loc['2000-1-1':'2000-3-31']

There is no easy analogue for that if you were to use a dict of lists. And the Python loops you would need to use to select the right rows would again be terribly slow compared to the DataFrame.
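As a rough illustration of the date-range case, the sketch below builds a hypothetical daily series for the year 2000 (the column name "value" and the data are made up) and compares the DataFrame slice with a hand-written filter over a plain dict. It is only meant to show the shape of the two approaches, not to benchmark them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per day for the year 2000.
index = pd.date_range("2000-01-01", "2000-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(index))}, index=index)

# Label-based slicing on a DatetimeIndex includes both endpoints.
first_quarter = df.loc["2000-1-1":"2000-3-31"]

# Rough dict equivalent: a Python loop that tests every key.
as_dict = df["value"].to_dict()  # {Timestamp: value}
start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2000-03-31")
first_quarter_dict = {k: v for k, v in as_dict.items() if start <= k <= end}

print(len(first_quarter), len(first_quarter_dict))
```

Both selections cover the same days, but the dict version has to spell out the comparison and touch every key.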

answered Sep 19 '22 12:09

unutbu

It seems the performance difference is much smaller now (pandas 0.21.1; I forgot which version of pandas the original example used). Not only has the gap between dictionary access and .loc narrowed (from about 335 times slower to about 126 times slower), but loc (iloc) is now less than twice as slow as at (iat).

In [1]: import numpy, pandas
   ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
   ...: dictionary = df.to_dict()

In [2]: %timeit value = dictionary[5][5]
85.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: %timeit value = df.loc[5, 5]
10.8 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %timeit value = df.at[5, 5]
6.87 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit value = df.iloc[5, 5]
14.9 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit value = df.iat[5, 5]
9.89 µs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: print(pandas.__version__)
0.21.1

---- Original answer below ----

+1 for using at or iat for scalar operations. Example benchmark:

In [1]: import numpy, pandas
   ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
   ...: dictionary = df.to_dict()

In [2]: %timeit value = dictionary[5][5]
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 310 ns per loop

In [4]: %timeit value = df.loc[5, 5]
10000 loops, best of 3: 104 µs per loop

In [5]: %timeit value = df.at[5, 5]
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.26 µs per loop

In [6]: %timeit value = df.iloc[5, 5]
10000 loops, best of 3: 98.8 µs per loop

In [7]: %timeit value = df.iat[5, 5]
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.58 µs per loop

It seems using at (iat) is about 10 times faster than loc (iloc).
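For readers without IPython, the comparison can be approximated with the standard timeit module. This is only a sketch: the loop count of 10,000 is an arbitrary choice, wrapping each access in a lambda adds a small constant call overhead to every measurement, and the absolute numbers will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))
dictionary = df.to_dict()

# Each callable fetches the same scalar through a different access path.
statements = {
    "dict": lambda: dictionary[5][5],
    "loc": lambda: df.loc[5, 5],
    "at": lambda: df.at[5, 5],
    "iloc": lambda: df.iloc[5, 5],
    "iat": lambda: df.iat[5, 5],
}

number = 10_000  # arbitrary loop count, kept small so the script finishes quickly
timings = {}
for name, stmt in statements.items():
    # Best of 3 repeats, converted to microseconds per call.
    best = min(timeit.repeat(stmt, repeat=3, number=number))
    timings[name] = best / number * 1e6
    print(f"{name:>5}: {timings[name]:.3f} µs per call")
```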

answered Sep 17 '22 12:09

joon

I encountered the same problem. You can use at to improve scalar lookup performance.

"Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures."

See the official reference, http://pandas.pydata.org/pandas-docs/stable/indexing.html, section "Fast scalar value getting and setting".
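A minimal sketch of scalar getting and setting with at and iat, using the same 10x10 frame of zeros as the question. The values 1.5 and 2.5 are arbitrary and chosen only so the assignments are visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))

# at: label-based scalar access; iat: position-based scalar access.
# With the default integer labels used here, both address the same cell.
value_by_label = df.at[5, 5]
value_by_position = df.iat[5, 5]

# Both accessors also support assignment of a single value.
df.at[5, 5] = 1.5
df.iat[0, 0] = 2.5

print(value_by_label, value_by_position, df.at[5, 5], df.iat[0, 0])
```

Note that at and iat only handle a single cell; slices and boolean masks still go through loc and iloc.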

answered Sep 16 '22 12:09

user3566825

Tags: python, dictionary, pandas