Does pandas iterrows have performance issues?

I have noticed very poor performance when using iterrows from pandas.

Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

This discussion on GitHub led me to believe it is caused by mixing dtypes in the dataframe; however, the simple example below shows the slowdown is there even when using a single dtype (float64). It takes 36 seconds on my machine:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

Why are vectorized operations like apply so much quicker? I imagine there must be some row by row iteration going on there too.

I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes.

A simplified version of what I want to run:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)

#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2']*t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.loc[row_index, :] = optimize(t2info, row['number1'])

asked Jul 21 '14 by KieranPC

5 Answers

Here's how to solve your problem. This is all vectorized.

In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1
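
Putting those steps together, a minimal sketch of the same idea as a plain script (this assumes the table1/table2 defined in the question; the trailing reset_index is purely cosmetic):

# Merge, compute the product, then keep the best row per letter
df = table1.merge(table2, on='letter')
df['calc'] = df['number1'] * df['number2']
table3 = df.loc[df.groupby('letter')['calc'].idxmax(), ['letter', 'number2']].reset_index(drop=True)
print(table3)
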
answered by Jeff

Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:

  • Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) perform calculations in C (often via Cython). The type of the array is determined only at the start of the iteration; this alone is one of the biggest savings.

  • Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) are going to have few cache misses.

  • More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits to parallelism won't be as dramatic as the static typing and better caching, but they're still a solid win.

Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)
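
A minimal sketch of the gap described above, comparing a plain Python loop with the equivalent Numpy vector operation (exact timings depend on your machine, but the vectorized version is typically orders of magnitude faster):

import time
import numpy as np

a = np.random.randn(2000000)
b = np.random.randn(2000000)

# Scalar loop: per-element type checks and Python object overhead
start = time.time()
total = 0.0
for x, y in zip(a, b):
    total += x * y
print('python loop:', time.time() - start)

# Vectorized: a single typed, cache-friendly pass in C
start = time.time()
total = np.dot(a, b)
print('numpy dot:  ', time.time() - start)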

answered by chrisaycock


DO NOT use iterrows!

...Or iteritems, or itertuples. Seriously, don't. Wherever possible, seek to vectorize your code. If you don't believe me, ask Jeff.

I will concede that there are legitimate use cases for iterating over a DataFrame, but there are far better alternatives for iteration than the iter* family of functions, namely

  • cython/numba
  • list comprehensions, and
  • (in rare cases) apply.

Too many pandas beginners ask questions involving code that has something to do with iterrows. Since these new users are likely not familiar with the concept of vectorization, they envision the code that solves their problem as something that involves loops or other iterative routines. Not knowing how to iterate either, they usually end up at this question and learn all the wrong things.


Supporting Arguments

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

If that doesn't convince you, take a look at the performance comparison between vectorized and non-vectorized techniques for adding two columns "A + B", taken from my post here.

Benchmarking code, for your reference. iterrows is by far the worst of the lot; the other iterative methods aren't much better.

The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
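
The original benchmark plot is not reproduced here, but a small sketch of the "A + B" comparison it refers to (column names are illustrative) looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100000), 'B': np.random.randn(100000)})

# Vectorized: stays inside pandas/NumPy, by far the fastest
df['C'] = df['A'] + df['B']

# List comprehension over raw values: a reasonable fallback when you must loop
df['C'] = [a + b for a, b in zip(df['A'], df['B'])]

# iterrows: constructs a Series for every row -- the slowest of the lot
df['C'] = [row['A'] + row['B'] for _, row in df.iterrows()]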


In Conclusion

Always seek to vectorize. When the nature of your problem or data makes that impossible, look for better iterative routines than iterrows. There is almost never a legitimate use case for iterrows beyond convenience with an extremely small number of rows; otherwise, be prepared for a lot of waiting, potentially hours, while your code runs.

Check out the links below to determine the best method/vectorized routine to solve your problem.

  • 10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.

  • Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations

answered by cs95


Another option is to use to_records(), which is faster than both itertuples and iterrows.

But in your case, there is much room for other kinds of improvement.

Here's my final optimized version:

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

Benchmark test:

-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuples() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

Full code:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuples() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Avoid multiplication --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)

The final version is almost 10x faster than the original code. The strategy is:

  1. Use groupby to avoid repeated comparisons of values.
  2. Use to_records to access raw numpy record arrays (see the small illustration below).
  3. Don't update the DataFrame until you have collected all the data.
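
For point 2, a tiny illustration (using the question's table2) of why to_records helps: it hands you a plain numpy record array instead of building a Series for every row.

rec = table2.to_records()
print(rec[0])           # something like (0, 'a', 0.2): index, letter, number2
print(rec.number2[:3])  # plain numpy array access, no per-row Series construction
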
answered by Polor Beer


Details in this video.

Benchmark: chart comparing row-iteration speeds (image not reproduced here).

answered by artoby