I want to perform my own complex operations on financial data in dataframes in a sequential manner. For example I am using the following MSFT CSV file taken from Yahoo Finance: <pre class="prettyprint"><code>Date,Open,High,Low,Close,Volume,Adj Close 2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13 2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31 2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98 2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27 .... </code></pre> I then do the following: <pre class="prettyprint"><code>#!/usr/bin/env python from pandas import * df = read_csv('table.csv') for i, row in enumerate(df.values): date = df.index[i] open, high, low, close, adjclose = row #now perform analysis on open/close based on date, etc.. </code></pre> Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? <code>df.iteritems</code> unfortunately only iterates column by column.

The newest versions of pandas now include a built-in function for iterating over rows. <pre class="prettyprint"><code>for index, row in df.iterrows(): # do some logic here </code></pre> Or, if you want it faster use <code>itertuples()</code> But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming. <pre class="prettyprint"><code>t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)}) B = [] C = [] A = time.time() for i,r in t.iterrows(): C.append((r['a'], r['b'])) B.append(time.time()-A) C = [] A = time.time() for ir in t.itertuples(): C.append((ir[1], ir[2])) B.append(time.time()-A) C = [] A = time.time() for r in zip(t['a'], t['b']): C.append((r[0], r[1])) B.append(time.time()-A) print B </code></pre> Result: <pre class="prettyprint"><code>[0.5639059543609619, 0.017839908599853516, 0.005645036697387695] </code></pre> This is probably not the best way to measure the time consumption but it's quick for me. Here are some pros and cons IMHO: <ul> <li>.iterrows(): return index and row items in separate variables, but significantly slower</li> <li>.itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index</li> <li>zip: quickest, but no access to index of the row</li> </ul> <h3>EDIT 2020/11/10</h3> For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4) <pre class="prettyprint lang-py prettyprint-override"><code>import sys import tqdm import time import pandas as pd B = [] t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)}) for _ in tqdm.tqdm(range(10)): C = [] A = time.time() for i,r in t.iterrows(): C.append((r['a'], r['b'])) B.append({"method": "iterrows", "time": time.time()-A}) C = [] A = time.time() for ir in t.itertuples(): C.append((ir[1], ir[2])) B.append({"method": "itertuples", "time": time.time()-A}) C = [] A = time.time() for r in zip(t['a'], t['b']): C.append((r[0], r[1])) B.append({"method": "zip", "time": time.time()-A}) C = [] A = time.time() for r in zip(*t.to_dict("list").values()): C.append((r[0], r[1])) B.append({"method": "zip + to_dict('list')", "time": time.time()-A}) C = [] A = time.time() for r in t.to_dict("records"): C.append((r["a"], r["b"])) B.append({"method": "to_dict('records')", "time": time.time()-A}) A = time.time() t.agg(tuple, axis=1).tolist() B.append({"method": "agg", "time": time.time()-A}) A = time.time() t.apply(tuple, axis=1).tolist() B.append({"method": "apply", "time": time.time()-A}) print(f'Python {sys.version} on {sys.platform}') print(f"Pandas version {pd.__version__}") print( pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean") ) ## Output Python 3.7.9 (default, Oct 13 2020, 10:58:24) [Clang 12.0.0 (clang-1200.0.32.2)] on darwin Pandas version 1.1.4 mean std method zip + to_dict('list') 0.002353 0.000168 zip 0.003381 0.000250 itertuples 0.007659 0.000728 to_dict('records') 0.025838 0.001458 agg 0.066391 0.007044 apply 0.067753 0.006997 iterrows 0.647215 0.019600 </code></pre>

You have three options: By index (simplest): <pre class="prettyprint"><code>>>> for index in df.index: ... print ("df[" + str(index) + "]['B']=" + str(df['B'][index])) </code></pre> With iterrows (most used): <pre class="prettyprint"><code>>>> for index, row in df.iterrows(): ... print ("df[" + str(index) + "]['B']=" + str(row['B'])) </code></pre> With itertuples (fastest): <pre class="prettyprint"><code>>>> for row in df.itertuples(): ... print ("df[" + str(row.Index) + "]['B']=" + str(row.B)) </code></pre> Three options display something like: <pre class="prettyprint"><code>df[0]['B']=125 df[1]['B']=415 df[2]['B']=23 df[3]['B']=456 df[4]['B']=189 df[5]['B']=456 df[6]['B']=12 </code></pre> Source: alphons.io

What is the most efficient way to loop through dataframes with pandas?

Tags:

performance

python

for-loop

pandas

dataframe

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27

....

I then do the following:

#!/usr/bin/env python
from pandas import *

df = read_csv('table.csv')

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, adjclose = row
    #now perform analysis on open/close based on date, etc..

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.

744

asked Oct 20 '11 14:10

Muppet

5 Answers

The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():

    # do some logic here

Or, if you want it faster use itertuples()

But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

159

answered Oct 14 '22 05:10

Nick Crawford

Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

pct_change = close[1:]/close[:-1]

This computes the entire array of percent changes as one statement, instead of

pct_change = []
for row in close:
    pct_change.append(...)

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.

167

answered Oct 14 '22 07:10

unutbu

Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))    
B.append(time.time()-A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)

print B

Result:

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption but it's quick for me.

Here are some pros and cons IMHO:

.iterrows(): return index and row items in separate variables, but significantly slower
.itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index
zip: quickest, but no access to index of the row

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4)

import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)

## Output

Python 3.7.9 (default, Oct 13 2020, 10:58:24) 
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method                                   
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600

129

answered Oct 14 '22 07:10

Richard Wong

You can loop through the rows by transposing and then calling iteritems:

for date, row in df.T.iteritems():
   # do some logic here

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)

    for i from 0 <= i < n:
        foo = close[i] - open[i] # will be extremely fast

I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is-- if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.

answered Oct 14 '22 05:10

Wes McKinney

You have three options:

By index (simplest):

>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))

With iterrows (most used):

>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))

With itertuples (fastest):

>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))

Three options display something like:

df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12

Source: alphons.io

answered Oct 14 '22 06:10

Fifi

Related questions
                            
                                How can I create an object and add attributes to it?
                            
                                python .replace() regex [duplicate]
                            
                                How are Python's Built In Dictionaries Implemented?
                            
                                Why doesn't os.path.join() work in this case?
                            
                                Why do I get AttributeError: 'NoneType' object has no attribute 'something'?
                            
                                How can I check if a key exists in a dictionary? [duplicate]
                            
                                How to get last items of a list in Python?
                            
                                Can "list_display" in a Django ModelAdmin display attributes of ForeignKey fields?
                            
                                Why is IoC / DI not common in Python?
                            
                                Python String and Integer concatenation [duplicate]
                            
                                Convert datetime object to a String of date only in Python
                            
                                Python 3: ImportError "No Module named Setuptools"
                            
                                Store output of subprocess.Popen call in a string [duplicate]
                            
                                Is a Python list guaranteed to have its elements stay in the order they are inserted in?
                            
                                Adding a legend to PyPlot in Matplotlib in the simplest manner possible
                            
                                Why is "except: pass" a bad programming practice?
                            
                                Why does range(start, end) not include end?
                            
                                Why do many examples use `fig, ax = plt.subplots()` in Matplotlib/pyplot/python
                            
                                'too many values to unpack', iterating over a dict. key=>string, value=>list
                            
                                Secondary axis with twinx(): how to add to legend?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With