
best way to iterate through elements of pandas Series

Tags:

python

pandas

All of the following seem to work for iterating through the elements of a pandas Series. I'm sure there are more ways of doing it. What are the differences, and which is the best way?

import pandas


arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])

398

asked Aug 05 '21 18:08

d.b

5 Answers

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.

However if Series iteration is absolutely necessary, performance will depend on the dtype and index:

| Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
|---|---|---|---|
| Unneeded | `in s.to_numpy()` | `in s.array` | `in s` |
| Default | `in enumerate(s.to_numpy())` | `in enumerate(s.array)` | `in s.items()` |
| Custom | `in zip(s.index, s.to_numpy())` | `in s.items()` | `in s.items()` |
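The "no index needed" row of the table can be sketched as a small helper. This is only an illustration: the name `iter_values` is hypothetical, and treating every `ExtensionDtype` as a "pandas dtype" is an assumption (the answer's own timings put `string` with the numpy-friendly dtypes).

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def iter_values(s: pd.Series):
    """Yield the elements of s from the container the table suggests (hypothetical helper)."""
    if isinstance(s.dtype, ExtensionDtype):
        # pandas-only dtype (Categorical, Period, ...): iterate the pandas array
        yield from s.array
    else:
        # numpy dtype: iterate the underlying ndarray
        yield from s.to_numpy()


ints = pd.Series([1, 1, 2, 3])
cats = pd.Series(["a", "b", "a"], dtype="category")

int_values = list(iter_values(ints))
cat_values = list(iter_values(cats))
```

Dispatching on the dtype once, outside the loop, keeps the per-element cost unchanged.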

For numpy-based Series, use `s.to_numpy()`

1. If the Series is a python or numpy dtype, it's usually fastest to iterate the underlying numpy ndarray:
```
for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
    ...
```
[Figures: iteration timings (no index) for datetime, int, float, float + nan, str, and string Series]

2. To access the index, it's actually fastest to `enumerate()` or `zip()` the numpy ndarray:
```
for i, el in enumerate(s.to_numpy()): # if default range index
    ...
```
```
for i, el in zip(s.index, s.to_numpy()): # if custom index
    ...
```
Both are faster than the idiomatic `s.items()` / `s.iteritems()` (note that `s.iteritems()` was deprecated in pandas 1.5 and removed in 2.0; the timings here were taken on pandas 1.3.1):

[Figure: iteration timings for datetime Series (with index)]

3. To micro-optimize, switch to `s.tolist()` for shorter `int`/`float`/`str` Series:
```
for el in s.to_numpy(): # if >100K elements
    ...
```
```
for el in s.tolist(): # to micro-optimize if <100K elements
    ...
```
Warning: Do not use `list(s)`; it doesn't use compiled code, which makes it slower.
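As a minimal sketch of the patterns in this section, assuming a small float Series with a custom string index (the data and variable names are illustrative, not from the benchmark):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

# Custom index: pair the index with the ndarray instead of calling s.items()
pairs_zip = [(i, el) for i, el in zip(s.index, s.to_numpy())]
pairs_items = [(i, el) for i, el in s.items()]

# Default range index case: enumerate() supplies the positions
positions = [i for i, _ in enumerate(s.to_numpy())]

# to_numpy() yields numpy scalars, tolist() yields built-in Python scalars
numpy_type = type(s.to_numpy()[0]).__name__
python_type = type(s.tolist()[0]).__name__
```

Both pairings produce the same (label, value) tuples. The only visible difference is the scalar type, which matters if downstream code expects plain Python floats.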

For pandas-based Series, use `s.array` or `s.items()`

Pandas extension dtypes contain extra (meta)data, e.g.:

| pandas dtype | contents |
|---|---|
| `Categorical` | 2 arrays |
| `DatetimeTZ` | array + timezone metadata |
| `Interval` | 2 arrays |
| `Period` | array + frequency metadata |
| ... | ... |

Converting these extension arrays to numpy "may be expensive" since it could involve copying/coercing the data, so:

1. If the Series is a pandas extension dtype, it's generally fastest to iterate the underlying pandas array:
```
for el in s.array: # if dtype is pandas-only extension
    ...
```
For example, with ~100 unique `Categorical` values:

[Figure: iteration timings for Categorical Series (no index)]

[Figures: iteration timings (no index) for DatetimeTZ, Period, and Interval Series]

2. To access the index, the idiomatic `s.items()` is very fast for pandas dtypes:
```
for i, el in s.items(): # if need index for pandas-only dtype
    ...
```
[Figures: iteration timings (with index) for DatetimeTZ, Interval, and Period Series]

3. To micro-optimize, switch to the slightly faster `enumerate()` for default-indexed `Categorical` arrays:
```
for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
    ...
```
[Figure: iteration timings for Categorical Series (with index)]
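To make the extension-dtype case concrete, here is a small sketch using a `Period` Series (the three-day range is an arbitrary example). It shows what each access path hands back, without making any timing claim:

```python
import pandas as pd

s = pd.Series(pd.period_range("2021-01-01", periods=3, freq="D"))

# Iterating the extension array yields Period scalars directly
elements = list(s.array)

# Converting to numpy first boxes every element into an object-dtype ndarray
converted = s.to_numpy()

# The idiomatic s.items() pairs each index label with its element
pairs = list(s.items())
```

The object-dtype conversion is the extra step that `s.array` avoids for this dtype.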

Caveats

1. Avoid using `s.values`:
   - Use `s.to_numpy()` to get the underlying numpy ndarray
   - Use `s.array` to get the underlying pandas array
2. Avoid modifying the iterated Series:

   "You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!"
3. Avoid iterating manually whenever possible by instead:
   1. Vectorizing, (boolean) indexing, etc.
   2. Applying functions, e.g.:
      - `s.apply(some_function)`
      - `s.agg(['min', 'max', 'mean'])`
      - `s.transform([np.sqrt, np.exp])`

      Note: These are not vectorizations despite the common misconception.
   3. Offloading to cython/numba
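The third caveat can be illustrated with a toy computation, `sqrt(x) + 1`, written three ways. The data and variable names are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Manual iteration: one Python-level operation per element
looped = pd.Series([np.sqrt(el) + 1 for el in s.to_numpy()])

# apply(): tidier, but still calls the function once per element
applied = s.apply(lambda el: np.sqrt(el) + 1)

# Vectorized: a single numpy operation over the whole Series
vectorized = np.sqrt(s) + 1

# Boolean indexing replaces an explicit "if" inside a loop
large = s[s > 4]
```

All three produce the same values. Only the vectorized form hands the whole computation to compiled code.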

Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code below

import pandas as pd
import numpy as np

n = 100_000  # Series length; left undefined in the original, so this value is only an example

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))  # randn takes dimensions positionally, not size=
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
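The answer does not include its timing harness. One way such a comparison might be set up with the standard-library `timeit` module is sketched below; the Series size, repeat counts, and the `candidates` dictionary are assumptions, and results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000  # hypothetical size; adjust to match the Series you care about
s = pd.Series(np.random.randint(1_000_000, size=n))

# Each candidate consumes every element once
candidates = {
    "in s": lambda: [el for el in s],
    "in s.array": lambda: [el for el in s.array],
    "in s.to_numpy()": lambda: [el for el in s.to_numpy()],
    "in s.tolist()": lambda: [el for el in s.tolist()],
    "in s.items()": lambda: [el for _, el in s.items()],
}

# Best of 3 repeats, 5 passes each; lower is better
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} {seconds:.4f} s")
```

Taking the minimum over repeats reduces noise from other processes. Swapping `s` for one of the generated Series above would test other dtypes.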

133

answered Oct 17 '22 21:10

tdy

Use items:

for i, v in arr.items():
    print(f'index: {i} and value: {v}')

Output:

index: 0 and value: 1
index: 1 and value: 1
index: 2 and value: 1
index: 3 and value: 2
index: 4 and value: 2
index: 5 and value: 2
index: 6 and value: 3
index: 7 and value: 3

answered Oct 17 '22 20:10

Scott Boston

The test results are as follows. The plain loop is the slowest. `iterrows()` is optimized for pandas DataFrames and is significantly faster than the plain loop. `apply()` also loops over rows, but it is more efficient than `iterrows()` because it relies on internal optimizations such as Python iterators. Vectorized operations on numpy arrays run fastest, followed by vectorized operations on pandas Series. Since vectorization operates on the whole sequence at once, it saves the most time. Numpy uses precompiled C code under the hood and avoids much of the overhead of pandas Series operations, so numpy array operations are faster than the equivalent pandas Series operations (0.26 vs. 0.33 in the timings below).

loop: 1.80301690102 
iterrows: 0.724927186966 
apply: 0.645957946777
pandas series: 0.333024024963 
numpy array: 0.260366916656

Fastest to slowest, per the timings above: numpy array > pandas series > apply > iterrows > loop

answered Oct 17 '22 20:10

lazy

Ways to iterate through pandas/python

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

#Using Python range() method
for i in range(len(arr)):
    print(arr[i])

`range()` doesn't include the end value in the sequence, so `range(len(arr))` covers positions 0 through `len(arr) - 1`.

#List Comprehension
print([arr[i] for i in range(len(arr))])

A list comprehension works with any iterable, whether the input is a list, string, or tuple.

# Using Python enumerate() method
for i, el in enumerate(arr):
    print(el)

# Using Python NumPy module
import numpy as np
print(np.arange(len(arr)))
for idx, el in np.ndenumerate(arr):
    print(el)

`enumerate` is widely used because it adds a counter to a list or any other iterable and returns it as an enumerate object. This removes the overhead of keeping a count of the elements during iteration, although you don't need a counter here. You could use `np.ndenumerate()` to mimic the behavior of `enumerate` for numpy arrays. For very large n-dimensional lists, it is advisable to use numpy.

You can also use a traditional `for` loop or a `while` loop:

x = 0
while x < len(arr):
    print(arr[x])
    x += 1

# Using lambda function
list(map(lambda x: x, arr))

A lambda reduces the lines of code and can be used alongside `filter`, `reduce`, or `map`.

If you want to iterate through the rows of a DataFrame rather than a Series, you could use `iterrows`, `itertuples`, and `iteritems`. The best way in terms of memory and computation is to treat the columns as vectors and perform vector computations using numpy arrays. Loops are very expensive on big data. It's easier and quicker to convert the columns to numpy arrays and work on those.

answered Oct 17 '22 22:10

Sonia Samipillai

I believe it is more important to understand the requirement than to focus on cosmetics when looking for a solution.

In my opinion, the choice doesn't cost much unless the data is huge, in which case we have to be selective in our approach. For small datasets, any of the approaches mentioned below will be fine.

There are good explanations in PEP 469, PEP 3106, and "Views And Iterators Instead Of Lists".

In Python 3, dictionaries have a single method named `items()`. It returns a dynamic view rather than a list, so it is fast and reflects changes made to the dictionary. Note that `dict.iteritems()` was removed in Python 3. pandas kept `Series.iteritems()` alongside `Series.items()` until it was deprecated in pandas 1.5 and removed in 2.0.

One can have a look at the Python3 Wiki page Built-In_Changes for more details.
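Because `iteritems()` is only present on older pandas versions, code that has to run on both can rely on `items()` alone. A minimal sketch follows; the `has_iteritems` check is only there to show the difference, not something the answer prescribes.

```python
import pandas as pd

arr = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# True on pandas versions that still ship Series.iteritems(), False on 2.0 and later
has_iteritems = hasattr(arr, "iteritems")

# items() is available in either case and yields (index, value) pairs
pairs = list(arr.items())
```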

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
$ for index, value in arr.items():
   print(f"Index : {index}, Value : {value}")

Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for index, value in arr.iteritems():
   print(f"Index : {index}, Value : {value}")
   
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for _, value in arr.iteritems():
   print(f"Index : {index}, Value : {value}")  # index is not rebound here, so the stale value 7 from the previous loop is printed

Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 3
Index : 7, Value : 3

$ for i, v in enumerate(arr):
   print(f"Index : {i}, Value : {v}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for value in arr:
   print(value)

1
1
1
2
2
2
3
3



$ for value in arr.tolist():
   print(value)

1
1
1
2
2
2
3
3

There is a good post, "How to iterate over rows in a DataFrame in Pandas". Although it is about DataFrames, it explains `items()`, `iteritems()`, etc.

Another good discussion on SO covers `items()` vs. `iteritems()`.

answered Oct 17 '22 22:10

Karn Kumar
