
best way to iterate through elements of pandas Series

Tags:

python

pandas

All of the following seem to work for iterating through the elements of a pandas Series. I'm sure there are more ways of doing it. What are the differences, and which way is best?

import pandas


arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])
asked Aug 05 '21 18:08 by d.b



5 Answers

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.

However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:

Index    | Fastest if numpy dtype        | Fastest if pandas dtype | Idiomatic
---------|-------------------------------|-------------------------|-------------
Unneeded | in s.to_numpy()               | in s.array              | in s
Default  | in enumerate(s.to_numpy())    | in enumerate(s.array)   | in s.items()
Custom   | in zip(s.index, s.to_numpy()) | in s.items()            | in s.items()

For numpy-based Series, use s.to_numpy()

  1. If the Series is a python or numpy dtype, it's usually fastest to iterate the underlying numpy ndarray:

    for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
    
    [Figures: iteration timings for datetime, int, float, float + nan, str, and string Series (no index)]
  2. To access the index, it's actually fastest to enumerate() or zip() the numpy ndarray:

    for i, el in enumerate(s.to_numpy()): # if default range index
    
    for i, el in zip(s.index, s.to_numpy()): # if custom index
    

    Both are faster than the idiomatic s.items() (or s.iteritems(), which was deprecated in pandas 1.5 and removed in 2.0):

    [Figure: iteration timings for datetime Series (with index)]
  3. To micro-optimize, switch to s.tolist() for shorter int/float/str Series:

    for el in s.to_numpy(): # if >100K elements
    
    for el in s.tolist(): # to micro-optimize if <100K elements
    

    Warning: Do not use list(s); it doesn't use compiled code, which makes it slower.
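
The three numpy-based patterns above can be sketched on a tiny Series. This is only an illustration: the Series contents and the variable names are made up here, and it shows that the patterns produce the same elements, not how fast they are.

```python
import pandas as pd

# Small illustrative Series with a custom (non-default) index.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# Values only: iterate the underlying numpy ndarray.
values = [el for el in s.to_numpy()]

# Index labels + values for a custom index: zip the index with the ndarray.
pairs = [(i, el) for i, el in zip(s.index, s.to_numpy())]

# The idiomatic s.items() yields the same (label, value) pairs.
assert pairs == list(s.items())

# Positions + values (what a default range index would give): enumerate the ndarray.
positions = [(i, el) for i, el in enumerate(s.to_numpy())]

# For shorter Series, s.tolist() converts to plain Python objects in one step.
as_list = s.tolist()

print(values, pairs, positions, as_list)
```

Which variant is actually fastest depends on the dtype and the Series length, so it is worth timing on your own data.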


For pandas-based Series, use s.array or s.items()

Pandas extension dtypes contain extra (meta)data, e.g.:

pandas dtype | contents
-------------|----------------------------
Categorical  | 2 arrays
DatetimeTZ   | array + timezone metadata
Interval     | 2 arrays
Period       | array + frequency metadata
...          | ...

Converting these extension arrays to numpy "may be expensive" since it could involve copying/coercing the data, so:

  1. If the Series is a pandas extension dtype, it's generally fastest to iterate the underlying pandas array:

    for el in s.array: # if dtype is pandas-only extension
    

    For example, with ~100 unique Categorical values:

    [Figures: iteration timings for Categorical, DatetimeTZ, Period, and Interval Series (no index)]
  2. To access the index, the idiomatic s.items() is very fast for pandas dtypes:

    for i, el in s.items(): # if need index for pandas-only dtype
    
    [Figures: iteration timings for DatetimeTZ, Interval, and Period Series (with index)]
  3. To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:

    for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
    
    [Figure: iteration timings for Categorical Series (with index)]
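
As a rough sketch of the extension-dtype patterns above, the following uses a small Categorical Series and a timezone-aware datetime Series. The data and variable names are invented for illustration, and nothing here measures speed.

```python
import pandas as pd

# Illustrative Categorical Series with the default range index.
s = pd.Series(["low", "high", "low", "mid"], dtype="category")

# s.array is the underlying pandas extension array (a Categorical here),
# so iterating it avoids converting to numpy first.
values = [el for el in s.array]

# When the index is needed, the idiomatic s.items() yields (label, value) pairs.
pairs = [(i, el) for i, el in s.items()]

# With a default range index, enumerate(s.array) yields the same pairs.
enumerated = [(i, el) for i, el in enumerate(s.array)]

# A timezone-aware Series: elements of s.array keep their timezone metadata.
ts = pd.Series(pd.date_range("2021-01-01", periods=3, tz="CET"))
tz_aware = all(el.tzinfo is not None for el in ts.array)

print(values, pairs, enumerated, tz_aware)
```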

Caveats

  1. Avoid using s.values:

    • Use s.to_numpy() to get the underlying numpy ndarray
    • Use s.array to get the underlying pandas array
  2. Avoid modifying the iterated Series:

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

  3. Avoid iterating manually whenever possible by instead:

    1. Vectorizing, (boolean) indexing, etc.

    2. Applying functions, e.g.:

      • s.apply(some_function)
      • s.agg(['min', 'max', 'mean'])
      • s.transform([np.sqrt, np.exp])

      Note: These are not vectorizations despite the common misconception.

    3. Offloading to cython/numba
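
To make the alternatives in caveat 3 concrete, here is a minimal sketch that computes the same result with a manual loop, a vectorized expression, and apply, then uses boolean indexing, agg, and transform. The Series is the one from the question; the variable names and the squaring operation are just examples.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# Manual iteration: the pattern to avoid when an alternative exists.
squares_loop = []
for el in s:
    squares_loop.append(el ** 2)

# Vectorized: the operation is applied to the whole Series at once.
squares_vec = s ** 2

# Boolean indexing: select elements without writing a loop.
greater_than_one = s[s > 1]

# apply: calls a function per element (convenient, but not a vectorization).
squares_apply = s.apply(lambda el: el ** 2)

# agg: reduce the Series with several aggregations in one call.
summary = s.agg(["min", "max", "mean"])

# transform: element-wise function returning a Series of the same length.
roots = s.transform(np.sqrt)

print(squares_vec.tolist(), greater_than_one.tolist(), summary.to_dict())
```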


Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code in snippet

import pandas as pd
import numpy as np

n = 100_000  # Series length; not defined in the original snippet, so this value is an assumption

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))  # randn takes dimensions positionally, not size=
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
# dtype=np.int64 keeps the large upper bound valid where the default integer is 32-bit
str_series = pd.Series(np.random.randint(10000000000000000, size=n, dtype=np.int64)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n, dtype=np.int64)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
answered Oct 17 '22 21:10 by tdy


Use items:

for i, v in arr.items():
    print(f'index: {i} and value: {v}')

Output:

index: 0 and value: 1
index: 1 and value: 1
index: 2 and value: 1
index: 3 and value: 2
index: 4 and value: 2
index: 5 and value: 2
index: 6 and value: 3
index: 7 and value: 3
answered Oct 17 '22 20:10 by Scott Boston


The test results are as follows. The plain loop is the slowest. iterrows() is optimized for pandas DataFrames and is significantly faster than the plain loop. apply() also loops over rows, but it is more efficient than iterrows() because it relies on a number of internal optimizations, such as iterators. Vectorized operations on numpy arrays run fastest, followed by vectorized operations on pandas Series. Because vectorization works on the whole sequence at once, it saves the most time. Numpy uses precompiled C code under the hood and avoids much of the overhead of pandas Series operations, so operating on numpy arrays is faster than operating on pandas Series.

loop: 1.80301690102
iterrows: 0.724927186966
apply: 0.645957946777
pandas series: 0.333024024963
numpy array: 0.260366916656

Fastest to slowest: numpy array > pandas series > apply > iterrows > loop
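
The benchmark code behind these numbers is not shown, so the following is only a sketch of how such a comparison could be set up. The DataFrame, the column names a and b, the "add two columns" operation, and all function names are assumptions made for illustration; absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: two integer columns to be added row by row.
df = pd.DataFrame({"a": np.arange(1_000), "b": np.arange(1_000)})


def with_loop():
    # Plain Python loop with positional lookups.
    return [df["a"].iloc[i] + df["b"].iloc[i] for i in range(len(df))]


def with_iterrows():
    # iterrows() yields (index, row Series) pairs.
    return [row["a"] + row["b"] for _, row in df.iterrows()]


def with_apply():
    # apply with axis=1 calls the function once per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1)


def with_pandas_vectorized():
    # Vectorized addition of two pandas Series.
    return df["a"] + df["b"]


def with_numpy_vectorized():
    # Vectorized addition of the underlying numpy arrays.
    return df["a"].to_numpy() + df["b"].to_numpy()


for fn in (with_loop, with_iterrows, with_apply,
           with_pandas_vectorized, with_numpy_vectorized):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.4f}s")
```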

answered Oct 17 '22 20:10 by lazy


Ways to iterate through pandas/python

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

#Using Python range() method
for i in range(len(arr)):
    print(arr[i])

range doesn’t include the end value in the sequence

#List Comprehension
print([arr[i] for i in range(len(arr))])

A list comprehension works with any iterable, such as a list, string, or tuple.

#Using Python enumerate() method
for i, el in enumerate(arr):  # i is the counter, el is the element
    print(el)
#Using Python NumPy module
import numpy as np
print(np.arange(len(arr)))
for i,j in np.ndenumerate(arr):
    print(j)

enumerate() is widely used because it adds a counter to a list or any other iterable and returns it as an enumerate object, so you don't have to keep a count of the elements yourself while iterating. You wouldn't need a counter here. You could use np.ndenumerate() to mimic the behavior of enumerate() for numpy arrays; it yields (index tuple, value) pairs. For very large n-dimensional data, it is advisable to use numpy.

You can also use a traditional for loop or a while loop:

x=0
while x<len(arr):
    print(arr[x])
    x +=1
    
#Using lambda function
list(map(lambda x:x, arr))

lambda reduces the lines of code and can be used alongside filter, reduce, or map.

If you want to iterate through the rows of a DataFrame rather than a Series, you could use iterrows, itertuples, and iteritems. The best way in terms of memory and computation is to treat the columns as vectors and perform vector computations using numpy arrays. Loops are very expensive on large data. It's easier and quicker to convert the columns to numpy arrays and work on those.
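
As a small sketch of the last point, the following computes a per-row product once with itertuples() and once as a vector computation on numpy arrays. The DataFrame and its column names (price, qty) are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame; the column names are made up for this example.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row by row: itertuples() yields one namedtuple per row.
totals_loop = [row.price * row.qty for row in df.itertuples(index=False)]

# Column-wise: multiply the underlying numpy arrays in one vectorized step.
totals_vec = df["price"].to_numpy() * df["qty"].to_numpy()

print(totals_loop, totals_vec.tolist())
```

Both give the same totals; the vectorized form avoids creating a Python object per row.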

answered Oct 17 '22 22:10 by Sonia Samipillai


I believe it is more important to understand the requirement than to focus on cosmetics when looking for a solution to a particular problem.

In my opinion, the choice doesn't cost much until the data is huge, at which point we have to be selective in our approach. For small datasets, any of the approaches below will be fine.

There are good explanations in PEP 469, PEP 3106, and Views And Iterators Instead Of Lists.

In Python 3, dictionaries have only one such method, items(). It returns a view instead of building a list, so it is fast, and the view reflects later changes to the dictionary. Note that the dict method iteritems() was removed in Python 3. pandas moved in the same direction: Series.iteritems() was deprecated in pandas 1.5 and removed in 2.0, leaving Series.items().

One can have a look at the Python 3 wiki page Built-In_Changes for more details.

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
$ for index, value in arr.items():
   print(f"Index : {index}, Value : {value}")

Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for index, value in arr.iteritems():
   print(f"Index : {index}, Value : {value}")
   
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for _, value in arr.iteritems():
   print(f"Value : {value}")

Value : 1
Value : 1
Value : 1
Value : 2
Value : 2
Value : 2
Value : 3
Value : 3

$ for i, v in enumerate(arr):
   print(f"Index : {i}, Value : {v}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3

$ for value in arr:
   print(value)

1
1
1
2
2
2
3
3



$ for value in arr.tolist():
   print(value)

1
1
1
2
2
2
3
3
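
The items() behavior discussed in this answer can be sketched as follows. The dictionary d is a made-up example; the Series is the one used throughout. The hasattr check is only a way to see which pandas generation is installed, since iteritems() exists on a Series only in pandas versions before 2.0.

```python
import pandas as pd

arr = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# Series.items() lazily yields (index, value) pairs.
pairs = list(arr.items())

# dict.items() returns a view: it reflects changes made to the dict afterwards.
d = {"a": 1}
view = d.items()
d["b"] = 2

# True on pandas < 2.0, False on pandas >= 2.0 where iteritems() was removed.
has_iteritems = hasattr(arr, "iteritems")

print(pairs[:3], list(view), has_iteritems)
```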

There is a good post, How to iterate over rows in a DataFrame in Pandas. Although it is about DataFrames, it explains items(), iteritems(), etc.

Another good discussion on SO: items & iteritems.

answered Oct 17 '22 22:10 by Karn Kumar