DataFrame. iterrows() method is used to iterate over DataFrame rows as (index, Series) pairs. Note that this method does not preserve the dtypes across rows due to the fact that this method will convert each row into a Series .
Using DataFrame.iterrows() is used to iterate over DataFrame rows. This returns (index, Series) where the index is an index of the Row and Series is data or content of each row. To get the data from the series, you should use the column name like row["Fee"] .
Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.
By using apply and specifying one as the axis, we can run a function on every row of a dataframe. This solution also uses looping to get the job done, but apply has been optimized better than iterrows , which results in faster runtimes.
DataFrame.iterrows
is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
for index, row in df.iterrows():
print(row['c1'], row['c2'])
10 100
11 110
12 120
How to iterate over rows in a DataFrame in Pandas?
Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter
" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string()
.
Do you want to compute something? In that case, search for methods in this order (list modified from here):
for
loop)DataFrame.apply()
: i) Reductions that can be performed in Cython, ii) Iteration in Python spaceDataFrame.itertuples()
and iteritems()
DataFrame.iterrows()
iterrows
and itertuples
(both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.
Appeal to Authority
The documentation page on iteration has a huge red warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].
* It's actually a little more complicated than "don't". df.iterrows()
is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.
A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
If none exists, feel free to write your own using custom Cython extensions.
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.
The formula is simple,
# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
Caveats
List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.
zip(df['A'], df['B'], ...)
instead of df[['A', 'B']].to_numpy()
as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy()
will cast the entire array to string, which may not be what you want. Fortunately zip
ping your columns together is the most straightforward workaround to this.*Your mileage may vary for the reasons outlined in the Caveats section above.
Let's demonstrate the difference with a simple example of adding two pandas columns A + B
. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above.
Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec
over vec_numpy
).
I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
Most of the analyses performed on the various alternatives to the iter family has been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows) and performance will come second to simplicity/readability of the solution.
Here is my personal preference when selecting a method to use for a problem.
For the novice:
Vectorization (when possible);
apply()
; List Comprehensions;itertuples()
/iteritems()
;iterrows()
; Cython
For the more experienced:
Vectorization (when possible);
apply()
; List Comprehensions; Cython;itertuples()
/iteritems()
;iterrows()
Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.
I do tend to go on about how bad apply
is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply
has explained in this post of mine.
Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write code with pandas that demands this level of performance that even a list comprehension cannot satisfy.
* As with any personal opinion, please take with heaps of salt!
10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations
Are for-loops in pandas really bad? When should I care? - a detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)
When should I (not) want to use pandas apply() in my code? - apply
is slow (but not as slow as the iter*
family. There are, however, situations where one can (or should) consider apply
as a serious alternative, especially in some GroupBy
operations).
* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows()
while doing something inside a for
loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is not the right thing to do.
The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.
First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.
If you still need to iterate over rows, you can use methods below. Note some important caveats which are not mentioned in any of the other answers.
DataFrame.iterrows()
for index, row in df.iterrows():
print(row["c1"], row["c2"])
DataFrame.itertuples()
for row in df.itertuples(index=True, name='Pandas'):
print(row.c1, row.c2)
itertuples()
is supposed to be faster than iterrows()
But be aware, according to the docs (pandas 0.24.2 at the moment):
dtype
might not match from row to rowBecause iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
new_df = df.apply(lambda x: x * 2)
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
See pandas docs on iteration for more details.
You should use df.iterrows()
. Though iterating row-by-row is not especially efficient since Series
objects have to be created.
While iterrows()
is a good option, sometimes itertuples()
can be much faster:
df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),'N': randint(100, 1000, (1000)), 'x': 'x'})
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With