Fastest way to iterate through a pandas dataframe?

How do I run through a dataframe and return only the rows which meet a certain condition? This condition has to be tested on previous rows and columns. For example:

          #1    #2    #3    #4
1/1/1999   4     2     4     5
1/2/1999   5     2     3     3
1/3/1999   5     2     3     8
1/4/1999   6     4     2     6
1/5/1999   8     3     4     7
1/6/1999   3     2     3     8
1/7/1999   1     3     4     1

I would like to test a few conditions for each row, and if all conditions pass, append the row to a list. For example:

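rows = []
for i in range(2, len(dataframe)):
    # value one row back in column 0 plus value two rows back in column 3
    if dataframe.iloc[i - 1, 0] + dataframe.iloc[i - 2, 3] >= 6:
        rows.append(dataframe.iloc[i])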

I may have up to 3 conditions which must all be true for the row to be returned. The way I am thinking about doing it is to build, for each condition, a list of the observations that satisfy it, and then build a separate list of the rows that appear in all three lists.

My two questions are the following:

What is the fastest way to get all of the rows that meet a certain condition based on previous rows? Looping through a dataframe of 5,000 rows seems like it may take too long, especially if up to 3 conditions have to be tested.

What is the best way to get a list of rows which meet all 3 conditions?

asked Oct 15 '13 21:10 by user1367204


1 Answer

The quickest way to select rows is to not iterate through the rows of the dataframe. Instead, create a mask (boolean array) with True values for the rows you wish to select, and then call df[mask] to select them:

mask = (df['column 0'].shift(1) + df['column 3'].shift(2) >= 6)
newdf = df[mask]
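
It may help to see what shift does in that expression. shift(n) moves the values down by n rows, so each row lines up with the value from n rows earlier. The first n entries become NaN, and a comparison against NaN evaluates to False, so rows without enough history are never selected. A minimal sketch, using a made-up Series (the values and the threshold of 50 are only for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])  # made-up values for illustration

prev1 = s.shift(1)  # value from the previous row; first entry is NaN
prev2 = s.shift(2)  # value from two rows back; first two entries are NaN

# Show the original and shifted values side by side
print(pd.DataFrame({'s': s, 'prev1': prev1, 'prev2': prev2}))

# NaN + anything is NaN, and NaN >= 50 is False,
# so the first two rows can never pass the test
mask = prev1 + prev2 >= 50
print(mask.tolist())  # [False, False, False, True]
```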

To combine more than one condition with logical-and, use &:

mask = ((...) & (...))

For logical-or use |:

mask = ((...) | (...))

For example,

In [74]: import pandas as pd

In [75]: df = pd.DataFrame({'A':range(5), 'B':range(10,20,2)})

In [76]: df
Out[76]: 
   A   B
0  0  10
1  1  12
2  2  14
3  3  16
4  4  18

In [77]: mask = (df['A'].shift(1) + df['B'].shift(2) > 12)

In [78]: mask
Out[78]: 
0    False
1    False
2    False
3     True
4     True
dtype: bool

In [79]: df[mask]
Out[79]: 
   A   B
3  3  16
4  4  18
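
The same idea can be sketched on the sample data from the question, with three conditions that must all hold. Only the first condition comes from the question; cond2 and cond3 are hypothetical, invented here to show how & combines masks. The dates are kept as plain strings, and the explicit loop at the end is there only to cross-check the mask, not as a recommended approach:

```python
import pandas as pd

# Sample data from the question; the index holds the dates as plain strings
df = pd.DataFrame(
    {
        '#1': [4, 5, 5, 6, 8, 3, 1],
        '#2': [2, 2, 2, 4, 3, 2, 3],
        '#3': [4, 3, 3, 2, 4, 3, 4],
        '#4': [5, 3, 8, 6, 7, 8, 1],
    },
    index=['1/1/1999', '1/2/1999', '1/3/1999', '1/4/1999',
           '1/5/1999', '1/6/1999', '1/7/1999'],
)

# Condition from the question: [row-1, column 0] + [row-2, column 3] >= 6
cond1 = df.iloc[:, 0].shift(1) + df.iloc[:, 3].shift(2) >= 6
# Hypothetical extra conditions, made up for this sketch
cond2 = df['#2'] >= 3           # current row, column '#2'
cond3 = df['#3'].shift(1) < 4   # previous row, column '#3'

# A row is kept only if all three conditions are True
mask = cond1 & cond2 & cond3
selected = df[mask]

matching_dates = selected.index.tolist()   # index labels of the matching rows
matching_rows = selected.values.tolist()   # the rows themselves, as lists
print(matching_dates)  # ['1/4/1999', '1/5/1999', '1/7/1999']

# Cross-check with an explicit loop over row positions
loop_dates = []
for i in range(2, len(df)):
    if (df.iloc[i - 1, 0] + df.iloc[i - 2, 3] >= 6
            and df.iloc[i, 1] >= 3
            and df.iloc[i - 1, 2] < 4):
        loop_dates.append(df.index[i])
print(loop_dates == matching_dates)  # True
```

Combining the masks with & makes the three separate lists and their intersection unnecessary. If the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because & binds more tightly than >= or <.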

answered Sep 22 '22 07:09 by unutbu