How do I run through a dataframe and return only the rows which meet a certain condition? This condition has to be tested on previous rows and columns. For example:
          #1  #2  #3  #4
1/1/1999   4   2   4   5
1/2/1999   5   2   3   3
1/3/1999   5   2   3   8
1/4/1999   6   4   2   6
1/5/1999   8   3   4   7
1/6/1999   3   2   3   8
1/7/1999   1   3   4   1
I would like to test a few conditions for each row, and if all conditions pass, append the row to a list. For example:
for row in dataframe:
    if [row-1, column 0] + [row-2, column 3] >= 6:
        append row to a list
I may have up to 3 conditions which must all be true for the row to be returned. The way I am thinking about doing it is to make a list of the rows that satisfy each condition, and then make a separate list of the rows that appear in all three lists.
My two questions are the following:
1. What is the fastest way to get all of the rows that meet a certain condition based on previous rows? Looping through a dataframe of 5,000 rows seems like it may be too slow, especially if up to 3 conditions have to be tested.
2. What is the best way to get a list of rows which meet all 3 conditions?
Vectorization is always the first and best choice. If you do have to iterate, you can convert the DataFrame to a NumPy array or to a dictionary to speed up the loop. Iterating over the key-value pairs of a dictionary is reported to be the fastest looping approach, with around a 280x speed-up on 20 million records.
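As a rough illustration of the NumPy route, here is a minimal sketch that evaluates the question's condition ([row-1, column 0] + [row-2, column 3] >= 6) on array slices instead of looping. The sample table is rebuilt from the question; the names arr, mask and selected are just illustrative.

```python
import numpy as np
import pandas as pd

# The sample table from the question.
df = pd.DataFrame(
    [[4, 2, 4, 5], [5, 2, 3, 3], [5, 2, 3, 8], [6, 4, 2, 6],
     [8, 3, 4, 7], [3, 2, 3, 8], [1, 3, 4, 1]],
    index=pd.date_range("1999-01-01", periods=7),
    columns=["#1", "#2", "#3", "#4"],
)

arr = df.to_numpy()

# Row i needs arr[i-1, 0] and arr[i-2, 3], so the first two rows
# have no complete history and are left as False.
mask = np.zeros(len(arr), dtype=bool)
mask[2:] = (arr[1:-1, 0] + arr[:-2, 3]) >= 6

# A NumPy boolean array can index the DataFrame directly.
selected = df[mask]
```

With this particular sample every row from the third onward passes, so selected holds five rows.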
The itertuples() method is another option. It is faster than iterrows() and it also preserves the data type of each column. iterrows() does not: it returns a Series for each row, so dtypes are not preserved across a row (for example, a row mixing int and float columns is upcast to float).
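The dtype difference can be seen on a tiny frame. This is only a sketch with a made-up two-column DataFrame (columns a and b are hypothetical), not a benchmark.

```python
import pandas as pd

# Hypothetical frame: one integer column, one float column.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# iterrows() yields (index, Series); the Series must hold both values
# in one dtype, so the integer from column "a" is upcast to float.
_, row_series = next(df.iterrows())

# itertuples() yields a namedtuple, so each field keeps its own type.
row_tuple = next(df.itertuples(index=False))

print(row_series.dtype)   # dtype of the whole row as a Series
print(type(row_tuple.a))  # type of the value taken from column "a"
```

If the loop only reads values, itertuples(index=False) is usually the safer of the two for this reason.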
apply() can also be sped up, reportedly by over 100x: one article takes pandas' standard DataFrame.apply function and upgrades it with a bit of Cython, cutting execution from 3 minutes to under 2 seconds.
The quickest way to select rows is to not iterate through the rows of the dataframe. Instead, create a mask (boolean array) with True values for the rows you wish to select, and then call df[mask] to select them:
mask = (df['column 0'].shift(1) + df['column 3'].shift(2) >= 6)
newdf = df[mask]
To combine more than one condition with logical-and, use &:
mask = ((...) & (...))
For logical-or use |:
mask = ((...) | (...))
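Applied to the table in the question, combining three masks might look like the sketch below. Only the first condition comes from the question; cond2 and cond3 are hypothetical conditions made up for illustration, and the columns are addressed by position with iloc because the real column labels are not known.

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame(
    [[4, 2, 4, 5], [5, 2, 3, 3], [5, 2, 3, 8], [6, 4, 2, 6],
     [8, 3, 4, 7], [3, 2, 3, 8], [1, 3, 4, 1]],
    index=pd.date_range("1999-01-01", periods=7),
    columns=["#1", "#2", "#3", "#4"],
)

# Condition from the question: [row-1, column 0] + [row-2, column 3] >= 6.
cond1 = df.iloc[:, 0].shift(1) + df.iloc[:, 3].shift(2) >= 6
# Hypothetical second condition: column #2 rose compared with the previous row.
cond2 = df["#2"] > df["#2"].shift(1)
# Hypothetical third condition: column #3 is at least 3.
cond3 = df["#3"] >= 3

# Rows without enough history compare against NaN, which evaluates to False.
mask = cond1 & cond2 & cond3

selected = df[mask]
rows = selected.to_numpy().tolist()  # plain list of the matching rows
```

Because the masks are combined element-wise, there is no need to build one list per condition and intersect them afterwards.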
For example,
In [75]: df = pd.DataFrame({'A':range(5), 'B':range(10,20,2)})

In [76]: df
Out[76]:
   A   B
0  0  10
1  1  12
2  2  14
3  3  16
4  4  18

In [77]: mask = (df['A'].shift(1) + df['B'].shift(2) > 12)

In [78]: mask
Out[78]:
0    False
1    False
2    False
3     True
4     True
dtype: bool

In [79]: df[mask]
Out[79]:
   A   B
3  3  16
4  4  18