
How to conditionally drop rows in pandas

Tags: python, pandas

I have the following dataframe:

        True_False  cum_val
Date        
2018-01-02  False   NaN
2018-01-03  False   0.006399
2018-01-04  False   0.010427
2018-01-05  False   0.017461
2018-01-08  False   0.019124
2018-01-09  False   0.020426
2018-01-10  False   0.019314
2018-01-11  False   0.026348
2018-01-12  False   0.033098
2018-01-16  False   0.029573
2018-01-17  False   0.038988
2018-01-18  False   0.037372
2018-01-19  False   0.041757
2018-01-22  False   0.049824
2018-01-23  False   0.051998
2018-01-24  False   0.051438
2018-01-25  False   0.052041
2018-01-26  False   0.063882
2018-01-29  False   0.057150
2018-01-30  True    -0.010899
2018-01-31  True    -0.010410
2018-02-01  True    -0.011058
2018-02-02  True    -0.032266
2018-02-05  True    -0.073246
2018-02-06  True    -0.055805
2018-02-07  True    -0.060806
2018-02-08  True    -0.098343
2018-02-09  True    -0.083407
2018-02-12  False   0.013915
2018-02-13  False   0.016528
2018-02-14  False   0.029930
2018-02-15  False   0.041999
2018-02-16  False   0.042373
2018-02-20  False   0.036531
2018-02-21  False   0.031035
2018-03-06  False   0.013671

How can I drop the rows starting at the second consecutive True value and ending at the second False value that follows the run of True values?

For example:

            True_False   cum_val
Date
2020-01-21       False  0.022808
2020-01-22       False  0.023097
2020-01-23        True  0.001141
2020-01-24        True -0.007901  # <- start dropping here (second True)
2020-01-27        True -0.023632
2020-01-28       False -0.013578
2020-01-29       False -0.000867  # <- stop dropping here (second False)
2020-01-30       False  0.003134

Edit 1:

I would like to add 1 more condition on the new df:

2020-01-22  0.000289    False   
2020-01-23  0.001141    True    
2020-01-27  -0.015731   True    # <- Start Drop Here
2020-01-28  0.010054    True    
2020-01-29  -0.000867   False   
2020-01-30  0.003134    True    #<-End drop here
2020-02-03  0.007255    True    

As you mentioned in the comment, the sequence here is [True, True, True, False, True].

In this case the drop should still start at the second True value, but it should stop at the row right after the first False, even though that row has toggled back to True. If the values after the second True are still True, keep dropping them until the row after the False.

Slartibartfast asked Jan 31 '20 05:01



4 Answers

Let's try using where with ffill and parameter limit=2 then boolean filtering:

df[~(df['True_False'].where(df['True_False']).ffill(limit=2).cumsum() > 1)]

Output:

|    | Date       | True_False   |   cum_val |
|----|------------|--------------|-----------|
|  0 | 2020-01-21 | False        |         1 |
|  1 | 2020-01-22 | False        |         2 |
|  2 | 2020-01-23 | True         |         3 |
|  7 | 2020-01-28 | False        |         8 |

Details:

  • First, convert the False values to np.nan using where.
  • Next, fill the first two np.nan values after the last True using ffill(limit=2).
  • Then use cumsum to count the marked rows cumulatively and select those with a count greater than 1.
  • Finally, negate the mask to keep the False rows before the run, the first True row, and everything from the third False row onward.
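The steps above can be traced one at a time. The following is only a sketch, not the answer's exact expression: it assumes the 8-row sample with a single run of True values, and it casts the flags to float first (my own variation) so that the intermediate columns have a numeric dtype. The names flags, filled, running and drop_mask are illustrative.

```python
import pandas as pd

# Sample data: 8 rows with a single run of True values (assumed for illustration).
df = pd.DataFrame({
    "Date": ["2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24",
             "2020-01-25", "2020-01-26", "2020-01-27", "2020-01-28"],
    "True_False": [False, False, True, True, True, False, False, False],
    "cum_val": [1, 2, 3, 4, 5, 6, 7, 8],
})

s = df["True_False"]

# Step 1: 1.0 where True, NaN where False.
flags = s.astype(float).where(s)

# Step 2: carry the last True forward into at most two following False rows.
filled = flags.ffill(limit=2)

# Step 3: running count of marked rows; unmarked rows stay NaN.
running = filled.cumsum()

# Step 4: rows with a count above 1 form the drop region (NaN > 1 is False).
drop_mask = running > 1

# Show every intermediate column side by side.
print(pd.DataFrame({"flag": flags, "filled": filled,
                    "running": running, "drop": drop_mask}))

result = df[~drop_mask]
print(result.index.tolist())  # [0, 1, 2, 7]
```

Note that the running count is never reset, so on data with several separate runs of True this sketch would also drop the first True of each later run. For the single-run sample it reproduces the output shown above.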
Scott Boston answered Oct 18 '22 22:10


Here's what I tried. The data I created is:

    Date    True_False  cum_val
0   2020-01-21  False   1
1   2020-01-22  False   2
2   2020-01-23  True    3
3   2020-01-24  True    4
4   2020-01-25  True    5
5   2020-01-26  False   6
6   2020-01-27  False   7
7   2020-01-28  False   8

# df is the DataFrame shown above
true_count = 0
false_count = 0
dropping = False
to_drop = []  # collect labels and drop once, instead of mutating df while iterating

for index, flag in df['True_False'].items():
    if not dropping:
        if flag:
            true_count += 1
            if true_count == 2:
                # second True seen: start dropping from this row
                dropping = True
                to_drop.append(index)
                true_count = 0
    elif flag:
        # still inside the drop region: drop every True
        to_drop.append(index)
    else:
        false_count += 1
        if false_count < 2:
            # first False after the True values is dropped as well
            to_drop.append(index)
        else:
            # second False: stop dropping and keep this row
            dropping = False
            false_count = 0

df = df.drop(to_drop)

Output

    Date    True_False  cum_val
0   2020-01-21  False   1
1   2020-01-22  False   2
2   2020-01-23  True    3
6   2020-01-27  False   7
7   2020-01-28  False   8

Vishakha Lall answered Oct 18 '22 21:10


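You could use Series.shift and Series.bfill to drop every row whose previous row is True:

# shift() introduces a NaN, which can leave the column as object dtype;
# casting back to bool makes sure ~ acts as a logical NOT
df = df[~df['True_False'].shift().bfill().astype(bool)]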

print(df)                                                               
         Date  True_False   cum_val
0  2020-01-21       False  0.022808
1  2020-01-22       False  0.023097
2  2020-01-23        True  0.001141
6  2020-01-29       False -0.000867
7  2020-01-30       False  0.003134
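To see what this mask does on data with more than one run, here is a small sketch. The sample Series is hypothetical, and it uses shift(fill_value=False) instead of shift().bfill() (an assumption on my part): that keeps the dtype boolean and always keeps the first row, whereas the bfill variant would drop the first row if it is True.

```python
import pandas as pd

# Hypothetical flags with two runs of True: rows 1-2 and row 5.
flags = pd.Series([False, True, True, False, False, True, False, False],
                  name="True_False")

# True wherever the previous row was True; the first row has no predecessor.
prev_is_true = flags.shift(fill_value=False)

# Keep only rows whose predecessor was not True.
kept = flags[~prev_is_true]
print(kept.index.tolist())  # [0, 1, 4, 5, 7]
```

Every run keeps its first True and loses the rest of the run plus the first False after it. An isolated True (row 5 here) therefore still causes the following False (row 6) to be dropped, which may or may not be what you want.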
dkhara answered Oct 18 '22 22:10


You can do:

import numpy as np

# mark the start of each area to drop: the second True of a run
df["dropit"] = np.where(df["True_False"] & df["True_False"].shift(1) & np.logical_not(df["True_False"].shift(2)), "start", None)

# mark the end of each drop area: the row after the first False that follows a True
df["dropit"] = np.where(np.logical_not(df["True_False"].shift(1)) & df["True_False"].shift(2), "end", df["dropit"])

# mark the row right after an "end" as "keep" so the forward fill stops there
df.loc[df["dropit"].shift().eq("end") & df["dropit"].ne("start"), "dropit"] = "keep"

# forward fill so every row between "start" and "end" carries the "start" marker
df["dropit"] = df["dropit"].ffill()

# drop the marked rows, then drop the helper "dropit" column
df = df.drop(df.loc[df["dropit"].isin(["start", "end"])].index, axis=0).drop("dropit", axis=1)

Outputs:

            True_False   cum_val
Date
2018-01-02       False       NaN
2018-01-03       False  0.006399
2018-01-04       False  0.010427
2018-01-05       False  0.017461
2018-01-08       False  0.019124
2018-01-09       False  0.020426
2018-01-10       False  0.019314
2018-01-11       False  0.026348
2018-01-12       False  0.033098
2018-01-16       False  0.029573
2018-01-17       False  0.038988
2018-01-18       False  0.037372
2018-01-19       False  0.041757
2018-01-22       False  0.049824
2018-01-23       False  0.051998
2018-01-24       False  0.051438
2018-01-25       False  0.052041
2018-01-26       False  0.063882
2018-01-29       False  0.057150
2018-01-30        True -0.010899
2018-02-14       False  0.029930
2018-02-15       False  0.041999
2018-02-16       False  0.042373
2018-02-20       False  0.036531
2018-02-21       False  0.031035
2018-03-06       False  0.013671
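The marker logic is easier to follow on a short column. This is a sketch under a few assumptions: the 8-row sample is hypothetical, the marker lives in a standalone Series called marker rather than in a "dropit" column, and the shifts use fill_value=False so they stay boolean (the code above uses plain shift(), which produces NaN in the first rows).

```python
import pandas as pd

# Hypothetical sample: one run of three True values.
tf = pd.Series([False, False, True, True, True, False, False, False])

# Boolean views of the previous row and the row before that.
prev1 = tf.shift(1, fill_value=False)
prev2 = tf.shift(2, fill_value=False)

start = tf & prev1 & ~prev2   # second True of a run
end = ~prev1 & prev2          # row after the first False that follows a True

marker = pd.Series([None] * len(tf), index=tf.index, dtype=object)
marker[start] = "start"
marker[end] = "end"
# The row right after an "end" stops the forward fill, unless a new run starts there.
marker[marker.shift().eq("end") & marker.ne("start")] = "keep"
marker = marker.ffill()

print(marker.tolist())
kept = tf[~marker.isin(["start", "end"])]
print(kept.index.tolist())  # [0, 1, 2, 7]
```

Rows 3 to 5 carry "start", row 6 carries "end", and row 7 carries "keep", so only rows 0, 1, 2 and 7 survive.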
Grzegorz Skibinski answered Oct 18 '22 21:10