Filter entries of pandas dataframe based on presence (absence) of entries older than a date

I have a DataFrame containing test runs, dates, and outcomes. It looks like this:

TestName;Date;IsPassed
test1;1/31/2017 9:44:30 PM;0
test1;1/31/2017 9:39:00 PM;0
test1;1/31/2017 9:38:29 PM;1
test1;1/31/2017 9:38:27 PM;1
test2;10/31/2016 5:05:02 AM;0
test3;12/7/2016 8:58:36 PM;0
test3;12/7/2016 8:57:19 PM;0
test3;12/7/2016 8:56:15 PM;0
test4;12/5/2016 6:50:49 PM;0
test4;12/5/2016 6:49:50 PM;0
test4;12/5/2016 3:23:09 AM;1
test4;12/4/2016 11:51:29 PM;1
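
For reference, the frame loads with read_csv; a minimal sketch, assuming the sample above is saved as test_runs.csv (a hypothetical filename):

import pandas as pd

# semicolon-separated sample from above; Date values are month/day/year timestamps
# ('test_runs.csv' is a hypothetical filename for the data shown above)
test_runs = pd.read_csv('test_runs.csv', sep=';', parse_dates=['Date'])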

I want to find the test names that have no runs before (or, alternatively, after) a specified date.

Of course, I could do it like this (a rough sketch follows the list):

  • Identify all unique test names
  • For each of them figure out their min and max dates
  • Based on those add the respective rows to a new DataFrame
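
A minimal loop-based sketch of those steps (assuming the frame is named test_runs, as in my update, and Date is already datetime):

rows = []
for name in test_runs.TestName.unique():
    group = test_runs[test_runs.TestName == name]
    # keep the rows carrying this group's earliest and latest run
    rows.append(group[group.Date.isin([group.Date.min(), group.Date.max()])])
result = pd.concat(rows)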

But is there any way to do that natively with Pandas, without explicit for loops?

Update

Based on the solution by @jezrael, let's say I want to keep only the tests whose runs all happened in 2016. Do I have to do it like this?

idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique()
selected = test_runs.loc[idx].Date < pd.to_datetime('2017-01-01')
tests = test_runs.loc[idx].loc[selected].TestName
print(test_runs[test_runs.TestName.isin(tests)])

Output:

   TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1
asked by Nick Slavsky

1 Answer

I think you need groupby with agg of idxmin and idxmax to return the index values of the min and max dates, then reshape to a Series with stack. It is also necessary to remove duplicates with unique, because one-row groups like test2 return the same index for both min and max.

Last, select all the rows with loc:

# convert the Date column from string to datetime
df.Date = pd.to_datetime(df.Date)
# index labels of each group's earliest and latest run, flattened and deduplicated
idx = df.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
print(idx)
[ 3  0  4  7  5 11  8]

selected = df.loc[idx]
print (selected)
   TestName                Date  IsPassed
3     test1 2017-01-31 21:38:27         1
0     test1 2017-01-31 21:44:30         0
4     test2 2016-10-31 05:05:02         0
7     test3 2016-12-07 20:56:15         0
5     test3 2016-12-07 20:58:36         0
11    test4 2016-12-04 23:51:29         1
8     test4 2016-12-05 18:50:49         0

If you need the rows in sorted index order, add numpy.sort, because the output of unique is a plain numpy array:

import numpy as np

print(df.loc[np.sort(idx)])
   TestName                Date  IsPassed
0     test1 2017-01-31 21:44:30         0
3     test1 2017-01-31 21:38:27         1
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
11    test4 2016-12-04 23:51:29         1

EDIT:

Your code looks nice; only a few improvements were added:

# only idxmax is needed here: a test ran only in 2016 iff its latest run is before 2017
idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique()
# store the selection in a variable, so it is not computed twice
df1 = test_runs.loc[idx]
# casting the comparison value to datetime is not necessary
selected = df1['Date'] < '2017-01-01'
# select rows and a column in one step with df.loc[mask, column_name]
tests = df1.loc[selected, 'TestName']
# for better performance on a large df, deduplicate with unique
print(test_runs[test_runs.TestName.isin(tests.unique())])
   TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1
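
As a side note, the same 2016 filter can be written in one line with transform, which broadcasts each group's maximum date back to all of its rows; a sketch equivalent to the code above for this data:

print(test_runs[test_runs.groupby('TestName').Date.transform('max') < '2017-01-01'])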
answered by jezrael


