Filter entries of pandas dataframe based on presence (absence) of entries older than a date

Question

I have a DataFrame containing the test runs, dates and outcomes. it looks like this:

TestName;Date;IsPassed
test1;1/31/2017 9:44:30 PM;0
test1;1/31/2017 9:39:00 PM;0
test1;1/31/2017 9:38:29 PM;1
test1;1/31/2017 9:38:27 PM;1
test2;10/31/2016 5:05:02 AM;0
test3;12/7/2016 8:58:36 PM;0
test3;12/7/2016 8:57:19 PM;0
test3;12/7/2016 8:56:15 PM;0
test4;12/5/2016 6:50:49 PM;0
test4;12/5/2016 6:49:50 PM;0
test4;12/5/2016 3:23:09 AM;1
test4;12/4/2016 11:51:29 PM;1

I want to be able to find out the test names that don't have runs before or after a specified date.

Of course I can go like this:

Identify all unique test names
For each of them figure out their min and max dates
Based on those add the respective rows to a new DataFrame

But is there any way to do that natively with Pandas, without explicit for loops?

Update

Based on the solution by @jezrael let's say I want to keep only the test runs that happened only in 2016. Then I have to do like this?

idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique()
selected = test_runs.loc[idx].Date < pd.to_datetime('2017-01-01')
tests = test_runs.loc[idx].loc[selected].TestName
print(test_runs[test_runs.TestName.isin(tests)])

Output:

TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1

jezrael · Accepted Answer

I think you need groupby with agg idxmax and idxmin for return index values of min and max dates, then reshape to Series by stack. Also is necessary remove duplicates by unique for one row groups like test2.

Last select all rows by loc:

df.Date = pd.to_datetime(df.Date)
idx = df.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
print (idx)
[ 3  0  4  7  5 11  8]

selected = df.loc[idx]
print (selected)
   TestName                Date  IsPassed
3     test1 2017-01-31 21:38:27         1
0     test1 2017-01-31 21:44:30         0
4     test2 2016-10-31 05:05:02         0
7     test3 2016-12-07 20:56:15         0
5     test3 2016-12-07 20:58:36         0
11    test4 2016-12-04 23:51:29         1
8     test4 2016-12-05 18:50:49         0

If need sorted Index add numpy.sort, because output of unique is numpy array.

print (df.loc[np.sort(idx)])
   TestName                Date  IsPassed
0     test1 2017-01-31 21:44:30         0
3     test1 2017-01-31 21:38:27         1
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
11    test4 2016-12-04 23:51:29         1

EDIT:

Your code looks nice, only some improvements was added:

idx = test_runs.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
#get output to variable, then not need select twice  
df1 = test_runs.loc[idx]
#cast to datetime is not necessary
selected = df1['Date'] < '2017-01-01'
#for selecting in DataFrame is used df[index_val, column_name]
tests = df1.loc[selected, 'TestName']
#for better performance in large df was add unique
print(test_runs[test_runs.TestName.isin(tests.unique())])
   TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1

Filter entries of pandas dataframe based on presence (absence) of entries older than a date

Tags:

python

pandas

dataframe

Nick Slavsky

1 Answers

jezrael

Recent Activity

Donate For Us

Filter entries of pandas dataframe based on presence (absence) of entries older than a date

Tags:

python

pandas

dataframe

Nick Slavsky

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us