I have a DataFrame containing the test runs, dates and outcomes. it looks like this:
TestName;Date;IsPassed
test1;1/31/2017 9:44:30 PM;0
test1;1/31/2017 9:39:00 PM;0
test1;1/31/2017 9:38:29 PM;1
test1;1/31/2017 9:38:27 PM;1
test2;10/31/2016 5:05:02 AM;0
test3;12/7/2016 8:58:36 PM;0
test3;12/7/2016 8:57:19 PM;0
test3;12/7/2016 8:56:15 PM;0
test4;12/5/2016 6:50:49 PM;0
test4;12/5/2016 6:49:50 PM;0
test4;12/5/2016 3:23:09 AM;1
test4;12/4/2016 11:51:29 PM;1
I want to be able to find out the test names that don't have runs before or after a specified date.
Of course I can go like this:
DataFrameBut is there any way to do that natively with Pandas, without explicit for loops?
Update
Based on the solution by @jezrael let's say I want to keep only the test runs that happened only in 2016. Then I have to do like this?
idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique()
selected = test_runs.loc[idx].Date < pd.to_datetime('2017-01-01')
tests = test_runs.loc[idx].loc[selected].TestName
print(test_runs[test_runs.TestName.isin(tests)])
Output:
TestName Date IsPassed
4 test2 2016-10-31 05:05:02 0
5 test3 2016-12-07 20:58:36 0
6 test3 2016-12-07 20:57:19 0
7 test3 2016-12-07 20:56:15 0
8 test4 2016-12-05 18:50:49 0
9 test4 2016-12-05 18:49:50 0
10 test4 2016-12-05 03:23:09 1
11 test4 2016-12-04 23:51:29 1
I think you need groupby with agg idxmax and
idxmin for return index values of min and max dates, then reshape to Series by stack. Also is necessary remove duplicates by unique for one row groups like test2.
Last select all rows by loc:
df.Date = pd.to_datetime(df.Date)
idx = df.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
print (idx)
[ 3 0 4 7 5 11 8]
selected = df.loc[idx]
print (selected)
TestName Date IsPassed
3 test1 2017-01-31 21:38:27 1
0 test1 2017-01-31 21:44:30 0
4 test2 2016-10-31 05:05:02 0
7 test3 2016-12-07 20:56:15 0
5 test3 2016-12-07 20:58:36 0
11 test4 2016-12-04 23:51:29 1
8 test4 2016-12-05 18:50:49 0
If need sorted Index add numpy.sort, because output of unique is numpy array.
print (df.loc[np.sort(idx)])
TestName Date IsPassed
0 test1 2017-01-31 21:44:30 0
3 test1 2017-01-31 21:38:27 1
4 test2 2016-10-31 05:05:02 0
5 test3 2016-12-07 20:58:36 0
7 test3 2016-12-07 20:56:15 0
8 test4 2016-12-05 18:50:49 0
11 test4 2016-12-04 23:51:29 1
EDIT:
Your code looks nice, only some improvements was added:
idx = test_runs.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
#get output to variable, then not need select twice
df1 = test_runs.loc[idx]
#cast to datetime is not necessary
selected = df1['Date'] < '2017-01-01'
#for selecting in DataFrame is used df[index_val, column_name]
tests = df1.loc[selected, 'TestName']
#for better performance in large df was add unique
print(test_runs[test_runs.TestName.isin(tests.unique())])
TestName Date IsPassed
4 test2 2016-10-31 05:05:02 0
5 test3 2016-12-07 20:58:36 0
6 test3 2016-12-07 20:57:19 0
7 test3 2016-12-07 20:56:15 0
8 test4 2016-12-05 18:50:49 0
9 test4 2016-12-05 18:49:50 0
10 test4 2016-12-05 03:23:09 1
11 test4 2016-12-04 23:51:29 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With