I have a Pandas dataframe containing two columns: a datetime column, and a column of integers representing station IDs. I need a new dataframe with the following modifications:
For each set of duplicate STATION_ID values, keep the row with the most recent entry for DATE_CHANGED. If the duplicate entries for a STATION_ID all contain the same DATE_CHANGED, then drop the duplicates and retain a single row for that STATION_ID. If there are no duplicates for a STATION_ID value, simply retain the row.

Dataframe (sorted by STATION_ID):
               DATE_CHANGED  STATION_ID
0       2006-06-07 06:00:00           1
1       2000-09-26 06:00:00           1
2       2000-09-26 06:00:00           1
3       2000-09-26 06:00:00           1
4       2001-06-06 06:00:00           2
5       2005-07-29 06:00:00           2
6       2005-07-29 06:00:00           2
7       2001-06-06 06:00:00           2
8       2001-06-08 06:00:00           4
9       2003-11-25 07:00:00           4
10      2001-06-12 06:00:00           7
11      2001-06-04 06:00:00           8
12      2017-04-03 18:36:16           8
13      2017-04-03 18:36:16           8
14      2017-04-03 18:36:16           8
15      2001-06-04 06:00:00           8
16      2001-06-08 06:00:00          10
17      2001-06-08 06:00:00          10
18      2001-06-08 06:00:00          11
19      2001-06-08 06:00:00          11
20      2001-06-08 06:00:00          12
21      2001-06-08 06:00:00          12
22      2001-06-08 06:00:00          13
23      2001-06-08 06:00:00          13
24      2001-06-08 06:00:00          14
25      2001-06-08 06:00:00          14
26      2001-06-08 06:00:00          15
27      2017-08-07 17:48:25          15
28      2001-06-08 06:00:00          15
29      2017-08-07 17:48:25          15
...                     ...         ...
157066  2018-08-06 14:11:28       71655
157067  2018-08-06 14:11:28       71656
157068  2018-08-06 14:11:28       71656
157069  2018-09-11 21:45:05       71664
157070  2018-09-11 21:45:05       71664
157071  2018-09-11 21:45:05       71664
157072  2018-09-11 21:41:04       71664
157073  2018-08-09 15:22:07       71720
157074  2018-08-09 15:22:07       71720
157075  2018-08-09 15:22:07       71720
157076  2018-08-23 12:43:12       71899
157077  2018-08-23 12:43:12       71899
157078  2018-08-23 12:43:12       71899
157079  2018-09-08 20:21:43       71969
157080  2018-09-08 20:21:43       71969
157081  2018-09-08 20:21:43       71969
157082  2018-09-08 20:21:43       71984
157083  2018-09-08 20:21:43       71984
157084  2018-09-08 20:21:43       71984
157085  2018-09-05 18:46:18       71985
157086  2018-09-05 18:46:18       71985
157087  2018-09-05 18:46:18       71985
157088  2018-09-08 20:21:44       71990
157089  2018-09-08 20:21:44       71990
157090  2018-09-08 20:21:44       71990
157091  2018-09-08 20:21:43       72003
157092  2018-09-08 20:21:43       72003
157093  2018-09-08 20:21:43       72003
157094  2018-09-10 17:06:18       72024
157095  2018-09-10 17:15:05       72024

[157096 rows x 2 columns]
DATE_CHANGED is dtype: datetime64[ns]
STATION_ID is dtype: int64
pandas==0.23.4
python==2.7.15
DataFrame.drop_duplicates() identifies as duplicates any rows that contain the same values in all columns (or in the subset of columns you pass). By default it keeps the first occurrence of each duplicated row and drops the rest.

The keep parameter accepts {'first', 'last', False}, default 'first'. With 'first', all duplicate rows except the first are dropped; with 'last', all except the last are dropped; with False, every duplicated row is dropped.
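A minimal sketch of the three keep modes on a toy frame (the column names and values here are made up for illustration, not taken from the question's data):

```python
import pandas as pd

# Rows 0 and 1 are full duplicates of each other; row 2 is unique.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

first = df.drop_duplicates()             # default keep='first': rows 0 and 2 remain
last = df.drop_duplicates(keep='last')   # rows 1 and 2 remain
none = df.drop_duplicates(keep=False)    # only the non-duplicated row 2 remains

print(first.index.tolist())  # [0, 2]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [2]
```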
Try:
df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID', keep='last')
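As a sanity check, here is that one-liner applied to a toy frame mimicking the question's shape (the column names come from the question; the rows are fabricated to cover the three cases: duplicates with different dates, duplicates with identical dates, and no duplicate):

```python
import pandas as pd

# Station 1 has duplicates with different dates, station 10 has duplicates
# with identical dates, and station 7 appears only once.
df = pd.DataFrame({
    "DATE_CHANGED": pd.to_datetime([
        "2006-06-07 06:00:00", "2000-09-26 06:00:00",  # station 1
        "2001-06-08 06:00:00", "2001-06-08 06:00:00",  # station 10
        "2001-06-12 06:00:00",                         # station 7
    ]),
    "STATION_ID": [1, 1, 10, 10, 7],
})

# Sorting by date first means keep='last' retains the most recent entry
# for each STATION_ID; identical dates collapse to a single row either way.
result = (df.sort_values("DATE_CHANGED")
            .drop_duplicates("STATION_ID", keep="last")
            .sort_values("STATION_ID")
            .reset_index(drop=True))

print(result)
# One row per station; station 1 keeps its most recent date (2006-06-07).
```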