
Drop duplicates, keep most recent date, Pandas dataframe


I have a Pandas dataframe containing two columns: a datetime column, and a column of integers representing station IDs. I need a new dataframe with the following modifications:

For each set of duplicate STATION_ID values, keep the row with the most recent entry for DATE_CHANGED. If the duplicate entries for the STATION_ID all contain the same DATE_CHANGED then drop the duplicates and retain a single row for the STATION_ID. If there are no duplicates for the STATION_ID value, simply retain the row.

Dataframe (sorted by STATION_ID):

              DATE_CHANGED  STATION_ID
0      2006-06-07 06:00:00           1
1      2000-09-26 06:00:00           1
2      2000-09-26 06:00:00           1
3      2000-09-26 06:00:00           1
4      2001-06-06 06:00:00           2
5      2005-07-29 06:00:00           2
6      2005-07-29 06:00:00           2
7      2001-06-06 06:00:00           2
8      2001-06-08 06:00:00           4
9      2003-11-25 07:00:00           4
10     2001-06-12 06:00:00           7
11     2001-06-04 06:00:00           8
12     2017-04-03 18:36:16           8
13     2017-04-03 18:36:16           8
14     2017-04-03 18:36:16           8
15     2001-06-04 06:00:00           8
16     2001-06-08 06:00:00          10
17     2001-06-08 06:00:00          10
18     2001-06-08 06:00:00          11
19     2001-06-08 06:00:00          11
20     2001-06-08 06:00:00          12
21     2001-06-08 06:00:00          12
22     2001-06-08 06:00:00          13
23     2001-06-08 06:00:00          13
24     2001-06-08 06:00:00          14
25     2001-06-08 06:00:00          14
26     2001-06-08 06:00:00          15
27     2017-08-07 17:48:25          15
28     2001-06-08 06:00:00          15
29     2017-08-07 17:48:25          15
...                    ...         ...
157066 2018-08-06 14:11:28       71655
157067 2018-08-06 14:11:28       71656
157068 2018-08-06 14:11:28       71656
157069 2018-09-11 21:45:05       71664
157070 2018-09-11 21:45:05       71664
157071 2018-09-11 21:45:05       71664
157072 2018-09-11 21:41:04       71664
157073 2018-08-09 15:22:07       71720
157074 2018-08-09 15:22:07       71720
157075 2018-08-09 15:22:07       71720
157076 2018-08-23 12:43:12       71899
157077 2018-08-23 12:43:12       71899
157078 2018-08-23 12:43:12       71899
157079 2018-09-08 20:21:43       71969
157080 2018-09-08 20:21:43       71969
157081 2018-09-08 20:21:43       71969
157082 2018-09-08 20:21:43       71984
157083 2018-09-08 20:21:43       71984
157084 2018-09-08 20:21:43       71984
157085 2018-09-05 18:46:18       71985
157086 2018-09-05 18:46:18       71985
157087 2018-09-05 18:46:18       71985
157088 2018-09-08 20:21:44       71990
157089 2018-09-08 20:21:44       71990
157090 2018-09-08 20:21:44       71990
157091 2018-09-08 20:21:43       72003
157092 2018-09-08 20:21:43       72003
157093 2018-09-08 20:21:43       72003
157094 2018-09-10 17:06:18       72024
157095 2018-09-10 17:15:05       72024

[157096 rows x 2 columns]

DATE_CHANGED is dtype: datetime64[ns]

STATION_ID is dtype: int64

pandas==0.23.4

python==2.7.15

asked Sep 18 '18 23:09 by PJW



1 Answer

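Try sorting by DATE_CHANGED so that the most recent row for each STATION_ID comes last, then drop duplicates on STATION_ID and keep the last occurrence:

df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID', keep='last')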
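
For illustration, here is a minimal sketch of that one-liner on a few rows taken from the question's dataframe. The names `latest` and `check`, the final re-sort by STATION_ID, and the index reset are my own additions and not part of the answer. The groupby cross-check is an alternative that only applies here because the frame has just these two columns.

```python
import pandas as pd

# A few rows taken from the question's dataframe (stations 1, 2 and 7).
df = pd.DataFrame({
    'DATE_CHANGED': pd.to_datetime([
        '2006-06-07 06:00:00', '2000-09-26 06:00:00', '2000-09-26 06:00:00',
        '2001-06-06 06:00:00', '2005-07-29 06:00:00', '2005-07-29 06:00:00',
        '2001-06-12 06:00:00',
    ]),
    'STATION_ID': [1, 1, 1, 2, 2, 2, 7],
})

# Sort ascending by date so the most recent row per station comes last,
# then keep only that last row for each STATION_ID.
latest = (
    df.sort_values('DATE_CHANGED')
      .drop_duplicates('STATION_ID', keep='last')
      .sort_values('STATION_ID')   # optional: restore station order
      .reset_index(drop=True)      # optional: renumber the rows
)

# Alternative cross-check (not from the answer): with only these two
# columns, the per-station maximum date yields the same rows.
check = df.groupby('STATION_ID', as_index=False)['DATE_CHANGED'].max()

print(latest)
```

Rows that tie on both STATION_ID and DATE_CHANGED are interchangeable, so it does not matter which of them the sort happens to place last. If the real dataframe carried additional columns, the sort-then-drop approach would keep them, whereas the groupby cross-check would not.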
answered Sep 19 '22 13:09 by sacuL