I have the DataFrame below:
import pandas as pd
df = pd.DataFrame({
'ID': ['James', 'James', 'James', 'James',
'Max', 'Max', 'Max', 'Max', 'Max',
'Park', 'Park','Park', 'Park',
'Tom', 'Tom', 'Tom', 'Tom'],
'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298, 112, 112, 28, 311, 150, 60, 520, 99],
'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
'2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
'2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
'2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
It looks like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James 420 578 2020-02-01 # Drop this duplicated row (ignore date)
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
10 Park 28 112 2020-05-20 # Drop this duplicated row (ignore date)
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
Within each 'ID' (name) there are some consecutive duplicated rows, ignoring the Date value. For example, rows 1 and 2 for James both have From_num 420 and To_num 578, and the same happens with rows 9 and 10 for Park. I want to drop the second row of each duplicated pair and keep the first. I wrote this with loops and conditions, but the code is redundant and slow, and I assume there is an easier way, so please help if you have any ideas. Many thanks. The expected result is:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 298 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
8 Park 28 112 2020-05-21
9 Park 311 28 2019-11-22
10 Park Started 311 2019-04-12
11 Tom 60 150 2019-10-16
12 Tom 520 60 2019-08-26
13 Tom 99 520 2018-12-11
14 Tom Started 99 2018-10-09
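It's a bit late, but does this do what you wanted? It drops consecutive duplicates while ignoring "Date": each row is compared with the row above it on ID, From_num and To_num, and a row is kept only if at least one of those values differs.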
t = df[['ID', 'From_num', 'To_num']]
df[(t.ne(t.shift())).any(axis=1)]
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
This drops rows with index values 2 and 10.
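The output above keeps the original index labels (0, 1, 3, ...), while the expected result in the question is renumbered from 0. As a minimal sketch, the same shift-and-compare idea can be wrapped in a helper that also resets the index. The function name `drop_consecutive_duplicates` and the shortened sample data are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def drop_consecutive_duplicates(frame, cols):
    """Keep the first row of each run of consecutive rows that match on `cols`.

    Hypothetical helper name, used here only for illustration.
    """
    key = frame[cols]
    # A row is kept when at least one key column differs from the row above.
    # The first row is always kept because shift() puts NaN above it.
    keep = key.ne(key.shift()).any(axis=1)
    # Renumber the surviving rows from 0, as in the expected result.
    return frame[keep].reset_index(drop=True)


# Shortened sample taken from the question's data (James and Park only).
df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'Park', 'Park'],
    'From_num': [420, 420, 'Started', 28, 28],
    'To_num': [578, 578, 420, 112, 112],
    'Date': ['2020-02-02', '2020-02-01', '2019-06-18',
             '2020-05-21', '2020-05-20'],
})

result = drop_consecutive_duplicates(df, ['ID', 'From_num', 'To_num'])
print(result)
```

Because 'ID' is one of the compared columns, a change of name always counts as a difference, so no explicit groupby is needed. One caveat: NaN never compares equal to NaN, so consecutive rows that have missing values in the compared columns would not be treated as duplicates by this sketch.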