Pandas drop consecutive duplicate rows only, ignoring specific columns

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'James',
           'Max', 'Max', 'Max', 'Max', 'Max',
           'Park', 'Park','Park', 'Park',
           'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
    'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298, 112, 112, 28, 311, 150, 60, 520, 99],
    'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
             '2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
             '2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})

It looks like this:

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
2   James      420     578  2020-02-01 # Drop this duplicate row (ignore date)
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
10   Park       28     112  2020-05-20 # Drop this duplicate row (ignore date)
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

Within each 'ID' there are some consecutive duplicated rows if the Date value is ignored. For example, rows 1 and 2 for James both have From_num 420 and To_num 578, and the same happens with rows 9 and 10 for Park. I want to drop the second row of each such pair and keep the first. I wrote this with loops and conditions, but it is verbose and slow, and I assume there is an easier way. Any ideas would be appreciated. The expected result is:

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
2   James  Started     420  2019-06-18
3     Max      298      36  2019-08-26
4     Max       78     298  2019-06-20
5     Max       36      78  2019-01-30
6     Max      298      36  2018-10-23
7     Max  Started     298  2018-08-29
8    Park       28     112  2020-05-21
9    Park      311      28  2019-11-22
10   Park  Started     311  2019-04-12
11    Tom       60     150  2019-10-16
12    Tom      520      60  2019-08-26
13    Tom       99     520  2018-12-11
14    Tom  Started      99  2018-10-09
Alice jinx asked Nov 15 '22 07:11



1 Answer

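It's a bit late, but does this do what you want? It compares every row with the row directly above it on all columns except "Date" and keeps only the rows that differ, so consecutive duplicates are dropped: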

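# Columns to compare; 'Date' is left out on purpose
t = df[['ID', 'From_num', 'To_num']]
# Keep a row if any compared column differs from the previous row
df[t.ne(t.shift()).any(axis=1)]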

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

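This drops the rows with index values 2 and 10. The original index labels are kept; append .reset_index(drop=True) if you want them renumbered as in your expected output.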
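
For reuse, the same idea can be wrapped in a small function. This is only a sketch: the name `drop_consecutive_duplicates` and its `ignore` parameter are hypothetical helpers made up for illustration, not part of pandas. It also renumbers the index so the result matches the expected output in the question.

```python
import pandas as pd


def drop_consecutive_duplicates(frame, ignore=("Date",)):
    """Drop rows that equal the row directly above, ignoring `ignore` columns.

    Hypothetical helper for illustration; not a pandas API.
    """
    compared = frame.drop(columns=list(ignore))
    # A row is kept when at least one compared column differs from the previous row.
    # The first row is always kept because shift() fills it with missing values.
    keep = compared.ne(compared.shift()).any(axis=1)
    return frame[keep].reset_index(drop=True)


df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'James',
           'Max', 'Max', 'Max', 'Max', 'Max',
           'Park', 'Park', 'Park', 'Park',
           'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started',
                 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
    'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298,
               112, 112, 28, 311, 150, 60, 520, 99],
    'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
             '2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
             '2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11',
             '2018-10-09']})

result = drop_consecutive_duplicates(df)
print(result)
```

Because 'ID' is one of the compared columns, a row is never dropped just because it matches the last row of the previous ID on the numeric columns alone. One caveat: missing values never compare equal, so two consecutive rows that both contain NaN in a compared column are both kept.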

cs95 answered Nov 18 '22 10:11
