How to drop duplicate from a pandas dataframe with some complex conditions?

Question

I am trying to drop duplicates, but based on some conditions. My dataframe looks like this:

idx  a    b    c   d   e  f
1    1   ss1   0   25  A  B
2    3   ss7   0   25  A  B
3    5   ss5   0   12  C  D
4    11  im3   0   12  C  D
5    5   ss8   0   50  C  K
6    9   im8   0   5   F  G
7    8   ix6   0   5   F  G

Rows are considered duplicates if the values of columns d, e and f together match other records in the dataframe subset=['d', 'e', 'f']. For example, rows 1 and 2 are duplicates, rows 3 and 4 are duplicates, and rows 6 and 7 are duplicates. The selection of which row to drop is based on column b.

If the value in column b begins with ss for both duplicates (rows 1 and 2), then anyone can be dropped
If one of the duplicates begins with ss and the other begins with a different format (rows 3 and 4), then the one that begins with ss should be kept.
If both duplicates in column b begin with anything other than ss (rows 6 and 7), then anyone can be selected.

Therefore, the expected output should be something like this:

idx  a    b    c   d   e  f
2    3   ss7   0   25  A  B
3    5   ss5   0   12  C  D
5    5   ss8   0   50  C  K
7    8   ix6   0   5   F  G

Corralien · Accepted Answer

Sort by b key first (everything starts by 'ss' is moved to the end) then drop duplicates from ['d', 'e', 'f'] (keep the last):

out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
         .drop_duplicates(['d', 'e', 'f'], keep='last').sort_index())

# OR

out = (df.sort_values('b', key=lambda x: x.str.startswith('ss'))
         .groupby(['d', 'e', 'f'], as_index=False).nth(-1).sort_index())

Output:

>>> out
   idx  a    b  c   d  e  f
1    2  3  ss7  0  25  A  B
2    3  5  ss5  0  12  C  D
4    5  5  ss8  0  50  C  K
6    7  8  ix6  0   5  F  G

How to drop duplicate from a pandas dataframe with some complex conditions?

Tags:

python

python-3.x

pandas

dataframe

duplicates

sam_alloy

1 Answers

Corralien

Recent Activity

Donate For Us

How to drop duplicate from a pandas dataframe with some complex conditions?

Tags:

python

python-3.x

pandas

dataframe

duplicates

sam_alloy

1 Answers

Corralien

Related questions

Recent Activity

Donate For Us