Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop duplicates based on majority rule

I have a table that looks like this:

A  B
1  cat
1  cat
1  dog
2  illama
2  alpaca
3  donkey

Using A as the key, I'd like to remove duplicates such that that dataframe becomes:

A  B
1  cat
3  donkey

1 is duplicated three times, the value cat occurs the most so it is recorded. there is no majority for 2 so it is considered ambiguous and removed completely. 3 remains as it has no duplicate.

like image 716
Anderlecht Avatar asked May 15 '18 17:05

Anderlecht


People also ask

Does pandas drop duplicates keep first?

- first : Drop duplicates except for the first occurrence.

What is subset in drop duplicates?

subset: column label or sequence of labels to consider for identifying duplicate rows. By default, all the columns are used to find the duplicate rows.

How do you drop duplicates in DF?

Use DataFrame. drop_duplicates() to Drop Duplicate and Keep First Rows. You can use DataFrame. drop_duplicates() without any arguments to drop rows with the same values on all columns.


3 Answers

groupby + pd.Series.mode

This is a two step solution using pd.Series.mode:

# find the mode for each group
i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
# filter out groups which have more than one mode—ambiguous groups
j = i[i.groupby(level=0).transform('count') == 1].reset_index()

print(j)

   A       B
0  1     cat
1  3  donkey

groupby + <custom func>

Alternatively, define a custom function that computes the mode and call it with apply. The filtration logic is subsumed into the function.

def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1: 
        return m

df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()

   A       B
0  1     cat
1  3  donkey
like image 189
cs95 Avatar answered Oct 15 '22 05:10

cs95


How about using value_counts and rank:

df.groupby('A')['B'].apply(lambda x: x.value_counts().rank(ascending=False)).eq(1)[lambda x: x].reset_index()

Output:

   A level_1     B
0  1     cat  True
1  3  donkey  True

Using rank with method='average' as a voter to get "majority" from the results of value_count.

like image 23
Scott Boston Avatar answered Oct 15 '22 05:10

Scott Boston


You can use statistics.mode, which raises StatisticsError if no unique mode exists.

from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

res = df.groupby('A')['B'].apply(moder)\
        .dropna().reset_index()

print(res)

   A       B
0  1     cat
1  3  donkey

Performance benchmarking

Although all 3 methods are suitable for your task, they have slightly difference performance.

Benchmarking results:

df = pd.concat([df]*10000)

%timeit jpp(df)  # 18.3 ms ± 414 µs per loop
%timeit cs1(df)  # 28.1 ms ± 558 µs per loop
%timeit cs2(df)  # 24.5 ms ± 595 µs per loop

Benchmarking code:

from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1: 
        return m

def jpp(df):
    return df.groupby('A')['B'].apply(moder)\
             .dropna().reset_index()

def cs1(df):
    i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
    return i[i.groupby(level=0).transform('count') == 1].reset_index()

def cs2(df):
    return df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()
like image 24
jpp Avatar answered Oct 15 '22 07:10

jpp