I have a table that looks like this:

A  B
1  cat
1  cat
1  dog
2  llama
2  alpaca
3  donkey

Using A as the key, I'd like to remove duplicates so that the dataframe becomes:

A  B
1  cat
3  donkey

1 is duplicated three times; the value cat occurs the most, so it is kept. There is no majority for 2, so it is considered ambiguous and removed completely. 3 remains as it has no duplicate.
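For anyone following along, the example frame can be reconstructed like this (a minimal sketch; the question does not show its construction):

```python
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3],
    'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey'],
})
```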
groupby + pd.Series.mode

This is a two-step solution using pd.Series.mode:
# find the mode for each group
i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
# filter out groups which have more than one mode—ambiguous groups
j = i[i.groupby(level=0).transform('count') == 1].reset_index()
print(j)
A B
0 1 cat
1 3 donkey
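To see why the transform('count') filter works, note that pd.Series.mode returns all modal values (sorted) when there is a tie, so ambiguous groups contribute more than one row:

```python
import pandas as pd

# A tie yields every modal value rather than an error
tied = pd.Series(['llama', 'alpaca']).mode()
assert list(tied) == ['alpaca', 'llama']

# A clear majority yields exactly one value
clear = pd.Series(['cat', 'cat', 'dog']).mode()
assert list(clear) == ['cat']
```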
groupby + custom func

Alternatively, define a custom function that computes the mode and call it with apply. The filtration logic is subsumed into the function.
def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1:
        return m

df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()
A B
0 1 cat
1 3 donkey
How about using value_counts and rank:
df.groupby('A')['B'].apply(lambda x: x.value_counts().rank(ascending=False)).eq(1)[lambda x: x].reset_index()
Output:
A level_1 B
0 1 cat True
1 3 donkey True
Here rank with the default method='average' acts as a voter on the value_counts results: tied frequencies share a fractional rank, so only an unambiguous majority gets rank 1.
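The filtered result above still carries the boolean column, and the inner index name varies across pandas versions; a small cleanup (a sketch, not part of the original answer) recovers the two-column shape:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3],
                   'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey']})

res = (df.groupby('A')['B']
         .apply(lambda x: x.value_counts().rank(ascending=False))
         .eq(1))
res = res[res].reset_index()
# The first two columns are the key and the winning value; the inner
# index name differs between pandas versions, so rename positionally.
res = res.iloc[:, :2].set_axis(['A', 'B'], axis=1)
```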
You can use statistics.mode, which raises StatisticsError if no unique mode exists.
from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

res = df.groupby('A')['B'].apply(moder)\
        .dropna().reset_index()

print(res)
A B
0 1 cat
1 3 donkey
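One caveat (not in the original answer): since Python 3.8, statistics.mode no longer raises StatisticsError on a tie; it returns the first mode encountered. On modern Python, statistics.multimode can detect ambiguity instead:

```python
import pandas as pd
from statistics import multimode

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3],
                   'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey']})

def moder(x):
    m = multimode(x)
    # A unique mode exists only when multimode returns a single value
    return m[0] if len(m) == 1 else None

res = df.groupby('A')['B'].apply(moder).dropna().reset_index()
```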
Performance benchmarking

Although all three methods are suitable for your task, they have slightly different performance.

Benchmarking results:
df = pd.concat([df]*10000)
%timeit jpp(df) # 18.3 ms ± 414 µs per loop
%timeit cs1(df) # 28.1 ms ± 558 µs per loop
%timeit cs2(df) # 24.5 ms ± 595 µs per loop
Benchmarking code:
from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1:
        return m

def jpp(df):
    return df.groupby('A')['B'].apply(moder)\
             .dropna().reset_index()

def cs1(df):
    i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
    return i[i.groupby(level=0).transform('count') == 1].reset_index()

def cs2(df):
    return df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()