I have a table that looks like this:

A  B
1  cat
1  cat
1  dog
2  llama
2  alpaca
3  donkey

Using A as the key, I'd like to remove duplicates so that the dataframe becomes:

A  B
1  cat
3  donkey

1 is duplicated three times; the value cat occurs the most, so it is kept. There is no majority for 2, so it is considered ambiguous and removed completely. 3 remains as it has no duplicate.
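For anyone following along, the example frame can be reconstructed like this (a minimal sketch; the question does not show its construction):

```python
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3],
    'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey'],
})
```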
groupby + pd.Series.mode

This is a two-step solution using pd.Series.mode:
# find the mode for each group
i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
# filter out groups which have more than one mode—ambiguous groups
j = i[i.groupby(level=0).transform('count') == 1].reset_index()
print(j)
A B
0 1 cat
1 3 donkey
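To see why the transform('count') filter works, note that pd.Series.mode returns all modal values (sorted) when there is a tie, so ambiguous groups contribute more than one row:

```python
import pandas as pd

# A tie yields every modal value rather than an error
tied = pd.Series(['llama', 'alpaca']).mode()
assert list(tied) == ['alpaca', 'llama']

# A clear majority yields exactly one value
clear = pd.Series(['cat', 'cat', 'dog']).mode()
assert list(clear) == ['cat']
```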
groupby + custom func

Alternatively, define a custom function that computes the mode and call it with apply. The filtration logic is subsumed into the function.
def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1:
        return m

df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()
A B
0 1 cat
1 3 donkey
How about using value_counts and rank:
df.groupby('A')['B'].apply(lambda x: x.value_counts().rank(ascending=False)).eq(1)[lambda x: x].reset_index()
Output:
A level_1 B
0 1 cat True
1 3 donkey True
Here rank with the default method='average' acts as a voter on the value_counts results: tied frequencies share a fractional rank, so only an unambiguous majority gets rank 1.
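The filtered result above still carries the boolean column, and the inner index name varies across pandas versions; a small cleanup (a sketch, not part of the original answer) recovers the two-column shape:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3],
                   'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey']})

res = (df.groupby('A')['B']
         .apply(lambda x: x.value_counts().rank(ascending=False))
         .eq(1))
res = res[res].reset_index()
# The first two columns are the key and the winning value; the inner
# index name differs between pandas versions, so rename positionally.
res = res.iloc[:, :2].set_axis(['A', 'B'], axis=1)
```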
You can use statistics.mode, which raises StatisticsError if no unique mode exists.
from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

res = df.groupby('A')['B'].apply(moder)\
        .dropna().reset_index()

print(res)
A B
0 1 cat
1 3 donkey
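One caveat (not in the original answer): since Python 3.8, statistics.mode no longer raises StatisticsError on a tie; it returns the first mode encountered. On modern Python, statistics.multimode can detect ambiguity instead:

```python
import pandas as pd
from statistics import multimode

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3],
                   'B': ['cat', 'cat', 'dog', 'llama', 'alpaca', 'donkey']})

def moder(x):
    m = multimode(x)
    # A unique mode exists only when multimode returns a single value
    return m[0] if len(m) == 1 else None

res = df.groupby('A')['B'].apply(moder).dropna().reset_index()
```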
Performance benchmarking

Although all three methods are suitable for your task, they have slightly different performance.

Benchmarking results:
df = pd.concat([df]*10000)
%timeit jpp(df) # 18.3 ms ± 414 µs per loop
%timeit cs1(df) # 28.1 ms ± 558 µs per loop
%timeit cs2(df) # 24.5 ms ± 595 µs per loop
Benchmarking code:
from statistics import mode, StatisticsError

def moder(x):
    try:
        return mode(x)
    except StatisticsError:
        return None

def foo(x):
    m = pd.Series.mode(x)
    if len(m) == 1:
        return m

def jpp(df):
    return df.groupby('A')['B'].apply(moder)\
             .dropna().reset_index()

def cs1(df):
    i = df.groupby('A').B.apply(pd.Series.mode).reset_index(level=1, drop=True)
    return i[i.groupby(level=0).transform('count') == 1].reset_index()

def cs2(df):
    return df.groupby('A').B.apply(foo).reset_index(level=1, drop=True).reset_index()