I have a dataframe like
col1 col2 col3
A    B    C
A    B    C
A    B    B
A    B    B
A    B    C
B    C    A
I want to get an output in the below format:
col1 col2 col3 Count
A    B    C    3      Duplicates
A    B    B    2      Duplicates
I don't want to pass any specific column to the function that finds the duplicates. That is the reason for not using add_count from dplyr.
Using duplicated() I end up with
  col1 col2 col3 count
2    A    B    C     3
3    A    B    B     2
5    A    B    C     3
so that is not the desired output either.
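For reference, the example data can be reconstructed roughly like this (the data frame name df and character columns are assumptions, since the question does not specify them):
df <- data.frame(
  col1 = c("A", "A", "A", "A", "A", "B"),
  col2 = c("B", "B", "B", "B", "B", "C"),
  col3 = c("C", "C", "B", "B", "C", "A")
)
# duplicated(df) only flags the later occurrences of repeated rows;
# it does not collapse them into one row per combination with a count.
duplicated(df)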
We can use group_by_all() to group by all the columns, count the rows in each group, and then keep only the combinations that are duplicated, i.e. the rows where the count n > 1.
library(dplyr)
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
#  col1  col2  col3      n
#  <fct> <fct> <fct> <int>
#1 A     B     B         2
#2 A     B     C         3
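Note that group_by_all() is superseded in recent dplyr releases; an equivalent sketch with across() (still assuming the data frame is called df) is:
df %>%
  group_by(across(everything())) %>%
  count() %>%
  filter(n > 1)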
Or we can use data.table: names(df) supplies every column as the grouping key, and .N gives the number of rows in each group.
library(data.table)
setDT(df)[, .(n = .N), by = names(df)][n > 1]
#   col1 col2 col3 n
#1:    A    B    C 3
#2:    A    B    B 2
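If the input should stay a plain data.frame rather than being converted by reference, as.data.table() can be used instead of setDT() — a sketch of the same idea:
library(data.table)
as.data.table(df)[, .(n = .N), by = names(df)][n > 1]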
Or with base R
subset(aggregate(n ~ ., transform(df, n = 1), FUN = sum), n > 1)
#  col1 col2 col3 n
#2    A    B    B 2
#3    A    B    C 3
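If that one-liner feels dense, the same base R idea can be written in steps (again assuming the data frame is named df):
df_n <- transform(df, n = 1)                        # add a helper column of 1s
counts <- aggregate(n ~ ., data = df_n, FUN = sum)  # sum the 1s per unique combination of the other columns
subset(counts, n > 1)                               # keep only combinations that occur more than once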