I have a dataframe like
col1 col2 col3
A    B    C
A    B    C
A    B    B
A    B    B
A    B    C
B    C    A
I want to get an output in the below format:
col1 col2 col3 Count
A    B    C    3      Duplicates
A    B    B    2      Duplicates
I don't want to pass any specific column to the function that finds the duplicates. That is the reason for not using add_count from dplyr.
Using duplicated() I end up with
  col1 col2 col3 count
2    A    B    C     3
3    A    B    B     2
5    A    B    C     3
so that is not the desired output either.
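For reference, the example data can be reconstructed roughly like this (the data frame name df and character columns are assumptions, since the question does not specify them):
df <- data.frame(
  col1 = c("A", "A", "A", "A", "A", "B"),
  col2 = c("B", "B", "B", "B", "B", "C"),
  col3 = c("C", "C", "B", "B", "C", "A")
)
# duplicated(df) only flags the later occurrences of repeated rows;
# it does not collapse them into one row per combination with a count.
duplicated(df)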
We can use group_by_all() to group by all the columns, count the rows in each group, and then keep only the combinations that are duplicated, i.e. the rows where the count n > 1.
library(dplyr)
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
#  col1  col2  col3      n
#  <fct> <fct> <fct> <int>
#1 A     B     B         2
#2 A     B     C         3
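Note that group_by_all() is superseded in recent dplyr releases; an equivalent sketch with across() (still assuming the data frame is called df) is:
df %>%
  group_by(across(everything())) %>%
  count() %>%
  filter(n > 1)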
Or we can use data.table: names(df) supplies every column as the grouping key, and .N gives the number of rows in each group.
library(data.table)
setDT(df)[, .(n = .N), by = names(df)][n > 1]
#   col1 col2 col3 n
#1:    A    B    C 3
#2:    A    B    B 2
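If the input should stay a plain data.frame rather than being converted by reference, as.data.table() can be used instead of setDT() — a sketch of the same idea:
library(data.table)
as.data.table(df)[, .(n = .N), by = names(df)][n > 1]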
Or with base R
subset(aggregate(n ~ ., transform(df, n = 1), FUN = sum), n > 1)
#  col1 col2 col3 n
#2    A    B    B 2
#3    A    B    C 3
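If that one-liner feels dense, the same base R idea can be written in steps (again assuming the data frame is named df):
df_n <- transform(df, n = 1)                        # add a helper column of 1s
counts <- aggregate(n ~ ., data = df_n, FUN = sum)  # sum the 1s per unique combination of the other columns
subset(counts, n > 1)                               # keep only combinations that occur more than once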