Find duplicate values in R [duplicate]

Tags:

r

People also ask

Is duplicated in R?

duplicated() is a built-in R function that determines which elements of a vector, or rows of a data frame, are duplicates of elements with smaller subscripts. It returns a logical vector indicating which elements (rows) are duplicates.

How do you check if there are duplicate rows in R?

We can find the rows with duplicated values in a particular column of an R data frame by using the duplicated function inside the subset function. This returns only the duplicate rows based on the chosen column, which means the first occurrence of each value will not be in the output.
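As a minimal sketch of that approach, assuming a small made-up data frame (`df` and its columns are hypothetical and not taken from the question):

```r
# Hypothetical toy data, for illustration only.
df <- data.frame(id = c(1, 2, 2, 3, 3, 3),
                 value = c("a", "b", "c", "d", "e", "f"))

# Keep only rows whose id has already appeared earlier in the column.
# The first occurrence of each id is not returned.
dup_rows <- subset(df, duplicated(id))
dup_rows
```

Here `dup_rows` holds one row for id 2 and two rows for id 3; the first row of each id is left out.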

How do you filter out duplicates in R?

Remove duplicate rows in R using the distinct() function from dplyr. The dplyr package provides distinct(), which eliminates duplicate rows based on a single variable or on multiple variables.

You could use table, i.e.

n_occur <- data.frame(table(vocabulary$id))

gives you a data frame with a list of ids and the number of times they occurred.

n_occur[n_occur$Freq > 1,]

tells you which ids occurred more than once.

vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]

returns the records with more than one occurrence.
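The three steps above can be tried end to end on a toy data frame. This is only a sketch: the `vocabulary` object below is a made-up stand-in for the question's data, and `dup_ids` and `dup_records` are names introduced here for illustration.

```r
# Hypothetical stand-in for the question's `vocabulary` data frame.
vocabulary <- data.frame(id = c(101, 102, 102, 103, 104, 104, 104),
                         score = c(3, 5, 5, 2, 7, 8, 9))

# Count how often each id occurs; the resulting columns are Var1 and Freq.
n_occur <- data.frame(table(vocabulary$id))

# ids that occur more than once (Var1 is a factor, so convert to character).
dup_ids <- as.character(n_occur$Var1[n_occur$Freq > 1])

# All records that share an id with at least one other record.
dup_records <- vocabulary[vocabulary$id %in% dup_ids, ]
dup_records
```

Unlike duplicated() on its own, this keeps every occurrence of a repeated id, including the first one.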

This will give you the duplicate rows (every occurrence of an id after the first):

vocabulary[duplicated(vocabulary$id),]

This will give you the number of duplicates:

dim(vocabulary[duplicated(vocabulary$id),])[1]

Example:

vocabulary2 <- rbind(vocabulary, vocabulary[1, ]) # creates a duplicate at the end
vocabulary2[duplicated(vocabulary2$id),]
#            id year    sex education vocabulary
#21639 20040001 2004 Female         9          3
dim(vocabulary2[duplicated(vocabulary2$id),])[1]
#[1] 1 #=1 duplicate
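If only the count is needed, summing the logical vector is a shorter alternative to taking dim() of the subset. A small sketch, using a made-up `id` vector rather than the real data:

```r
# Made-up ids; the last value repeats the first.
id <- c(20040001, 20040002, 20040003, 20040001)

# TRUE counts as 1, so the sum is the number of duplicate entries.
n_dups <- sum(duplicated(id))
n_dups

# anyDuplicated() returns the index of the first duplicate, or 0 if none.
first_dup <- anyDuplicated(id)
first_dup
```

anyDuplicated() is handy as a quick yes/no check before doing any subsetting.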

EDIT

OK, with the additional information, here's what you should do: duplicated has a fromLast option which lets you flag duplicates starting from the end. If you combine this with the normal duplicated, you get all duplicates, including the first occurrence of each. The following example adds duplicates to the original vocabulary object (row 1 is duplicated twice and row 5 is duplicated once). I then use table to get the total number of occurrences per ID.

# Create vocabulary object with duplicates
voc.dups <- rbind(vocabulary, vocabulary[1, ], vocabulary[1, ], vocabulary[5, ])

# List duplicates
dups <- voc.dups[duplicated(voc.dups$id) | duplicated(voc.dups$id, fromLast = TRUE), ]
dups
#            id year    sex education vocabulary
#1     20040001 2004 Female         9          3
#5     20040008 2004   Male        14          1
#21639 20040001 2004 Female         9          3
#21640 20040001 2004 Female         9          3
#51000 20040008 2004   Male        14          1

#Count duplicates by id
table(dups$id)
#20040001 20040008 
#       3        2

Here, I summarize a few approaches. They do not all return the same result, so be careful which one you pick:

# First assign your "id"s to an R object.
# Here's a hypothetical example:
id <- c("a","b","b","c","c","c","d","d","d","d")

# To return all duplicated values except the first occurrence of each:
id[duplicated(id)]
## [1] "b" "c" "c" "d" "d" "d"

#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

#Yet another way to return ALL duplicated values, using %in% operator:
id[ id %in% id[duplicated(id)] ]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"
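If what you want is each duplicated value listed once, or a count per value, the same building blocks can be combined. This is a sketch on the same hypothetical `id` vector; `dup_values`, `counts` and `dup_counts` are names introduced here.

```r
id <- c("a","b","b","c","c","c","d","d","d","d")

# Each duplicated value listed once:
dup_values <- unique(id[duplicated(id)])
dup_values

# Number of occurrences of each value that appears more than once:
counts <- table(id)
dup_counts <- counts[counts > 1]
dup_counts
```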

Hope these help. Good luck.

Here's a data.table solution that will list the duplicates along with the number of duplications (will be 1 if there are 2 copies, and so on - you can adjust that to suit your needs):

library(data.table)
dt = data.table(vocabulary)

dt[duplicated(id), cbind(.SD[1], number = .N), by = id]
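A self-contained variant of the same idea, as a sketch only: the data below is made up, and instead of subsetting with duplicated() first it counts rows per id and subtracts one, which should give the same "number of extra copies" figure without carrying along the other columns.

```r
library(data.table)

# Hypothetical stand-in for `vocabulary`.
dt <- data.table(id = c(1, 2, 2, 3, 3, 3),
                 score = c(10, 20, 21, 30, 31, 32))

# Extra copies per id (total occurrences minus one); keep ids with at least one.
extra <- dt[, .(number = .N - 1L), by = id][number > 0]
extra
```

Drop the `- 1L` if you would rather report the total number of occurrences.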

A terser way, either with rev:

x[!(!duplicated(x) & rev(!duplicated(rev(x))))]

... or with fromLast:

x[!(!duplicated(x) & !duplicated(x, fromLast = TRUE))]

... and as a helper function that returns either a logical vector or the duplicated elements of the original vector:

duplicates <- function(x, as.bool = FALSE) {
    is.dup <- !(!duplicated(x) & rev(!duplicated(rev(x))))
    if (as.bool) { is.dup } else { x[is.dup] }
}
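A quick usage sketch for the helper, repeated here so the snippet runs on its own (the input vector `x` is made up):

```r
duplicates <- function(x, as.bool = FALSE) {
    is.dup <- !(!duplicated(x) & rev(!duplicated(rev(x))))
    if (as.bool) { is.dup } else { x[is.dup] }
}

x <- c("a", "b", "b", "c")

# Every element that occurs more than once, first occurrence included:
dup_elems <- duplicates(x)
dup_elems

# The same information as a logical mask the length of x:
dup_mask <- duplicates(x, as.bool = TRUE)
dup_mask
```

Note that the rev() trick assumes `x` is a plain vector; for data frames the fromLast form is the safer choice.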

Treating vectors as data frames to pass to table is handy but can get difficult to read, and the data.table solution is fine but I'd prefer base R solutions for dealing with simple vectors like IDs.

