How can I efficiently match/group the indices of duplicated rows?
Let's say I have this data set:
set.seed(14)
dat <- data.frame(mtcars[sample(1:5, 14, TRUE), ])[sample.int(14), ]
rownames(dat) <- NULL
dat
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 2  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3  18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 4  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 5  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 6  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 7  18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 8  18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 9  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 10 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 11 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 12 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 13 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 14 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
I can find the indices of all duplicated rows (including the first occurrence of each) using
which_duplicated <- function(dat) {
  which(duplicated(dat) | duplicated(dat[nrow(dat):1, ])[nrow(dat):1])
}
which_duplicated(dat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 13
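As a side note, the reversed-row trick above can probably be written more directly with the documented `fromLast` argument of `duplicated()`. The sketch below is only an illustration: `idx` and `which_duplicated2` are names made up here, and `idx` simply lists the `mtcars` row positions that reproduce the 14 rows printed above, so the example does not depend on the random number generator.

```r
# Rebuild the 14 rows shown above deterministically from mtcars row positions
# (idx is an assumption made for this sketch, not part of the original post).
idx <- c(3, 2, 5, 3, 3, 3, 5, 5, 3, 5, 5, 4, 2, 1)
dat <- mtcars[idx, ]
rownames(dat) <- NULL

# A row is flagged if it repeats an earlier row OR is repeated by a later one.
which_duplicated2 <- function(dat) {
  which(duplicated(dat) | duplicated(dat, fromLast = TRUE))
}

res_dup <- which_duplicated2(dat)
res_dup
## [1]  1  2  3  4  5  6  7  8  9 10 11 13
```

This avoids reversing the data frame twice, but it still returns a flat vector of indices rather than the grouped list asked for.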
But I want to be able to match those indices up as seen below:
list(
  c(2, 13),
  c(1, 4, 5, 6, 9),
  c(3, 7, 8, 10, 11)
)
How can I do this efficiently?
Here's a possibility using "data.table":
library(data.table)
as.data.table(dat)[, c("GRP", "N") := .(.GRP, .N), by = names(dat)][
  N > 1, list(list(.I)), by = GRP]
##    GRP             V1
## 1:   1      1,4,5,6,9
## 2:   2           2,13
## 3:   3  3, 7, 8,10,11
The basic idea is to create a column that "groups" the other columns (using .GRP) as well as a column that counts how many rows fall into each group (using .N), then keep only the groups with more than one row, and collect the row indices (.I) of each remaining "GRP" into a list.
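The same idea can be spelled out one step at a time. This is a minimal sketch, not the answer's exact code: `idx`, `dt`, `row`, `dups` and `rows` are names chosen here for illustration, and the row number is stored in an explicit column first so the result does not rely on how `.I` behaves together with a filter in `i`.

```r
library(data.table)

# Rebuild the question's 14 rows from mtcars row positions (idx is assumed here).
idx <- c(3, 2, 5, 3, 3, 3, 5, 5, 3, 5, 5, 4, 2, 1)
dat <- mtcars[idx, ]
rownames(dat) <- NULL

dt <- as.data.table(dat)

# Step 1: remember each row's original position.
dt[, row := .I]

# Step 2: label every distinct combination of the original columns (.GRP)
# and count how many rows share it (.N).
dt[, c("GRP", "N") := .(.GRP, .N), by = c(names(dat))]

# Step 3: keep groups with more than one row and collect their row positions.
dups <- dt[N > 1, .(rows = list(row)), by = GRP]
dups$rows
```

Because `.GRP` numbers the groups in order of first appearance, `dups$rows` should list the group starting at row 1 first, then the one starting at row 2, then the one starting at row 3.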
We can use dplyr. Using a similar methodology to @AnandaMahto's post, we create a row-name column (add_rownames()), group by all the columns, filter the dataset to the groups with more than one row, summarise the 'rowname' column into a list, and extract that list column.
library(dplyr)
add_rownames(dat) %>%
  group_by_(.dots = names(dat)) %>%
  filter(n() > 1) %>%
  summarise(rn = list(rowname)) %>%
  .$rn
#[[1]]
#[1] "3" "7" "8" "10" "11"
#[[2]]
#[1] "2" "13"
#[[3]]
#[1] "1" "4" "5" "6" "9"
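Note that `add_rownames()` and `group_by_()` are deprecated in current dplyr releases. The following is one possible way to express the same pipeline with newer verbs, assuming dplyr 1.0 or later; `idx`, `row`, `rows` and `res` are names picked for this sketch. It stores integer row numbers instead of character row names, so the resulting list holds integers.

```r
library(dplyr)

# Rebuild the question's 14 rows from mtcars row positions (idx is assumed here).
idx <- c(3, 2, 5, 3, 3, 3, 5, 5, 3, 5, 5, 4, 2, 1)
dat <- mtcars[idx, ]
rownames(dat) <- NULL

res <- dat %>%
  mutate(row = row_number()) %>%            # remember original row positions
  group_by(across(all_of(names(dat)))) %>%  # group by every original column
  filter(n() > 1) %>%                       # keep groups with at least two rows
  summarise(rows = list(row), .groups = "drop") %>%
  pull(rows)                                # extract the list column

res
```

As in the output above, `summarise()` orders the groups by the grouping columns, so the order of the list elements differs from the data.table result.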