I have a matrix with many rows and columns, of the nature
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
Within each group of duplicate rows (i.e., each set of identical rows), I wish to identify the first row index and assign it to all occurrences within that group. For example, there are several duplicate rows with 1 in both columns (rows 1, 2, 7, and 10). On each of these rows I want the first row index, i.e. 1.
x
# [,1] [,2]
# [1,] 1 1 # first row of 1-1. Assign its row index, 1, to all 1-1 rows
# [2,] 1 1
# [3,] 3 3 # first row of 3-3. Assign its row index, 3, to all 3-3 rows
# [4,] 3 3
# [5,] 55 55 # first row of 55-55. Assign its row index, 5, to all 55-55 rows
# [6,] 55 55
# [7,] 1 1
# [8,] 3 3
# [9,] 3 9 # first (and only) row of 3-9; row index 9
# [10,] 1 1
Desired result:
1 1 3 3 5 5 1 3 9 1
The best I've come up with is a convoluted approach based on duplicated and for loops that is neither efficient nor elegant. I'm also aware of possible solutions for data.frames, but those involving concatenating rows into strings are quite resource-intensive too.
# Identify duplicates
duplicate <- duplicated(x, MARGIN = 1)
# Identify first occurrence of each duplicate
firstDup <- duplicated(x, MARGIN = 1, fromLast = TRUE) & !duplicate
indices <- which(firstDup)
# Initialize index for unique rows
index <- seq_len(dim(x)[1])
cf <- duplicate
for (i in indices) {
  # Duplicates must occur after the first occurrence
  cf[seq_len(i)] <- FALSE
  dups <- apply(x[cf, , drop = FALSE], 1L, identical, x[i, ])
  index[which(cf)[dups]] <- i
}
index
Is there an elegant solution using base R?
For integer matrices of equal size but different shapes (5e+06-by-2, 5e+05-by-20, 5000-by-2000), containing integers from 1 to 10, the fastest base answer tested was grouping/match, suggested in a comment by @alexis_laz. The fastest non-base answer was data.table::frank/match, though grouping/match was comparable in all cases, even outperforming the data.table answer in the 5000-by-2000 case.
Note that results may vary for double matrices or for integer matrices with a wider range of values, and depending on the number of threads made available to data.table.
@MikaelJagan's asplit/match(<list>, <list>) answer seems like "an elegant solution using base R". However, ?match warns:
Matching for lists is potentially very slow and best avoided except in simple cases.
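Still, the approach itself is transparent. On the OP's 10-by-2 example it does produce the desired result (a minimal illustration; the name x0 is introduced here to avoid clobbering the benchmark's x):

```r
# asplit(x0, 1L) splits the matrix into a list of row vectors;
# match(l, l) then finds the first occurrence of each row
x0 <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
               1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
l <- asplit(x0, 1L)
match(l, l)
# [1] 1 1 3 3 5 5 1 3 9 1
```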
Given that the OP has "a matrix with many rows and columns", we wanted to compare the performance of the asplit/match(<list>, <list>) answer to that of the other base answers:
- paste/match(<chr>, <chr>) answer;
- interaction/match(<int>, <int>) answer;
- grouping/match(<int>, <int>) answer.

We benchmarked these alongside some non-base answers, which we used as points of reference (recognizing that the OP asked for base only):

- Rcpp answer;
- data.table answers:
  - which = TRUE and mult = "first" to [.data.table;
  - frank(ties.method = "average")/match(<dbl>, <dbl>);
  - frank(ties.method = "dense")/match(<int>, <int>).

library(microbenchmark)
library(data.table)
getDTthreads() # 4
f_asplit <- function(x) {
  l <- asplit(x, 1L)
  match(l, l)
}
f_paste <- function(x) {
  s <- do.call(paste, as.data.frame(x))
  match(s, s)
}
f_interaction <- function(x) {
  z <- as.integer(interaction(as.data.frame(x)))
  match(z, z)
}
f_grouping <- function(x) {
  g <- do.call(grouping, as.data.frame(x))
  o <- order(g, method = "radix")
  e <- attr(g, "ends")
  z <- rep.int(seq_along(e), c(e[1L], e[-1L] - e[-length(e)]))[o]
  match(z, z)
}
f_join <- function(x) {
  d <- as.data.table(x)
  d[d, on = names(d), mult = "first", which = TRUE]
}
f_frank_average <- function(x) {
  d <- as.data.table(x)
  r <- frank(d, ties.method = "average")
  match(r, r)
}
f_frank_dense <- function(x) {
  d <- as.data.table(x)
  r <- frank(d, ties.method = "dense")
  match(r, r)
}
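The grouping-based answer is the least transparent of the four base answers, so here is how its pieces fit together on the OP's 10-by-2 example (a sketch; x0, g, e, z are names introduced here for illustration):

```r
x0 <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
               1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)
# 'g' is a permutation placing identical rows next to each other;
# attr(g, "ends") records where each of the 4 groups ends in that order
g <- do.call(grouping, as.data.frame(x0))
e <- attr(g, "ends")
# Expand the group ends into one group id per row, then map the ids
# back to the original row order via the inverse permutation order(g)
z <- rep.int(seq_along(e), c(e[1L], e[-1L] - e[-length(e)]))[order(g)]
# match(z, z) replaces each group id with its first row index
match(z, z)
# [1] 1 1 3 3 5 5 1 3 9 1
```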
Rcpp::sourceCpp('<copy source code from @MikaelJagan\'s answer here>') # defines f_rcpp
We first assessed performance using a 5e+06-by-2 integer matrix:
set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 2L)
microbenchmark(
f_asplit(x),
f_paste(x),
f_interaction(x),
f_grouping(x),
f_join(x),
f_frank_average(x),
f_frank_dense(x),
f_rcpp(x),
times = 10L,
check = "identical",
setup = gc(FALSE)
)
Unit: milliseconds
expr min lq mean median uq max neval
f_asplit(x) 17369.93905 18861.91195 19070.21298 19013.0180 19207.29194 22420.71085 10
f_paste(x) 502.63884 507.35077 509.01823 509.2443 511.72301 515.10083 10
f_interaction(x) 234.19311 236.52494 241.80098 238.7392 242.32923 259.75644 10
f_grouping(x) 182.25226 182.89358 187.09642 184.6124 187.10444 208.15532 10
f_join(x) 119.43460 120.86829 123.16607 122.9332 125.07169 128.44722 10
f_frank_average(x) 104.40150 107.53607 111.00268 108.5597 116.80375 121.83675 10
f_frank_dense(x) 86.60926 88.29555 91.42976 90.4716 92.32413 99.30659 10
f_rcpp(x) 459.02304 464.79855 472.43669 468.2492 470.25508 523.06734 10
f_asplit is two orders of magnitude slower than the base alternatives. f_grouping is the fastest base answer, but f_frank_dense is faster by a factor of about 2 (and fastest overall).
The results above do not generalize to all integer matrix inputs. For example, f_interaction scales very poorly with ncol(x): the number of possible interactions is u^ncol(x) if each column of x has u unique elements.
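The growth is easy to see directly; with u = 10 distinct values per column, as in these benchmarks, the level count overflows double precision long before ncol(x) reaches 2000 (a quick sketch):

```r
# Number of candidate interaction levels is u^ncol(x)
u <- 10
u^c(2, 20, 2000)
# [1] 1e+02 1e+20   Inf
```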
For this reason, we performed a second benchmark, this time considering a matrix with fewer rows (5e+05) and more columns (20).
set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 20L)
An initial test of f_interaction resulted in a memory allocation error, so it was excluded from the benchmark.
system.time(f_interaction(x))
Error: cannot allocate vector of size 7.5 Gb
Timing stopped at: 173.2 6.05 200.4
microbenchmark(
f_asplit(x),
f_paste(x),
## f_interaction(x),
f_grouping(x),
f_join(x),
f_frank_average(x),
f_frank_dense(x),
f_rcpp(x),
times = 10L,
check = "identical",
setup = gc(FALSE)
)
Unit: milliseconds
expr min lq mean median uq max neval
f_asplit(x) 5416.08762 5681.23523 5731.89246 5732.31779 5905.44517 5913.77141 10
f_paste(x) 592.92990 604.15083 629.31101 623.78679 637.81814 724.83871 10
f_grouping(x) 63.89522 64.14134 65.42723 65.11530 66.00557 68.06045 10
f_join(x) 340.73722 342.18096 353.35774 352.08861 359.88480 382.13480 10
f_frank_average(x) 69.90496 70.81840 72.29819 72.04409 73.11977 77.44347 10
f_frank_dense(x) 52.58033 53.33760 54.42029 54.01672 55.63532 56.99664 10
f_rcpp(x) 184096.21999 184816.36584 185774.76817 186218.58335 186696.31674 186781.24972 10
f_grouping remains the fastest base answer. Notably, it is now faster than f_paste by a full order of magnitude and only marginally slower than f_frank_dense.
We performed a final benchmark excluding the slowest answers in the last round (f_asplit and f_rcpp), now considering a 5000-by-2000 integer matrix:
set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 2000L)
microbenchmark(
## f_asplit(x),
f_paste(x),
## f_interaction(x),
f_grouping(x),
f_join(x),
f_frank_average(x),
f_frank_dense(x),
## f_rcpp(x),
times = 10L,
check = "identical",
setup = gc(FALSE)
)
Unit: milliseconds
expr min lq mean median uq max neval
f_paste(x) 1067.47994 1075.45148 1083.17391 1080.72997 1089.74027 1102.45249 10
f_grouping(x) 19.24007 19.50026 19.86404 19.79002 20.25302 20.60127 10
f_join(x) 616.66706 621.29854 630.61460 628.16315 636.39097 650.16180 10
f_frank_average(x) 59.82007 61.41706 62.68610 62.99318 64.56520 64.88463 10
f_frank_dense(x) 58.03648 60.59857 63.50526 61.99278 66.03694 71.30638 10
Now f_grouping is fastest overall, and faster than f_frank_dense by a factor of about 3.
If you have a large matrix, then the following solution might suffice:
l <- do.call(paste, data.frame(x))
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1
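One caveat about the paste approach, irrelevant for numeric matrices like the OP's but worth keeping in mind: with character columns, the default separator can make distinct rows collide (y is a hypothetical example introduced here):

```r
# Two different rows paste to the same string with the default sep = " "
y <- rbind(c("a b", "c"), c("a", "b c"))
s <- do.call(paste, data.frame(y))
s            # both rows collapse to "a b c"
match(s, s)  # reports the rows as duplicates: 1 1
```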