I was trying to solve a problem where I had a long list with a variable number of values at each index. The goal was to report the earliest index at which every value appeared. So if 15 shows up at indices 45 and 78 in the list, I should return that 15 is located first at index 45. In the original problem the list was roughly 10,000 elements long, so doing this fast was helpful.
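For a small concrete example (made-up data, just to illustrate what I'm after):

## toy list: 15 appears at indices 2 and 4, so the answer for 15 should be index 2
x_toy <- list(c(3, 7), c(15, 3), c(9, 1), c(15, 6))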
Originally I tried to work with the existing list structure and did something like this, which was very slow at 10,000 elements.
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
## for every value 1:100, scan the entire list to find the first element that contains it
cbind(value = 1:100,
      index = sapply(1:100, function(i) which.max(sapply(x, function(el) i %in% el))))
Eventually I tried converting the data into a data.table, which worked much better, but I always wondered if there was a better way to go about solving the problem. Is the default list structure inherently inefficient, or was there a better way I could have worked with it?
library(data.table)
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
## one row per (index, value) pair, then keep the first index within each value group
dt <- data.table(index = rep(1:100, sapply(x, length)), value = unlist(x))
dt[, .(index = first(index)), value][order(value)]
Here's the full dataset from the original problem if that's helpful.
library(RcppAlgos)
library(memoise)
library(data.table)

jgo <- function(n) {
  ## primes and 1 have no non-trivial factorisations
  if (isPrimeRcpp(n) || n == 1) return(n)
  ## non-trivial divisors of n, paired with their complementary factors
  div <- divisorsRcpp(n)
  div <- div[-c(1, length(div))]
  div <- Map(function(a, b) c(a, b), div, rev(div))
  ## recurse on the first factor of each pair to build longer factorisations
  div2 <- lapply(div, function(x) lapply(jgo(x[1]), c, x[2]))
  unique(lapply(c(div, unlist(div2, recursive = FALSE)), sort))
}
jgo <- memoise(jgo)

x <- lapply(1:12500, function(x) x - sapply(jgo(x), sum) + sapply(jgo(x), length))
Here is another approach that uses match to find the first indices. It slightly outperforms the other suggested approaches and produces output similar to that in the OP's question:
## dummy data
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
## use match to find first indices
first_indices_match <- function(x) {
  seq_x <- seq_along(x)
  ## rep() tags each element of unlist(x) with the list index it came from;
  ## match() finds the first position of each value in the flattened vector
  matrix(c(seq_x, rep(seq_x, lengths(x))[match(seq_x, unlist(x))]),
         ncol = 2, dimnames = list(NULL, c("value", "index")))
}
head(first_indices_match(x))
#> value index
#> [1,] 1 1
#> [2,] 2 7
#> [3,] 3 45
#> [4,] 4 38
#> [5,] 5 31
#> [6,] 6 7
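To spell out the trick with a toy sketch (made-up values, not the benchmark data): rep() tags every element of the flattened vector with the list index it came from, and match() returns the position of the first occurrence of each value within that flattened vector.

## toy sketch of the match trick
x_small <- list(c(2, 3), c(1, 3), c(2, 4))
flat <- unlist(x_small)                            # 2 3 1 3 2 4
src  <- rep(seq_along(x_small), lengths(x_small))  # 1 1 2 2 3 3 (source list index of each element)
src[match(1:4, flat)]                              # 2 1 1 3 (first list index for values 1:4)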
## data.table approach
library(data.table)
first_indices_dt <- function(x) {
  dt <- data.table(index = rep(seq_along(x), sapply(x, length)), value = unlist(x))
  dt[, .(index = first(index)), value][order(value)]
}
head(first_indices_dt(x))
#> value index
#> 1: 1 1
#> 2: 2 7
#> 3: 3 45
#> 4: 4 38
#> 5: 5 31
#> 6: 6 7
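As a small aside, the grouping and ordering can be combined in one step with data.table's keyby (which sorts the result by the grouping column); this is just a variant, the result should be the same:

## equivalent one-step variant using keyby
dt <- data.table(index = rep(seq_along(x), sapply(x, length)), value = unlist(x))
dt[, .(index = first(index)), keyby = value]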
Benchmarks
## stack + remove-duplicates approach
first_indices_shree <- function(x) {
  names(x) <- seq_along(x)
  ## stack() flattens the named list into (values, ind); keep the first row per value
  (d <- stack(x))[!duplicated(d$values), ]
}
## benchmark across several list sizes
bnch <- bench::press(
  n_size = c(100, 1E3, 1E4),
  {
    x <- replicate(n_size, sample(n_size, sample(10, 1)))
    bench::mark(
      match = first_indices_match(x),
      shree = first_indices_shree(x),
      dt = first_indices_dt(x),
      check = FALSE
    )
  }
)
#> # A tibble: 9 x 7
#> expression n_size min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 match 100 18.17µs 21.2µs 45639. 637.3KB 27.4
#> 2 shree 100 361.88µs 411.06µs 2307. 106.68KB 11.1
#> 3 dt 100 759.17µs 898.26µs 936. 264.58KB 8.51
#> 4 match 1000 158.34µs 169.9µs 5293. 164.15KB 30.8
#> 5 shree 1000 1.54ms 1.71ms 567. 412.52KB 13.2
#> 6 dt 1000 1.19ms 1.4ms 695. 372.13KB 10.7
#> 7 match 10000 3.09ms 3.69ms 255. 1.47MB 15.9
#> 8 shree 10000 18.06ms 18.95ms 51.5 4.07MB 12.9
#> 9 dt 10000 5.65ms 6.33ms 149. 2.79MB 20.5
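Note that check = FALSE is needed because the three functions return different container types (a matrix, a data.frame, and a data.table), and the match version keeps NA rows for values that never occur while the other two drop them. A quick sanity check along these lines, comparing only the values that actually occur, can confirm the match and data.table results agree:

## sanity check on the dummy data: compare first indices for the values that occur
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
m <- first_indices_match(x)
d <- first_indices_dt(x)
all.equal(m[d$value, "index"], d$index)  # should be TRUE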