I have a dataframe of two columns like so:
Name | Salary |
---|---|
Name 1 | 5000000 |
Name 2 | 7000000 |
Name 3 | 9000000 |
Name 4 | 12000000 |
Name 5 | 14000000 |
I want to find every combination of salaries that reach 20,000,000. It can go over 20,000,000, but once it does, it stops. For example, names 1, 2 and 3 would be a valid combination, as would names 4 and 5.
I would also love to be able to view the salary combinations with their associates names in a dataframe.
Note that my dataframes typically have about 12 to 15 names across 30 different dataframes.
So far, I've tried to wrap my head around combn(), but I could only choose a specific number "m" as far as I know and could not choose a target number when the combinations stop. I also couldn't get anywhere with the "sets" package.
Here is one way that enumerates all the combinations and selects the ones that are feasible. Note that this is very inefficient as since once can write a function that is short circuited. ie if I know that all 3 combinations are greater than the target, then there is no need to do 4,5,6,etc combinations. Also If I know that a subset is greater than a target, there is no need to do any combination containing that particular subset. I ignored all these thus why I said the method is inefficient:
g <- function(df){
n <- seq_len(nrow(df))
df <- df[order(-df$Salary),]
fn <- \(x)if(sum(cumsum(df$Salary[x])>20e6) == 1)df$Name[x]
gn <- \(x)Filter(length, combn(n, x, fn, simplify = FALSE))
unlist(sapply(n, gn), FALSE)
}
g(df)
[[1]]
[1] "Name 5" "Name 4"
[[2]]
[1] "Name 5" "Name 3"
[[3]]
[1] "Name 5" "Name 2"
[[4]]
[1] "Name 4" "Name 3"
[[5]]
[1] "Name 4" "Name 2" "Name 1"
[[6]]
[1] "Name 3" "Name 2" "Name 1"
You can try the following base R code, where the search stops whenever the combination sum reaches the given given target tg
.
The idea is that: We can imagine that the search is along a tree, and transverse stops when the condition is met, where the accumulated visited nodes along the path is treated as a new parent to check the incoming child nodes.
f <- function(df, tg) {
v <- sort(with(df, setNames(Salary, Name)), TRUE)
root <- Map(setNames, v, names(v))
out <- c()
repeat {
leaf <- c()
for (r in root) {
candidates <- v[v < tail(r, 1)]
if (length(candidates)) {
for (k in seq_along(candidates)) {
p <- c(r, candidates[k])
if (sum(p) >= tg) {
out <- c(out, list(p))
} else {
leaf <- c(leaf, list(p))
}
}
}
}
if (length(root) == length(leaf)) {
return(out)
}
root <- leaf
}
}
> f(df, 20e6)
[[1]]
Name5 Name4
14000000 12000000
[[2]]
Name5 Name3
14000000 9000000
[[3]]
Name5 Name2
14000000 7000000
[[4]]
Name4 Name3
12000000 9000000
[[5]]
Name4 Name2 Name1
12000000 7000000 5000000
[[6]]
Name3 Name2 Name1
9000000 7000000 5000000
with the given df
, we compare two approaches
f <- function(df, tg) {
v <- sort(with(df, setNames(Salary, Name)), TRUE)
root <- Map(setNames, v, names(v))
out <- c()
repeat {
leaf <- c()
for (r in root) {
candidates <- v[v < tail(r, 1)]
if (length(candidates)) {
for (k in seq_along(candidates)) {
p <- c(r, candidates[k])
if (sum(p) >= tg) {
out <- c(out, list(p))
} else {
leaf <- c(leaf, list(p))
}
}
}
}
if (length(root) == length(leaf)) {
return(out)
}
root <- leaf
}
}
g <- function(df, tg) {
n <- seq_len(nrow(df))
df <- df[order(-df$Salary), ]
fn <- \(x)if (sum(cumsum(df$Salary[x]) >= tg) == 1) df$Name[x]
gn <- \(x)Filter(length, combn(n, x, fn, simplify = FALSE))
unlist(sapply(n, gn), FALSE)
}
microbenchmark(
TIC = f(df, 20e6),
Onyambu = g(df, 20e6),
unit = "relative"
)
we can see (but we should be aware that the speed also depends on the target value, so the benchmarking below is just for reference regarding OP's specific example data in the post, should be tested in more general cases)
Unit: relative
expr min lq mean median uq max neval
TIC 1.000000 1.00000 1.000000 1.000000 1.000000 1.0000000 100
Onyambu 1.627361 1.62317 1.240159 1.645295 1.625882 0.6493649 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With