In a matrix, e.g. M1
, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive.
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA
[2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA
[3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015
[4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA
[5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA
[6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA
[7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015
[8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015
[9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA
[10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015
Because there is no obvious intersect, a single Reduce(intersect...)
attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold n.row
. The result is filtered for a minimum of years n.col
. I wrote this function,
findBestIntersect <- function(M, min.row=5, min.col=3) {
## min.row: minimum number of rows (countries) to analyze
## min.col: minimum number of complete columns (years)
# put matrices with row combn into list (HUGE!)
L1 <- lapply(min.row:(nrow(M) - 1), function(x)
combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
# select lists w/ def. number of complete columns
slc <- sapply(L1, function(y) # numbers of lists
which(sapply(y, function(x)
sum(!(apply(x, 2, function(i) any(is.na(i))))))
>= min.col))
# list selected lists
L2 <- Map(function(x, i)
x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
# find intersects
L3 <- rapply(L2, function(l)
as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
how="list")
return(unique(unlist(L3, recursive=FALSE)))
}
which gives me the desired result for M1
in no time.
> system.time(best.yrs.1 <- findBestIntersect(M1))
user system elapsed
0.06 0.00 0.07
> best.yrs.1
[[1]]
[1] 2002 2004 2010
However the performance for M2
was only just acceptable (RAM usage around 1.1 GB),
> system.time(best.yrs.2 <- findBestIntersect(M2))
user system elapsed
79.90 0.39 82.76
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015
[[2]]
[1] 2002 2014 2015
[[3]]
[1] 2003 2009 2010
and you don't want to try this with M3
(blasts 32 GB RAM) which resembles my real matrix:
# best.yrs.3 <- findBestIntersect(M3)
Probably the biggest flaw of the function is that L1
becomes too big very fast.
So, my question is, would there be a better method that is also applicable to M3
? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages.
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
Any 2 columns (or rows) of a matrix can be exchanged. If the ith and jth rows are exchanged, it is shown by Ri ↔ Rj and if the ith and jth columns are exchanged, it is shown by Ci ↔ Cj.
A cell is the intersection point of a column and a row.
I wrote a coded_best_intersect
function that relies on creating a for loop dynamically in a code_maker
function. It evaluates M3
in 30 seconds. Because the code generates a list, I am depending on data.table
for rbindlist
and the print method.
library(data.table)
code_maker
function:code_maker <- function(non_NA_M, n, k, min.col) {
## initializing for results
res <- list()
z <- 1
## initializing naming
col_names <- colnames(non_NA_M)
i_s <- paste0('i', seq_len(k))
## create the foor loop text. It looks like this mostly
## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')),
n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k),
'=non_NA_M[', i_s, ', ] ',
c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '',
'\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ',
collapse='\n')
## create the assignment back to the results which looks like
## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)),
## YEARS=list(col_names[non_na_sumsk]))
inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
k, '), ROWS=list(c( ', paste0(i_s, collapse=', '),
')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
## combines the loop parts and closes the for with }}}
for_loop <- paste(for_loop,
inner_text,
paste0(rep('}', k), collapse=''))
## evaluate - the evaluation will assign back to res[[i]]
eval(parse(text=for_loop))
res <- rbindlist(res)
if (length(res) == 0) { #to return emtpy data.table with the correct fields
return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
}
res$M <- k
return(res)
}
coded_best_intersect
function:coded_best_intersect <- function(M, min.row=5, min.col=3) {
colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
n_row <- nrow(M)
non_NA <- !is.na(M)
n_combos <- min.row:(n_row - 1)
res2 <- list()
for (i in seq_along(n_combos)) {
res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
if (nrow(res2[[i]]) == 0) {
break
}
}
return(res2)
}
This is e.g. the code generated on the fly for k=5
:
# for (i1 in 1:5) {
# non_na_sums1=non_NA_M[i1, ]
# if (sum(non_na_sums1) < 3) {next}
# for (i2 in (i1 + 1):6) {
# non_na_sums2=non_NA_M[i2, ] & non_na_sums1
# if (sum(non_na_sums2) < 3) {next}
# for (i3 in (i2 + 1):7) {
# non_na_sums3=non_NA_M[i3, ] & non_na_sums2
# if (sum(non_na_sums3) < 3) {next}
# for (i4 in (i3 + 1):8) {
# non_na_sums4=non_NA_M[i4, ] & non_na_sums3
# if (sum(non_na_sums4) < 3) {next}
# for (i5 in (i4 + 1):9) {
# non_na_sums5=non_NA_M[i5, ] & non_na_sums4
# if (sum(non_na_sums5) < 3) {next}
# for (i6 in (i5 + 1):10) {
# non_na_sums6=non_NA_M[i6, ] & non_na_sums5
# if (sum(non_na_sums6) < 3) {next}
# res[[z]] <- data.table(M=k, N=sum(non_na_sums6),
# ROWS=list(c( i1, i2, i3, i4, i5, i6)),
# YEARS=list(col_names[non_na_sums6]))
# z <- z + 1 }}}}}}
You can likely notice the {next}
which is a way for it to skip a combination if there's no possible way to get a minimum of 3 columns. And while it looks like it is all hard-coded, the code is actually a string generated, parsed, and then evaluated.
Matrix M1
:
system.time(final1 <- coded_best_intersect(M1))
user system elapsed
0 0 0
data.table::rbindlist(final1)[order(-M*N)]
M N ROWS YEARS
1: 5 3 2, 4, 8, 9,10 2002,2004,2010
Matrix M2
:
system.time(final2 <- coded_best_intersect(M2))
user system elapsed
0.08 0.00 0.08
data.table::rbindlist(final2)[order(-M*N)]
M N ROWS YEARS
1: 7 3 6, 8,11,12,13,16,... 2002,2012,2013
2: 5 4 6, 8,13,16,17 2002,2012,2013,2015
3: 5 4 8,11,12,13,17 2002,2012,2013,2014
4: 6 3 1, 4, 8,13,17,20 2002,2014,2015
5: 6 3 2, 5, 6,10,14,17 2003,2006,2008
---
126: 5 3 10,12,13,17,20 2002,2008,2014
127: 5 3 10,12,14,17,20 2003,2008,2014
128: 5 3 11,12,13,16,17 2002,2012,2013
129: 5 3 11,12,13,17,20 2002,2012,2014
130: 5 3 12,13,15,16,19 2001,2002,2013
Matrix M3
:
system.time(final3 <- coded_best_intersect(M3))
user system elapsed
29.37 0.05 29.54
data.table::rbindlist(final3)[order(-M*N)]
M N ROWS YEARS
1: 6 7 1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
2: 5 8 1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
3: 5 8 1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
4: 5 8 2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
5: 5 8 3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
---
52374: 5 3 23,24,25,30,31 1979,1997,2002
52375: 5 3 23,25,28,30,31 1979,1992,2002
52376: 5 3 24,25,26,30,31 1983,1997,2002
52377: 5 3 24,25,28,30,31 1979,1983,2002
52378: 5 3 24,26,28,30,31 1983,1986,2002
To put the selected part of a result into a character string, you can do e.g. the following:
x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1]) # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"
Note: See edit history for two other very different approaches. The first was melt
and join techniques which blew up the memory. The second approach was using RcppAlgos::comboGeneral
to evaluate a function.
This is a trivial problem using mixed integer programming and can be solved very quickly even with weak open source solver like glpk
. I am using ompr
package for mathematical modeling (more info on ompr) and have included the model logic as comments in code. Note that my random data is different than OP's due to different R versions I guess.
Total run time was around a minute (i.e. actual solve time is even less) for M3
when model set to maximize data for at most 15 years. This method will easily scale up for even larger instances.
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n_years columns i.e. remove at least (nc - n_years) columns
# I used >= instead of == to avoid infeasiblity
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>%
# solve using free glpk solver
solve_model(with_ROI(solver = "glpk"))
Results -
solver_status(model)
# [1] "optimal" <- indicates guaranteed optimum (at least one of the many possible)
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>% pull(i) %>% print()
# [1] 1 2 3 4 6 8 9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>% pull(j) %>% print()
# [1] 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48
result <- M3[-rm_rows, -rm_cols, drop = F]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With