In a matrix, e.g. <code>M1</code>, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive. <pre class="prettyprint"><code>> M1 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA [2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA [3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015 [4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA [5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA [6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA [7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015 [8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015 [9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA [10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015 </code></pre> Because there is no obvious intersect, a single <code>Reduce(intersect...)</code> attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold <code>n.row</code>. The result is filtered for a minimum of years <code>n.col</code>. I wrote this function, <pre class="prettyprint"><code>findBestIntersect <- function(M, min.row=5, min.col=3) { ## min.row: minimum number of rows (countries) to analyze ## min.col: minimum number of complete columns (years) # put matrices with row combn into list (HUGE!) L1 <- lapply(min.row:(nrow(M) - 1), function(x) combn(nrow(M), x, function(i) M[i, ], simplify=FALSE)) # select lists w/ def. number of complete columns slc <- sapply(L1, function(y) # numbers of lists which(sapply(y, function(x) sum(!(apply(x, 2, function(i) any(is.na(i)))))) >= min.col)) # list selected lists L2 <- Map(function(x, i) x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0]) # find intersects L3 <- rapply(L2, function(l) as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))), how="list") return(unique(unlist(L3, recursive=FALSE))) } </code></pre> which gives me the desired result for <code>M1</code> in no time. <pre class="prettyprint"><code>> system.time(best.yrs.1 <- findBestIntersect(M1)) user system elapsed 0.06 0.00 0.07 > best.yrs.1 [[1]] [1] 2002 2004 2010 </code></pre> However the performance for <code>M2</code> was only just acceptable (RAM usage around 1.1 GB), <pre class="prettyprint"><code>> system.time(best.yrs.2 <- findBestIntersect(M2)) user system elapsed 79.90 0.39 82.76 > head(best.yrs.2, 3) [[1]] [1] 2002 2009 2015 [[2]] [1] 2002 2014 2015 [[3]] [1] 2003 2009 2010 </code></pre> and you don't want to try this with <code>M3</code> (blasts 32 GB RAM) which resembles my real matrix: <pre class="prettyprint"><code># best.yrs.3 <- findBestIntersect(M3) </code></pre> Probably the biggest flaw of the function is that <code>L1</code> becomes too big very fast. So, my question is, would there be a better method that is also applicable to <code>M3</code>? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages. <h3>Data</h3> <pre class="prettyprint"><code>set.seed(42) tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10) M1 <- t(replicate(10, 2001:2015, simplify=TRUE)) M1[tf] <- NA tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20) M2 <- t(replicate(20, 2001:2015, simplify=TRUE)) M2[tf] <- NA tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31) M3 <- t(replicate(31, 1969:2016, simplify=TRUE)) M3[tf] <- NA </code></pre>

I wrote a <code>coded_best_intersect</code> function that relies on creating a for loop dynamically in a <code>code_maker</code> function. It evaluates <code>M3</code> in 30 seconds. Because the code generates a list, I am depending on <code>data.table</code> for <code>rbindlist</code> and the print method. <pre class="prettyprint"><code>library(data.table) </code></pre> <h3> <code>code_maker</code> function:</h3> <pre class="prettyprint"><code>code_maker <- function(non_NA_M, n, k, min.col) { ## initializing for results res <- list() z <- 1 ## initializing naming col_names <- colnames(non_NA_M) i_s <- paste0('i', seq_len(k)) ## create the foor loop text. It looks like this mostly ## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}} for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')), n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k), '=non_NA_M[', i_s, ', ] ', c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '', '\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ', collapse='\n') ## create the assignment back to the results which looks like ## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)), ## YEARS=list(col_names[non_na_sumsk])) inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums', k, '), ROWS=list(c( ', paste0(i_s, collapse=', '), ')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1') ## combines the loop parts and closes the for with }}} for_loop <- paste(for_loop, inner_text, paste0(rep('}', k), collapse='')) ## evaluate - the evaluation will assign back to res[[i]] eval(parse(text=for_loop)) res <- rbindlist(res) if (length(res) == 0) { #to return emtpy data.table with the correct fields return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list())) } res$M <- k return(res) } </code></pre> <h3> <code>coded_best_intersect</code> function:</h3> <pre class="prettyprint"><code>coded_best_intersect <- function(M, min.row=5, min.col=3) { colnames(M) <- apply(M, 2, function(x) na.omit(x)[1]) n_row <- nrow(M) non_NA <- !is.na(M) n_combos <- min.row:(n_row - 1) res2 <- list() for (i in seq_along(n_combos)) { res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col) if (nrow(res2[[i]]) == 0) { break } } return(res2) } </code></pre> This is e.g. the code generated on the fly for <code>k=5</code>: <pre class="prettyprint"><code># for (i1 in 1:5) { # non_na_sums1=non_NA_M[i1, ] # if (sum(non_na_sums1) < 3) {next} # for (i2 in (i1 + 1):6) { # non_na_sums2=non_NA_M[i2, ] & non_na_sums1 # if (sum(non_na_sums2) < 3) {next} # for (i3 in (i2 + 1):7) { # non_na_sums3=non_NA_M[i3, ] & non_na_sums2 # if (sum(non_na_sums3) < 3) {next} # for (i4 in (i3 + 1):8) { # non_na_sums4=non_NA_M[i4, ] & non_na_sums3 # if (sum(non_na_sums4) < 3) {next} # for (i5 in (i4 + 1):9) { # non_na_sums5=non_NA_M[i5, ] & non_na_sums4 # if (sum(non_na_sums5) < 3) {next} # for (i6 in (i5 + 1):10) { # non_na_sums6=non_NA_M[i6, ] & non_na_sums5 # if (sum(non_na_sums6) < 3) {next} # res[[z]] <- data.table(M=k, N=sum(non_na_sums6), # ROWS=list(c( i1, i2, i3, i4, i5, i6)), # YEARS=list(col_names[non_na_sums6])) # z <- z + 1 }}}}}} </code></pre> You can likely notice the <code>{next}</code> which is a way for it to skip a combination if there's no possible way to get a minimum of 3 columns. And while it looks like it is all hard-coded, the code is actually a string generated, parsed, and then evaluated. <h3>Usage and Performance</h3> Matrix <code>M1</code>: <pre class="prettyprint"><code>system.time(final1 <- coded_best_intersect(M1)) user system elapsed 0 0 0 data.table::rbindlist(final1)[order(-M*N)] M N ROWS YEARS 1: 5 3 2, 4, 8, 9,10 2002,2004,2010 </code></pre> Matrix <code>M2</code>: <pre class="prettyprint"><code>system.time(final2 <- coded_best_intersect(M2)) user system elapsed 0.08 0.00 0.08 data.table::rbindlist(final2)[order(-M*N)] M N ROWS YEARS 1: 7 3 6, 8,11,12,13,16,... 2002,2012,2013 2: 5 4 6, 8,13,16,17 2002,2012,2013,2015 3: 5 4 8,11,12,13,17 2002,2012,2013,2014 4: 6 3 1, 4, 8,13,17,20 2002,2014,2015 5: 6 3 2, 5, 6,10,14,17 2003,2006,2008 --- 126: 5 3 10,12,13,17,20 2002,2008,2014 127: 5 3 10,12,14,17,20 2003,2008,2014 128: 5 3 11,12,13,16,17 2002,2012,2013 129: 5 3 11,12,13,17,20 2002,2012,2014 130: 5 3 12,13,15,16,19 2001,2002,2013 </code></pre> Matrix <code>M3</code>: <pre class="prettyprint"><code>system.time(final3 <- coded_best_intersect(M3)) user system elapsed 29.37 0.05 29.54 data.table::rbindlist(final3)[order(-M*N)] M N ROWS YEARS 1: 6 7 1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,... 2: 5 8 1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,... 3: 5 8 1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,... 4: 5 8 2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,... 5: 5 8 3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,... --- 52374: 5 3 23,24,25,30,31 1979,1997,2002 52375: 5 3 23,25,28,30,31 1979,1992,2002 52376: 5 3 24,25,26,30,31 1983,1997,2002 52377: 5 3 24,25,28,30,31 1979,1983,2002 52378: 5 3 24,26,28,30,31 1983,1986,2002 </code></pre> To put the selected part of a result into a character string, you can do e.g. the following: <pre class="prettyprint"><code>x <- data.table::rbindlist(final3)[order(-M*N)] el(x$YEARS[1]) # select `YEARS` of result-row `1:` # [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003" </code></pre> <hr> Note: See edit history for two other very different approaches. The first was <code>melt</code> and join techniques which blew up the memory. The second approach was using <code>RcppAlgos::comboGeneral</code> to evaluate a function.

This is a trivial problem using mixed integer programming and can be solved very quickly even with weak open source solver like <code>glpk</code>. I am using <code>ompr</code> package for mathematical modeling (more info on ompr) and have included the model logic as comments in code. Note that my random data is different than OP's due to different R versions I guess. Total run time was around a minute (i.e. actual solve time is even less) for <code>M3</code> when model set to maximize data for at most 15 years. This method will easily scale up for even larger instances. <pre class="prettyprint"><code>library(dplyr) library(ROI) library(ROI.plugin.glpk) library(ompr) library(ompr.roi) set.seed(42) tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31) M3 <- t(replicate(31, 1969:2016, simplify=TRUE)) M3[tf] <- NA m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1 nr <- nrow(m) nc <- ncol(m) n_years <- 15 model <- MIPModel() %>% # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0 add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>% # rm_row[i] is 1 if row i is selected for removal else 0 add_variable(rm_row[i], i = 1:nr, type = "binary") %>% # rm_col[j] is 1 if column j is selected for removal else 0 add_variable(rm_col[j], j = 1:nc, type = "binary") %>% # maximize good cells kept set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% # cell can be kept only when row is not selected for removal add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>% # cell can be kept only when column is not selected for removal add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>% # only non-NA values can be kept add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% # keep at most n_years columns i.e. remove at least (nc - n_years) columns # I used >= instead of == to avoid infeasiblity add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% # solve using free glpk solver solve_model(with_ROI(solver = "glpk")) </code></pre> Results - <pre class="prettyprint"><code>solver_status(model) # [1] "optimal" <- indicates guaranteed optimum (at least one of the many possible) # get rows to remove rm_rows <- model %>% get_solution(rm_row[i]) %>% filter(value > 0) %>% pull(i) %>% print() # [1] 1 2 3 4 6 8 9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31 # get columns to remove rm_cols <- model %>% get_solution(rm_col[j]) %>% filter(value > 0) %>% pull(j) %>% print() # [1] 2 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 # [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48 result <- M3[-rm_rows, -rm_cols, drop = F] [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1969 1974 1994 2005 2010 2011 [2,] 1969 1974 1994 2005 2010 2011 [3,] 1969 1974 1994 2005 2010 2011 [4,] 1969 1974 1994 2005 2010 2011 [5,] 1969 1974 1994 2005 2010 2011 [6,] 1969 1974 1994 2005 2010 2011 </code></pre>

How to optimize intersect of rows and columns in a matrix?

Tags:

performance

algorithm

r

intersection

In a matrix, e.g. M1, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive.

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,]   NA   NA   NA 2004   NA 2006   NA 2008 2009    NA  2011  2012    NA    NA    NA
 [2,]   NA 2002   NA 2004   NA   NA 2007   NA   NA  2010  2011    NA  2013  2014    NA
 [3,]   NA   NA   NA 2004 2005 2006 2007 2008 2009    NA    NA  2012  2013    NA  2015
 [4,]   NA 2002   NA 2004 2005 2006 2007 2008   NA  2010  2011    NA  2013    NA    NA
 [5,] 2001   NA   NA   NA 2005 2006 2007 2008   NA  2010    NA  2012  2013  2014    NA
 [6,] 2001   NA 2003 2004 2005 2006 2007 2008 2009  2010  2011  2012    NA  2014    NA
 [7,] 2001 2002   NA   NA 2005   NA 2007   NA 2009    NA  2011    NA    NA  2014  2015
 [8,] 2001 2002   NA 2004 2005 2006   NA   NA   NA  2010    NA    NA  2013    NA  2015
 [9,]   NA 2002   NA 2004 2005   NA 2007   NA   NA  2010  2011    NA    NA    NA    NA
[10,] 2001 2002   NA 2004   NA   NA   NA   NA   NA  2010    NA  2012    NA  2014  2015

Because there is no obvious intersect, a single Reduce(intersect...) attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold n.row. The result is filtered for a minimum of years n.col. I wrote this function,

findBestIntersect <- function(M, min.row=5, min.col=3) {
  ## min.row: minimum number of rows (countries) to analyze
  ## min.col: minimum number of complete columns (years)
  # put matrices with row combn into list (HUGE!)
  L1 <- lapply(min.row:(nrow(M) - 1), function(x)
    combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
  # select lists w/ def. number of complete columns
  slc <- sapply(L1, function(y)  # numbers of lists
    which(sapply(y, function(x)
      sum(!(apply(x, 2, function(i) any(is.na(i))))))
      >= min.col))
  # list selected lists
  L2 <- Map(function(x, i)
    x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
  # find intersects
  L3 <- rapply(L2, function(l)
    as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
    how="list")
  return(unique(unlist(L3, recursive=FALSE)))
}

which gives me the desired result for M1 in no time.

> system.time(best.yrs.1 <- findBestIntersect(M1))
   user  system elapsed 
   0.06    0.00    0.07 

> best.yrs.1
[[1]]
[1] 2002 2004 2010

However the performance for M2 was only just acceptable (RAM usage around 1.1 GB),

> system.time(best.yrs.2 <- findBestIntersect(M2))
   user  system elapsed 
  79.90    0.39   82.76 
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015

[[2]]
[1] 2002 2014 2015

[[3]]
[1] 2003 2009 2010

and you don't want to try this with M3 (blasts 32 GB RAM) which resembles my real matrix:

# best.yrs.3 <- findBestIntersect(M3)

Probably the biggest flaw of the function is that L1 becomes too big very fast.

So, my question is, would there be a better method that is also applicable to M3? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages.

Data

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

902

asked Aug 17 '19 07:08

jay.sf

2 Answers

I wrote a coded_best_intersect function that relies on creating a for loop dynamically in a code_maker function. It evaluates M3 in 30 seconds. Because the code generates a list, I am depending on data.table for rbindlist and the print method.

library(data.table)

`code_maker` function:

code_maker <- function(non_NA_M, n, k, min.col) {
  ## initializing for results
  res <- list()
  z <- 1
  ## initializing naming
  col_names <- colnames(non_NA_M)
  i_s <- paste0('i', seq_len(k))
  ## create the foor loop text. It looks like this mostly
  ## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
  for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')), 
                     n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k), 
                     '=non_NA_M[', i_s, ', ] ',
                     c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '', 
                     '\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ', 
                     collapse='\n')
  ## create the assignment back to the results which looks like
  ## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)), 
  ##                        YEARS=list(col_names[non_na_sumsk]))
  inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
                       k, '), ROWS=list(c( ', paste0(i_s, collapse=', '), 
                       ')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
  ## combines the loop parts and closes the for with }}}
  for_loop <- paste(for_loop, 
                    inner_text, 
                    paste0(rep('}', k), collapse=''))
  ## evaluate - the evaluation will assign back to res[[i]]  
  eval(parse(text=for_loop))
  res <- rbindlist(res)
  if (length(res) == 0) { #to return emtpy data.table with the correct fields
    return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
  }
  res$M <- k
  return(res)
}

`coded_best_intersect` function:

coded_best_intersect <- function(M, min.row=5, min.col=3) {
  colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
  n_row <- nrow(M)
  non_NA <- !is.na(M)
  n_combos <- min.row:(n_row - 1)
  res2 <- list()
  for (i in seq_along(n_combos)) {
    res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
    if (nrow(res2[[i]]) == 0) {
      break
    }
  }
  return(res2)
}

This is e.g. the code generated on the fly for k=5:

# for (i1 in 1:5) {
#   non_na_sums1=non_NA_M[i1, ] 
#   if (sum(non_na_sums1) < 3) {next} 
#   for (i2 in (i1 + 1):6) {
#     non_na_sums2=non_NA_M[i2, ] & non_na_sums1
#     if (sum(non_na_sums2) < 3) {next} 
#     for (i3 in (i2 + 1):7) {
#       non_na_sums3=non_NA_M[i3, ] & non_na_sums2
#       if (sum(non_na_sums3) < 3) {next} 
#       for (i4 in (i3 + 1):8) {
#         non_na_sums4=non_NA_M[i4, ] & non_na_sums3
#         if (sum(non_na_sums4) < 3) {next} 
#         for (i5 in (i4 + 1):9) {
#           non_na_sums5=non_NA_M[i5, ] & non_na_sums4
#           if (sum(non_na_sums5) < 3) {next} 
#           for (i6 in (i5 + 1):10) {
#             non_na_sums6=non_NA_M[i6, ] & non_na_sums5
#             if (sum(non_na_sums6) < 3) {next}  
#             res[[z]] <- data.table(M=k, N=sum(non_na_sums6), 
#                                    ROWS=list(c( i1, i2, i3, i4, i5, i6)),
#                                    YEARS=list(col_names[non_na_sums6]))
#             z <- z + 1 }}}}}}

You can likely notice the {next} which is a way for it to skip a combination if there's no possible way to get a minimum of 3 columns. And while it looks like it is all hard-coded, the code is actually a string generated, parsed, and then evaluated.

Usage and Performance

Matrix M1:

system.time(final1 <- coded_best_intersect(M1))
   user  system elapsed 
      0       0       0 
data.table::rbindlist(final1)[order(-M*N)]
   M N           ROWS          YEARS
1: 5 3  2, 4, 8, 9,10 2002,2004,2010

Matrix M2:

system.time(final2 <- coded_best_intersect(M2))
   user  system elapsed 
   0.08    0.00    0.08 
data.table::rbindlist(final2)[order(-M*N)]
     M N                  ROWS               YEARS
  1: 7 3  6, 8,11,12,13,16,...      2002,2012,2013
  2: 5 4         6, 8,13,16,17 2002,2012,2013,2015
  3: 5 4         8,11,12,13,17 2002,2012,2013,2014
  4: 6 3      1, 4, 8,13,17,20      2002,2014,2015
  5: 6 3      2, 5, 6,10,14,17      2003,2006,2008
 ---                                              
126: 5 3        10,12,13,17,20      2002,2008,2014
127: 5 3        10,12,14,17,20      2003,2008,2014
128: 5 3        11,12,13,16,17      2002,2012,2013
129: 5 3        11,12,13,17,20      2002,2012,2014
130: 5 3        12,13,15,16,19      2001,2002,2013

Matrix M3:

system.time(final3 <- coded_best_intersect(M3))
   user  system elapsed 
  29.37    0.05   29.54 
data.table::rbindlist(final3)[order(-M*N)]
       M N              ROWS                             YEARS
    1: 6 7  1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
    2: 5 8     1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
    3: 5 8     1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
    4: 5 8     2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
    5: 5 8     3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
   ---                                                        
52374: 5 3    23,24,25,30,31                    1979,1997,2002
52375: 5 3    23,25,28,30,31                    1979,1992,2002
52376: 5 3    24,25,26,30,31                    1983,1997,2002
52377: 5 3    24,25,28,30,31                    1979,1983,2002
52378: 5 3    24,26,28,30,31                    1983,1986,2002

To put the selected part of a result into a character string, you can do e.g. the following:

x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1])  # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"

Note: See edit history for two other very different approaches. The first was melt and join techniques which blew up the memory. The second approach was using RcppAlgos::comboGeneral to evaluate a function.

answered Sep 30 '22 22:09

Cole

This is a trivial problem using mixed integer programming and can be solved very quickly even with weak open source solver like glpk. I am using ompr package for mathematical modeling (more info on ompr) and have included the model logic as comments in code. Note that my random data is different than OP's due to different R versions I guess.

Total run time was around a minute (i.e. actual solve time is even less) for M3 when model set to maximize data for at most 15 years. This method will easily scale up for even larger instances.

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15 

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # cell can be kept only when row is not selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
  # cell can be kept only when column is not selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # keep at most n_years columns i.e. remove at least (nc - n_years) columns
  # I used >= instead of == to avoid infeasiblity
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))

Results -

solver_status(model)
# [1] "optimal"    <- indicates guaranteed optimum (at least one of the many possible)

# get rows to remove
rm_rows <- model %>% 
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% pull(i) %>% print()

# [1]  1  2  3  4  6  8  9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31

# get columns to remove
rm_cols <- model %>% 
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% pull(j) %>% print()

# [1]  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48

result <- M3[-rm_rows, -rm_cols, drop = F]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011

answered Sep 30 '22 23:09

Shree

Related questions
                            
                                udunits2 R install: udunits2.h not found
                            
                                Have downloadButton work with observeEvent
                            
                                Stacked bar chart with varying widths in ggplot
                            
                                What does Continuous x aesthetic -- did you forget aes(group=...) mean?
                            
                                How to create zipcode boundaries in R
                            
                                what does %*%+ mean in matrix operations?
                            
                                How to list all S3 methods defined in a specific package / namespace for a particular generic function
                            
                                Add multiple output variables using purrr and a predefined function
                            
                                Find initial conditions for nonlinear models using the nlsLM function
                            
                                Why does tensorflow/keras choke when I try to fit multiple models in parallel?
                            
                                Cross-references in RMarkdown for Word documents
                            
                                Using R-markdown knitr hooks to custom format tables in HTML reports
                            
                                Partial Row Labels Heatmap - R
                            
                                covr shows 0% of coverage while all tests with testthat pass
                            
                                Extract filters from R Shiny Datatable
                            
                                Convert a "loadings" object to a dataframe (R)
                            
                                R lubridate ymd_hms millisecond diff
                            
                                Split & extract part of string (between a "." and digit) in R
                            
                                Collapse error message into a single block when the error message is modified to print in red
                            
                                R: fast (conditional) subsetting where feasible

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to optimize intersect of rows and columns in a matrix?

Tags:

performance

algorithm

r

intersection

Data

jay.sf

People also ask

2 Answers

`code_maker` function:

`coded_best_intersect` function:

Usage and Performance

Cole

Shree

Recent Activity

Donate For Us

How to optimize intersect of rows and columns in a matrix?

Tags:

performance

algorithm

r

intersection

Data

jay.sf

People also ask

2 Answers

code_maker function:

coded_best_intersect function:

Usage and Performance

Cole

Shree

Related questions

Recent Activity

Donate For Us

`code_maker` function:

`coded_best_intersect` function: