Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize intersect of rows and columns in a matrix?

In a matrix, e.g. M1, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive.

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,]   NA   NA   NA 2004   NA 2006   NA 2008 2009    NA  2011  2012    NA    NA    NA
 [2,]   NA 2002   NA 2004   NA   NA 2007   NA   NA  2010  2011    NA  2013  2014    NA
 [3,]   NA   NA   NA 2004 2005 2006 2007 2008 2009    NA    NA  2012  2013    NA  2015
 [4,]   NA 2002   NA 2004 2005 2006 2007 2008   NA  2010  2011    NA  2013    NA    NA
 [5,] 2001   NA   NA   NA 2005 2006 2007 2008   NA  2010    NA  2012  2013  2014    NA
 [6,] 2001   NA 2003 2004 2005 2006 2007 2008 2009  2010  2011  2012    NA  2014    NA
 [7,] 2001 2002   NA   NA 2005   NA 2007   NA 2009    NA  2011    NA    NA  2014  2015
 [8,] 2001 2002   NA 2004 2005 2006   NA   NA   NA  2010    NA    NA  2013    NA  2015
 [9,]   NA 2002   NA 2004 2005   NA 2007   NA   NA  2010  2011    NA    NA    NA    NA
[10,] 2001 2002   NA 2004   NA   NA   NA   NA   NA  2010    NA  2012    NA  2014  2015

Because there is no obvious intersect, a single Reduce(intersect...) attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold n.row. The result is filtered for a minimum of years n.col. I wrote this function,

findBestIntersect <- function(M, min.row=5, min.col=3) {
  ## min.row: minimum number of rows (countries) to analyze
  ## min.col: minimum number of complete columns (years)
  # put matrices with row combn into list (HUGE!)
  L1 <- lapply(min.row:(nrow(M) - 1), function(x)
    combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
  # select lists w/ def. number of complete columns
  slc <- sapply(L1, function(y)  # numbers of lists
    which(sapply(y, function(x)
      sum(!(apply(x, 2, function(i) any(is.na(i))))))
      >= min.col))
  # list selected lists
  L2 <- Map(function(x, i)
    x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
  # find intersects
  L3 <- rapply(L2, function(l)
    as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
    how="list")
  return(unique(unlist(L3, recursive=FALSE)))
}

which gives me the desired result for M1 in no time.

> system.time(best.yrs.1 <- findBestIntersect(M1))
   user  system elapsed 
   0.06    0.00    0.07 

> best.yrs.1
[[1]]
[1] 2002 2004 2010

However the performance for M2 was only just acceptable (RAM usage around 1.1 GB),

> system.time(best.yrs.2 <- findBestIntersect(M2))
   user  system elapsed 
  79.90    0.39   82.76 
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015

[[2]]
[1] 2002 2014 2015

[[3]]
[1] 2003 2009 2010

and you don't want to try this with M3 (blasts 32 GB RAM) which resembles my real matrix:

# best.yrs.3 <- findBestIntersect(M3)

Probably the biggest flaw of the function is that L1 becomes too big very fast.

So, my question is, would there be a better method that is also applicable to M3? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages.

Data

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
like image 902
jay.sf Avatar asked Aug 17 '19 07:08

jay.sf


People also ask

How do you interchange rows and columns in a matrix?

Any 2 columns (or rows) of a matrix can be exchanged. If the ith and jth rows are exchanged, it is shown by Ri ↔ Rj and if the ith and jth columns are exchanged, it is shown by Ci ↔ Cj.

What is the intersection between rows and columns?

A cell is the intersection point of a column and a row.


2 Answers

I wrote a coded_best_intersect function that relies on creating a for loop dynamically in a code_maker function. It evaluates M3 in 30 seconds. Because the code generates a list, I am depending on data.table for rbindlist and the print method.

library(data.table)

code_maker function:

code_maker <- function(non_NA_M, n, k, min.col) {
  ## initializing for results
  res <- list()
  z <- 1
  ## initializing naming
  col_names <- colnames(non_NA_M)
  i_s <- paste0('i', seq_len(k))
  ## create the foor loop text. It looks like this mostly
  ## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
  for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')), 
                     n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k), 
                     '=non_NA_M[', i_s, ', ] ',
                     c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '', 
                     '\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ', 
                     collapse='\n')
  ## create the assignment back to the results which looks like
  ## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)), 
  ##                        YEARS=list(col_names[non_na_sumsk]))
  inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
                       k, '), ROWS=list(c( ', paste0(i_s, collapse=', '), 
                       ')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
  ## combines the loop parts and closes the for with }}}
  for_loop <- paste(for_loop, 
                    inner_text, 
                    paste0(rep('}', k), collapse=''))
  ## evaluate - the evaluation will assign back to res[[i]]  
  eval(parse(text=for_loop))
  res <- rbindlist(res)
  if (length(res) == 0) { #to return emtpy data.table with the correct fields
    return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
  }
  res$M <- k
  return(res)
}

coded_best_intersect function:

coded_best_intersect <- function(M, min.row=5, min.col=3) {
  colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
  n_row <- nrow(M)
  non_NA <- !is.na(M)
  n_combos <- min.row:(n_row - 1)
  res2 <- list()
  for (i in seq_along(n_combos)) {
    res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
    if (nrow(res2[[i]]) == 0) {
      break
    }
  }
  return(res2)
}

This is e.g. the code generated on the fly for k=5:

# for (i1 in 1:5) {
#   non_na_sums1=non_NA_M[i1, ] 
#   if (sum(non_na_sums1) < 3) {next} 
#   for (i2 in (i1 + 1):6) {
#     non_na_sums2=non_NA_M[i2, ] & non_na_sums1
#     if (sum(non_na_sums2) < 3) {next} 
#     for (i3 in (i2 + 1):7) {
#       non_na_sums3=non_NA_M[i3, ] & non_na_sums2
#       if (sum(non_na_sums3) < 3) {next} 
#       for (i4 in (i3 + 1):8) {
#         non_na_sums4=non_NA_M[i4, ] & non_na_sums3
#         if (sum(non_na_sums4) < 3) {next} 
#         for (i5 in (i4 + 1):9) {
#           non_na_sums5=non_NA_M[i5, ] & non_na_sums4
#           if (sum(non_na_sums5) < 3) {next} 
#           for (i6 in (i5 + 1):10) {
#             non_na_sums6=non_NA_M[i6, ] & non_na_sums5
#             if (sum(non_na_sums6) < 3) {next}  
#             res[[z]] <- data.table(M=k, N=sum(non_na_sums6), 
#                                    ROWS=list(c( i1, i2, i3, i4, i5, i6)),
#                                    YEARS=list(col_names[non_na_sums6]))
#             z <- z + 1 }}}}}}

You can likely notice the {next} which is a way for it to skip a combination if there's no possible way to get a minimum of 3 columns. And while it looks like it is all hard-coded, the code is actually a string generated, parsed, and then evaluated.

Usage and Performance

Matrix M1:

system.time(final1 <- coded_best_intersect(M1))
   user  system elapsed 
      0       0       0 
data.table::rbindlist(final1)[order(-M*N)]
   M N           ROWS          YEARS
1: 5 3  2, 4, 8, 9,10 2002,2004,2010

Matrix M2:

system.time(final2 <- coded_best_intersect(M2))
   user  system elapsed 
   0.08    0.00    0.08 
data.table::rbindlist(final2)[order(-M*N)]
     M N                  ROWS               YEARS
  1: 7 3  6, 8,11,12,13,16,...      2002,2012,2013
  2: 5 4         6, 8,13,16,17 2002,2012,2013,2015
  3: 5 4         8,11,12,13,17 2002,2012,2013,2014
  4: 6 3      1, 4, 8,13,17,20      2002,2014,2015
  5: 6 3      2, 5, 6,10,14,17      2003,2006,2008
 ---                                              
126: 5 3        10,12,13,17,20      2002,2008,2014
127: 5 3        10,12,14,17,20      2003,2008,2014
128: 5 3        11,12,13,16,17      2002,2012,2013
129: 5 3        11,12,13,17,20      2002,2012,2014
130: 5 3        12,13,15,16,19      2001,2002,2013

Matrix M3:

system.time(final3 <- coded_best_intersect(M3))
   user  system elapsed 
  29.37    0.05   29.54 
data.table::rbindlist(final3)[order(-M*N)]
       M N              ROWS                             YEARS
    1: 6 7  1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
    2: 5 8     1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
    3: 5 8     1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
    4: 5 8     2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
    5: 5 8     3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
   ---                                                        
52374: 5 3    23,24,25,30,31                    1979,1997,2002
52375: 5 3    23,25,28,30,31                    1979,1992,2002
52376: 5 3    24,25,26,30,31                    1983,1997,2002
52377: 5 3    24,25,28,30,31                    1979,1983,2002
52378: 5 3    24,26,28,30,31                    1983,1986,2002

To put the selected part of a result into a character string, you can do e.g. the following:

x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1])  # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"

Note: See edit history for two other very different approaches. The first was melt and join techniques which blew up the memory. The second approach was using RcppAlgos::comboGeneral to evaluate a function.

like image 51
Cole Avatar answered Sep 30 '22 22:09

Cole


This is a trivial problem using mixed integer programming and can be solved very quickly even with weak open source solver like glpk. I am using ompr package for mathematical modeling (more info on ompr) and have included the model logic as comments in code. Note that my random data is different than OP's due to different R versions I guess.

Total run time was around a minute (i.e. actual solve time is even less) for M3 when model set to maximize data for at most 15 years. This method will easily scale up for even larger instances.

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15 

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # cell can be kept only when row is not selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
  # cell can be kept only when column is not selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # keep at most n_years columns i.e. remove at least (nc - n_years) columns
  # I used >= instead of == to avoid infeasiblity
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))

Results -

solver_status(model)
# [1] "optimal"    <- indicates guaranteed optimum (at least one of the many possible)

# get rows to remove
rm_rows <- model %>% 
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% pull(i) %>% print()

# [1]  1  2  3  4  6  8  9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31

# get columns to remove
rm_cols <- model %>% 
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% pull(j) %>% print()

# [1]  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48

result <- M3[-rm_rows, -rm_cols, drop = F]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011
like image 43
Shree Avatar answered Sep 30 '22 23:09

Shree