Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently picking combinations of Integers

Let's say we have a 5x5 matrix, filled with 0s.

myMatrix <- matrix(rep(0, 25), ncol = 5)

Now, let's pick a triplet of integers between 1 and 5.

triplet <- c(1,2,3)

For all combinations of this triplet we now add 1 in the matrix, with this function:

addCombinationsToMatrix <- function(.matrix, .triplet){
    indexesToChange <- as.matrix(expand.grid(.triplet, .triplet))
    .matrix[indexesToChange] <- .matrix[indexesToChange] + 1
    .matrix
}

Using the function, we go from

myMatrix

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0

to

myMatrix <- addCombinationsToMatrix(myMatrix, triplet)
myMatrix

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    0    0
[2,]    1    1    1    0    0
[3,]    1    1    1    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0

If we pick another triplet we move on to

nextTriplet <- 2:4
myMatrix <- addCombinationsToMatrix(myMatrix, nextTriplet)
myMatrix

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    0    0
[2,]    1    2    2    1    0
[3,]    1    2    2    1    0
[4,]    0    1    1    1    0
[5,]    0    0    0    0    0

So, row-column combinations represent how often two integers have been shown together in a triplet: 3 and 4 have been shown together once, 2 and 3 have been shown together twice.

Question: How can one pick triplets, such that every combination (1-2, 1-3, 1-4...) was picked at least once and the number of triplets is minimized.

I'm looking for an algorithm here that picks the next triplet.

Ideally it can be extended to

  • arbitrarily big matrices (10x10, 100x100 ...)
  • arbitrarily big vectors (quadruplets, quintuplets, n-tuplets)
  • an arbitrary number of times a combination must have been picked at least

Example:

myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, 1:3)
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, 3:5)
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, c(1,4,5))
myMatrix
myMatrix <- addCombinationsToMatrix(myMatrix, c(2,4,5))
myMatrix

EDIT: Just to be sure: the answer doesn't have to be R code. It can be some other language as well or even pseudo code.

EDIT 2: It occured to me now, that there are different ways of measuring efficiency. I actually meant, the algorithm should take as little iterations as possible. The algorithm being fast is also very cool, but not the main goal here.

like image 494
Georgery Avatar asked Feb 12 '20 09:02

Georgery


2 Answers

Great question! This comes up in survey design, where you want a few different versions of the survey that each only contain a subset of the questions, but you want every pair (or t-tuple) of questions to have been asked at least once.

This is called covering design, and is a variant of the classic set cover problem. As you can read in an excellent Mathematics Stack Exchange post on the topic, folks use notation C(v, k, t) indicating the minimum number of k-element subsets you need to draw (k=3 in your case) from a v-element set (v=5 in your case) such that every t-element subset in the entire set (t=2 in your case) is contained within one of your selected subsets. Folks have evaluated this function for many different (v, k, t) tuples; see, for instance, https://ljcr.dmgordon.org/cover/table.html . We can read from that table that C(5, 3, 2) = 4, with the following as one possible design:

  1  2  3
  1  4  5
  2  3  4
  2  3  5

First and foremost, this problem is NP-hard, so all known exact algorithms will scale exponentially in inputs v, k, and t. So while you may be able to solve small instances exactly by enumeration or some more clever exact method (e.g. integer programming), you will likely need to resort to heuristic methods as the problem size gets very large.

One possibility in this direction is lexicographic covering, as proposed in https://arxiv.org/pdf/math/9502238.pdf (you will note that many of the solutions on the site linked above list "lex covering" as the method of construction). Basically you list out all possible k-tuples in lexicographic order:

123
124
125
134
135
145
234
235
245
345

Then you greedily add the k-tuple that covers the most previously uncovered t-tuples, breaking ties using the lexicographic ordering.

Here's how the algorithm works in our case:

  1. At the beginning every 3-tuple covers 3 different 2-tuples, so we add 123 since it is lexicographically earliest.

  2. After doing this, the 2-tuples of 12, 13, and 23 have been covered, while all remaining 2-tuples are not covered. A number of 3-tuples cover 3 more 2-tuples, e.g. 145 and 245. We pick 145, since it is lexicographically first, covering 14, 45, and 15.

  3. Now we have 4 remaining uncovered 2-tuples -- 24, 25, 34, and 35. No 3-tuple covers 3 of these, but several cover 2, e.g. 234 and 345. We select 234 as the lexicographically earliest.

  4. We have two remaining uncovered 2-tuples -- 25 and 35. We select 235 as the only 3-tuple that covers both.

We end up with the exact solution shown above. Importantly, this is just a heuristic method -- it doesn't give any guarantee that 4 is the smallest number of 3-tuples needed to cover all pairs in a set with 5 elements. In this case, a lower bound by Schönheim (a reference is provided in the linked article above) convinces us that, in fact, C(5, 3, 2) cannot be smaller than 4. We conclude that the solution from lexicographic covering is in fact optimal.

You would need a tweak to cover each t-tuple a certain number of times r. One obvious one would just be to repeat each tuple to be covered "r" times, and then run lex covering as usual (so for instance in the first step above each 3-tuple would cover 9 2-tuples with r=3). Of course this remains a heuristic for your overall problem due to the use of lex covering.

like image 106
josliber Avatar answered Oct 10 '22 21:10

josliber


Here is an option using data.table to keep track of the matrix count and RcppAlgos to generate the combinations:

library(RcppAlgos)
library(data.table)

M <- 100 #5 #10 #100
sz <- 5 #3 #4 5 
minpick <- 3 #1 #2
d <- integer(M)

system.time({
    universe <- as.data.table(comboGeneral(M, 2L, nThreads=4L))[, count := 0L]
    ntuples <- 0
    while (universe[, any(count < minpick)]) {
        v <- universe[order(count), head(unique(c(V1[1L:2L], V2[1L:2L])), sz)]
        universe[as.data.table(comboGeneral(v, 2L, nThreads=4L)), on=.NATURAL, count := count + 1L]
        ntuples = ntuples + 1L
    }
    ntuples
})
#   user  system elapsed 
#  26.82    9.81   28.75 

m <- matrix(0L, nrow=M, ncol=M)
m[as.matrix(universe[, V1:V2])] <- universe$count
m + t(m) + diag(d)

It is a greedy algorithm, hence, I am not sure if this will result a minimum number of tuples.

like image 22
chinsoon12 Avatar answered Oct 10 '22 22:10

chinsoon12