Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to re-invent wheel when coding an algorithm which solves it...

I have say 2,000 binary (row) vectors and I need to pick 500 from them. In the picked sample I do column sums and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.

A tiny example:

Out of the vectors:

110
010
011
110
100

I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be

110
100

My ideas so far

  • one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
  • Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
  • as exact solution is not always feasible, something like simulated annealing brute force could work well
  • a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?
like image 971
liborm Avatar asked Aug 13 '19 12:08

liborm


1 Answers

My concrete, practical (if the approximation guarantee works out for you) suggestion would be to apply the maximum entropy method (in Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine) to find the maximum entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its distribution likelihood, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (shouldn't take too many tries due to concentration bounds).

like image 90
David Eisenstat Avatar answered Oct 05 '22 18:10

David Eisenstat