Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a fast way to sample from a subset of GLn?

The rules of this problem are fairly specific because I'm actually looking at a subset of GLn, where the row and column vectors must have a certain form (call these vectors valid -- examples below), so please bear with me. Here are the rules:

  1. You can select a valid nonzero vector of length n uniformly at random.

  2. Given valid vectors v1, v2, ..., vk, you can determine whether or not the partial columns they form are prefixes of valid vectors (e.g. whether [v1_1 v2_1 ... vk_1] occurs as the first k entries of a valid vector) using the function isPrefix.

  3. Given valid vectors v1, v2, ..., vk, you can determine whether or not they are linearly dependent using the function areIndependent.

The goal is to sample from this subset of GLn uniformly at random. Here is a naive solution that fails:

    Step 1: Select a valid v1 uniformly at random. 
            If isPrefix(v1) then Go to Step 2.
            Otherwise repeat Step 1.

    Step 2: Select a valid v2 uniformly at random. 
            If areIndependent(v1,v2) & isPrefix(v1,v2), then go to Step 3. 
            Otherwise, repeat Step 2.

    ...

    Step n: Select a valid vn uniformly at random. 
            If areIndependent(v1,v2,...,vn) & isPrefix(v1,v2,...,vn), then END. 
            Otherwise, repeat Step n.

The problem with this would-be solution is that it can get stuck in an infinite loop in the not-too-unlikely event that areIndependent(v1,v2,...,vk) & isPrefix(v1,v2,...,vk) correctly returns TRUE but there is no way to complete this k-tuple to a linearly independent valid n-tuple. A common instance of this is when k=n-1 and there is a unique valid vector vn such that isPrefix(v1,v2,...,vn) is TRUE but this vector is not independent of the previous n-1 vectors.

Therefore I need to add in some way to back up a step or two when I'm in this loop, but I don't necessarily know when I'm in it. I'm looking for a solution along these lines: If Step k fails f(k) times for some explicit function f that may depend on the distribution of valid vectors, then go back to Step k-1 (or even further, in some explicit way).

Any suggestions, comments, references, etc will be greatly appreciated! Thanks!

EXAMPLES:

I'm actually looking for a general reference or guideline for how to proceed with the sampling. I have several examples of valid vectors on which I would like to do this, and it's more helpful to me to be able to do it on my own eventually than to list every example and have the SO community hash out solutions. However, to be concrete and give a sense of the difficulties involved, here is one example:

Alternating Sign Matrices: A vector is valid iff its entries are all 0, -1, 1, the nonzero entries alternate between 1s and -1s, and the sum of the entries is 1. For instance, every permutation matrix will consist of valid vectors, and also the following:

 0  1  0
 1 -1  1
 0  1  0

Distinct Entries: A vector is valid iff it has distinct entries. This one is particularly annoying because the condition applies to rows as well as columns.

Thanks again to all the folks who've looked at this!

like image 443
PengOne Avatar asked May 17 '11 05:05

PengOne


1 Answers

I suspect that you might have to move to a Markov Chain Monte Carlo algorithm - http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm - not necessarily for speed, but ensure that your random samples are sensibly distributed.

One definition of random sampling is to produce the same distribution as if you generated matrices at random from your original distribution of columns and then kept only the valid ones.

If you generate items from a tree, and the nodes do not all have the same degree, the leaves will not be visited with equal probability. Consider a simple tree with two non-leaf nodes, one of which has a single leaf child, with the other having one million leaf children. If you sample by moving down from the root, making a random choice at each node, the single leaf child will be visited more often than any particular leaf from the set with a million siblings.

Since you got stuck in an infinite loop above, you found a case where there is a node with no children. Assuming that there are any leaves at all, you have a tree where the nodes do not all have the same degree.

You may end up having to code different approaches for the different validity rules, and have to worry about how long it takes your Markov Chain to "burn in" and become reasonably random. There is one (sort of) exception from the later point. If you are trying to work out a tail probability to rule out the possibility that a given configuration was selected at random, you could start your Markov Chain from that point, because - under the null hypothesis - that point is already randomly chosen, so if all of your generated values have a larger statistic than that starting point, something fishy is going on.

like image 144
mcdowella Avatar answered Oct 27 '22 05:10

mcdowella