Suppose I have a 2D array like the following:
GACTG
AGATA
TCCGA
Each array element is taken from a small finite set (in my case, DNA nucleotides -- {A, C, G, T}
). I would like to randomly shuffle this array somehow while preserving both row and column nucleotide frequencies. Is this possible? Can it be done efficiently?
[EDIT]: By this I mean I want to produce a new matrix where each row has the same number of A
s, C
s, G
s and T
s as the corresponding row of the original matrix, and where each column has the same number of A
s, C
s, G
s and T
s as the corresponding column of the original matrix. Permuting the rows or columns of the original matrix will not achieve this in general. (E.g. for the example above, the top row has 2 G
s, and 1 each of A
, C
and T
; if this row was swapped with row 2, the top row in the resulting matrix would have 3 A
s, 1 G
and 1 T
.)
It's simple enough to preserve just column frequencies by shuffling a column at a time, and likewise for rows. But doing this will in general alter the frequencies of the other kind.
My thoughts so far: If it's possible to pick 2 rows and 2 columns so that the 4 elements at the corners of this rectangle have the pattern
XY
YX
for some pair of distinct elements X
and Y
, then replacing these 4 elements with
YX
XY
will maintain both row and column frequencies. In the example at the top, this can be done for (at least) rows 1 and 2 and columns 2 and 5 (whose corners give the 2x2 matrix AG;GA
), and for rows 1 and 3 and columns 1 and 4 (whose corners give GT;TG
). Clearly this could be repeated a number of times to produce some level of randomisation.
Generalising a bit, any "subrectangle" induced by a subset of rows and a subset of columns, in which the frequencies of all rows are the same and the frequencies of all columns are the same, can have both its rows and columns permuted to produce a valid complete rectangle. (Of these, only those subrectangles in which at least 1 element is changed are actually interesting.) Big questions:
This question addresses a special case in which the set of possible elements is {0, 1}
. The solutions people have come up with there are similar to what I have come up with myself, and are probably usable, but not ideal as they require an arbitrary amount of backtracking to work correctly. Also I'm concerned that only 2x2 swaps are considered.
Finally, I would ideally like a solution that can be proven to select a matrix uniformly at random from the set of all matrices having identical row frequencies and column frequencies to the original. I know, I'm asking for a lot :)
Edit: oops missed the last paragraph of OP's question, let me rephrase.
To digress briefly, the question you linked to had quite a hilarious discussion about the "level" of randomness for the selected solution, allow me to paraphrase:
"...I really require matrices that are as random as possible..."
"...The algorithm, as implemented in the code, is quite random..."
"...if you choose this method, a different way to improve the randomness is to repeat the randomization process several times (a random number of times)..."
None of these comments make any sort of sense, there is no such thing as "more" random, this is all exactly like this lovely Daily WTF entry. That said, the last quote is almost onto something. It's well known that if you simulate a Markov chain, like that random swapping algorithm, for long enough you will eventually start generating samples from the steady state distribution. Just exactly what that distribution looks like, who knows...
Anyway, depending on your objectives you may not really care what this distribution looks like as long as it contains enough elements. So some sort of swapping algorithm might be useful, but I really would not expect this to be easy since the problem is NP-Complete (more general than Sudoku).
With that in mind, you could consider solving your problem any approach that works for solving Sudoku, if you are in Acadamia I would suggest getting a copy of IBM CPLEX 12 which is free for academic use. You can code up a Sudoku-like solver in their CP language (OPL) and as the integer linear program solver to generate solutions for you. I think they even have example code for solving Sudoku you can borrow from.
Here's the only truly random and unbiased way I can think of to sample from such matrices: First get CPLEX to find all N solutions to the given Sudoku problem. After you have this set of N solutions, draw a random number between 1 and N and use that solution, if you want another one, draw another number. Since generating all solutions might be a bit slow, you could approximate something like this by telling the solver to stop after a certain number of solutions or time elapsed and only sample from that set.
The answer to question 2 is no. Consider the following 2 matrices:
A B C C A B
C A B B C A
B C A A B C
They clearly have the same row and column frequencies. Yet, there is no 2x2 submatrix with common corners.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With