Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the least-used permutation

I need to distribute a set of data evenly over time based on historical data such that each digit appears an equal (or close to equal) number of times in each position over time. The problem is, given a list of orderings used in the past, that look like this (but could have any number of elements):

1,2,5,3,4
4,1,5,2,3
1,3,5,2,4
4,1,2,3,5
2,4,1,3,5
5,1,4,3,2
1,5,3,2,4
5,1,3,2,4
3,2,5,4,1
4,3,1,5,2

how can I find an ordering of the values that is the least used and will lead to a "more balanced" set of orderings. The obvious answer is I could group by and count them and pick the least used one, but the problem is the least used permutation may not have ever been used, for example here, the ordering "1,2,3,4,5" is a candidate for least used because it doesn't appear at all.

The simple answer seems to be to identify which position "1" appears in the least frequent and set that position to "1" and so on for each digit. I suspect that works, but I feel like there's a more elegant solution that I haven't considered potentially with cross joins so that all possible combinations are included.

any ideas?

like image 201
powlette Avatar asked Sep 22 '11 15:09

powlette


1 Answers

What you have here is a histogram leveling problem.

Consider the problem from this perspective: you have a set of N histograms that represent the frequency of occurrence of the value N values over a discrete range {1..N}. What you want to do is to add a new set of values to your population of data that shifts the all histograms closer to being level. Given the nature of your problem, we know that each value will, overall, appear the same number of times as every other value.

One way to do so, is to find which values N has the lowest frequency of occurence in any position - and assign it that position. Next, in the remaining histograms, find the next value with the lowest frequency of occurence in any position, and assign that value to that position. Continue repeating this process until all values have been assigned a unique position. This gives you your next set of values. You can now iteratively repeat this operation to continue generating new value sets that will attempt to re-balance the distribution of values with each iteration.

If you maintain the histograms as you distribute values, this becomes a relatively efficient operation (you don't have to constantly re-scan the data set).

Keep in mind, however, that for any sufficiently small population of values, you will always be "out of balance" to some degree. There's no way around this.

like image 145
LBushkin Avatar answered Oct 26 '22 23:10

LBushkin