 

Evenly distribute repetitive strings

I need to distribute a set of repetitive strings as evenly as possible.

Is there any way to do this better than simple shuffling with unsort? Shuffling can't do what I need.

For example if the input is

aaa
aaa
aaa
bbb
bbb

The output I need

aaa
bbb
aaa
bbb
aaa

Neither the number of distinct strings nor the number of repetitions of any string has a limit. The input can be changed to a list of string number_of_reps pairs:

aaa 3
bbb 2
... .
zzz 5

Is there an existing tool, Perl module or algorithm to do this?

Guha asked Apr 21 '13 08:04

1 Answer

Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.

Weighing an arrangement of items

By "evenly distribute" I mean that the intervals between any two occurrences of a string, as well as the interval between the start point and the first occurrence and the interval between the last occurrence and the end point, must be as close to equal as possible, where an 'interval' is the number of other strings.

It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination

A B A C B A A

would give the count

A: 1 2 3 1 1
B: 2 3 3
C: 4 4

I.e. two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the sequence. This convention makes the distances easier to calculate; the constant offset will be removed later.

This is the code for counting distances:

sub distances {
    my %distances;
    my %last_seen;

    for my $i (0 .. $#_) {
        my $s = $_[$i];
        push @{ $distances{$s} }, $i - ($last_seen{$s} // -1);
        $last_seen{$s} = $i;
    }

    push @{ $distances{$_} }, @_ - $last_seen{$_} for keys %last_seen;

    return values %distances;
}
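As a cross-check of the distance counting, here is a small Python sketch of the same logic (the names are mine); it reproduces the counts from the A B A C B A A example above:

```python
from collections import defaultdict

def distances(seq):
    """For each string, collect the gaps between consecutive occurrences,
    plus the gap from the start to the first occurrence and from the
    last occurrence to the end (adjacent items have distance 1)."""
    dist = defaultdict(list)
    last_seen = {}
    for i, s in enumerate(seq):
        # distance from the previous occurrence (or from index -1, the start edge)
        dist[s].append(i - last_seen.get(s, -1))
        last_seen[s] = i
    for s, i in last_seen.items():
        dist[s].append(len(seq) - i)  # distance to the end edge
    return dict(dist)

print(distances("A B A C B A A".split()))
# {'A': [1, 2, 3, 1, 1], 'B': [2, 3, 3], 'C': [4, 4]}
```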

Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far it is off from the average a. As the deviation is squared, large anomalies are heavily penalized:

variance(d, a) = (a - d)²

We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:

svar(items) = sqrt ∑_i variance(items[i], average(items))

Expressed as Perl code:

use List::Util qw/sum min/;

sub svar (@) {
    my $med = sum(@_) / @_;
    sqrt sum map { ($med - $_) ** 2 } @_;
}

We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
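The same measure in Python (a sketch mirroring the Perl svar above): perfectly regular distances give a standard variance of zero, and any spread raises it:

```python
from math import sqrt

def svar(xs):
    """Sum of squared deviations from the mean, square-rooted."""
    avg = sum(xs) / len(xs)
    return sqrt(sum((avg - x) ** 2 for x in xs))

print(svar([2, 2, 2]))                   # 0.0 -- perfectly even
print(round(svar([1, 2, 3, 1, 1]), 3))   # 1.789 -- the 'A' distances from above
```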

Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:

  • Strings with more occurrences should have greater weight than strings with fewer occurrences.
  • Uneven distributions should have greater weight than even distributions, to strongly penalize unevenness.

The following can be swapped out for a different procedure, but I decided to weigh each standard variance by raising it to the power of occurrences, then adding all the weighted standard variances:

sub weigh_distance {
    return sum map {
        my @distances = @$_; # the distances of one string
        svar(@distances) ** $#distances;
    } distances(@_);
}

This turns out to prefer good distributions.
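To illustrate in Python (my own recombination of the two helpers above, not the author's code): the weight of the known-optimal arrangement A A C A B A comes out smaller than that of A B A A C A:

```python
from collections import defaultdict
from math import sqrt

def distances(seq):
    dist, last = defaultdict(list), {}
    for i, s in enumerate(seq):
        dist[s].append(i - last.get(s, -1))
        last[s] = i
    for s, i in last.items():
        dist[s].append(len(seq) - i)
    return dist.values()

def svar(xs):
    avg = sum(xs) / len(xs)
    return sqrt(sum((avg - x) ** 2 for x in xs))

def weigh_distance(seq):
    # raise each string's svar to the power of its occurrence count
    # (a string with n occurrences has n + 1 recorded distances)
    return sum(svar(d) ** (len(d) - 1) for d in distances(seq))

good = weigh_distance("A A C A B A".split())
bad  = weigh_distance("A B A A C A".split())
print(good < bad)  # True
```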

We can now calculate the weight of a given permutation by passing it to weigh_distance. Therefore, we can decide whether two permutations are equally well distributed, or whether one is to be preferred:

Selecting optimal permutations

Given a selection of permutations, we can select those permutations that are optimal:

sub select_best {
    my %sorted;
    for my $strs (@_) {
        my $weight = weigh_distance(@$strs);
        push @{ $sorted{$weight} }, $strs;
    }
    my $min_weight = min keys %sorted;
    @{ $sorted{$min_weight} }
}

This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returned array can be selected.

Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
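One way around that is to compare weights numerically, keeping everything within a small epsilon of the minimum. A Python sketch (the weigh parameter stands in for weigh_distance; names and epsilon are my choices):

```python
def select_best(arrangements, weigh, eps=1e-9):
    """Return all arrangements whose weight is within eps of the minimum,
    comparing weights as numbers rather than as hash-key strings."""
    weighted = [(weigh(a), a) for a in arrangements]
    best = min(w for w, _ in weighted)
    return [a for w, a in weighted if w <= best + eps]

# e.g. with a toy weight function:
print(select_best([[1, 2], [3], [1]], weigh=sum))  # [[1]]
```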

Creating all possible permutations

For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping each string to its remaining number of occurrences. With a bit of recursion, we can build all permutations like this:

use Carp;
# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
    my %words = @_;
    my @keys =
        sort  # sorting is important for cache access
        grep { $words{$_} > 0 }
        grep { length or carp "Can't use empty strings as identifiers" }
        keys %words;
    my ($perms, $ok) = _fetch_perm_cache(\@keys, \%words);
    return @$perms if $ok;
    # build perms manually, if it has to be.
    # pushing into @$perms directly updates the cached values
    for my $key (@keys) {
        my @childs = make_perms(%words, $key => $words{$key} - 1);
        push @$perms, (@childs ? map [$key, @$_], @childs : [$key]);
    }
    return @$perms;
}

The _fetch_perm_cache sub returns a reference to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, which stores the permutations on leaf nodes. To mark the leaf nodes, I used the empty string, hence the test above.

use feature 'state'; # "state" (and the "//" operator above) require Perl 5.10+

sub _fetch_perm_cache {
    my ($keys, $idxhash) = @_;
    state %perm_cache;
    my $pointer = \%perm_cache;
    my $ok = 1;
    $pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for @$keys;
    $pointer = $pointer->{''} //= do { $ok = 0; +[] }; # access empty string key
    return $pointer, $ok;
}

That not all strings are valid input keys is no issue: every collection can be enumerated, so make_perms could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if %perm_cache were shared).
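For comparison, the recursive enumeration itself (without the cache) can be sketched in Python; treating the input as a dict of remaining counts, it yields each distinct permutation of the multiset exactly once:

```python
def make_perms(counts):
    """All distinct orderings of a multiset given as {item: count}."""
    keys = sorted(k for k, n in counts.items() if n > 0)
    perms = []
    for k in keys:
        rest = dict(counts)
        rest[k] -= 1  # consume one occurrence of k
        children = make_perms(rest)
        if children:
            perms.extend([k] + c for c in children)
        else:
            perms.append([k])
    return perms

print(make_perms({"A": 2, "B": 1}))
# [['A', 'A', 'B'], ['A', 'B', 'A'], ['B', 'A', 'A']]
```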

Connecting the pieces

This is now a simple matter of

say "@$_" for select_best(make_perms(A => 4, B => 1, C => 1))

This would yield

A A C A B A
A A B A C A
A C A B A A
A B A C A A

which are all optimal solutions by the used definition. Interestingly, the solution

A B A A C A

is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Further work.

Completing the test cases

The preferable version comes first in each pair: AABAA vs. ABAAA, ABABACA vs. ABACBAA (two 'A' in a row), ABAC vs. ABCA.

We can run these test cases by

use Test::More tests => 3;
my @test_cases = (
  [0 => [qw/A A B A A/], [qw/A B A A A/]],
  [1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
  [0 => [qw/A B A C/], [qw/A B C A/]],
);
for my $test (@test_cases) {
  my ($correct_index, @cases) = @$test;
  my $best = select_best(@cases);
  ok $best ~~ $cases[$correct_index], "[@{$cases[$correct_index]}]";
}

Out of interest, we can calculate the optimal distributions for these letters:

my @counts = (
  { A => 4, B => 1 },
  { A => 4, B => 2, C => 1},
  { A => 2, B => 1, C => 1},
);
for my $count (@counts) {
  say "Selecting best for...";
  say "  $_: $count->{$_}" for keys %$count;
  say "@$_" for select_best(make_perms(%$count));
}

This gives us:

Selecting best for...
  A: 4
  B: 1
A A B A A
Selecting best for...
  A: 4
  C: 1
  B: 2
A B A C A B A
Selecting best for...
  A: 2
  C: 1
  B: 1
A C A B
A B A C
C A B A
B A C A

Further work

  • Because the weighing attributes the same importance to the distance to the edges as to the distance between letters, symmetrical setups are preferred. This condition could be eased by reducing the value of the distance to the edges.
  • The permutation generation algorithm has to be improved. Memoization could lead to a speedup. Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.
  • It would be great to find a heuristic to guide the permutation generation, instead of evaluating all possibilities. A possible heuristic would consider whether there are enough different strings available that no string has to neighbour itself (i.e. distance one). This information could be used to narrow the width of the search tree.
  • Transforming the recursive perm generation into an iterative solution would allow interweaving the search with the weight calculation, which would make it easier to skip or defer unfavourable solutions.
  • The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.

    weight(svar, occurrences) → weighted_variance
    weight(0.9, 10) → 0.35
    weight(0.5, 1)  → 0.5
    

    This should in fact be reversed.

Edit

Below is a faster procedure that approximates a good distribution. In some cases it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but it is generally acceptable when only a few strings have few occurrences. It is significantly faster than the brute-force solution.

It works by inserting strings at regular intervals, then spreading out avoidable repetitions.

sub approximate {
    my %def = @_;
    my ($init, @keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
    my @out = ($init) x $def{$init};
    while(my $key = shift @keys) {
        my $visited = 0;
        for my $parts_left (reverse 2 .. $def{$key} + 1) {
            my $interrupt = $visited + int((@out - $visited) / $parts_left);
            splice @out, $interrupt, 0, $key;
            $visited = $interrupt + 1;
        }
    }
    # check if strings should be swapped
    for my $i ( 0 .. $#out - 2) {
        @out[$i, $i + 1] = @out[$i + 1, $i]
            if  $out[$i] ne $out[$i + 1]
            and $out[$i + 1] eq $out[$i + 2]
            and (!$i or $out[$i + 1 ] ne $out[$i - 1]);
    }
    return @out;
}
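A Python rendering of the same approximation (my translation, including the swap pass) reproduces one of the optimal arrangements for the A => 4, B => 1, C => 1 example:

```python
def approximate(counts):
    """Place the most frequent string first, insert the others at regular
    intervals, then swap out avoidable adjacent repetitions."""
    init, *rest = sorted(counts, key=lambda k: (-counts[k], k))
    out = [init] * counts[init]
    for key in rest:
        visited = 0
        for parts_left in range(counts[key] + 1, 1, -1):
            pos = visited + (len(out) - visited) // parts_left
            out.insert(pos, key)
            visited = pos + 1
    # spread out repetitions where a swap helps
    for i in range(len(out) - 2):
        if (out[i] != out[i + 1] and out[i + 1] == out[i + 2]
                and (i == 0 or out[i + 1] != out[i - 1])):
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(approximate({"A": 4, "B": 1, "C": 1}))
# ['A', 'A', 'C', 'A', 'B', 'A']
```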

Edit 2

I generalized the algorithm to work with any objects, not just strings. I did this by translating the input to an abstract representation like “two of the first thing, one of the second”. The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2 and C => 4, B => 2 and $regex => 2, $fh => 4 all represent the same abstract multiset. The speed penalty incurred by the necessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.

The largest bottleneck is the select_best sub, which I have largely rewritten in Inline::C (it still eats ~80% of execution time).

These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.

amon answered Oct 19 '22 22:10