Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interleaving sparse sorted arrays

I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:

[[ do, re, fa, ti ],
 [ do, re, mi ],
 [ do, la, ti, za ],
 [ mi, fa ],
 [ re, so, za ]]

The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.

I'd like to be able to take that input and generate a sorted list of all events:

[ do, re, mi, fa, so, la, ti, za ]

or better yet, some information about each event, like a count:

[ [do, 3], [re, 3], [mi, 2],
  [fa, 2], [so, 1], [la, 1],
  [ti, 1], [za, 2] ]

Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.

I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.

like image 351
Plutor Avatar asked Jul 09 '10 18:07

Plutor


1 Answers

You can use tsort to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in reading tsort's original use, which is similar in structure to your problem.

Note that tsort requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.

#! /usr/bin/perl

use warnings;
use strict;

use IPC::Open2;

sub tsort {
  my($events) = @_;

  my $pid = open2 my $out, my $in, "tsort";

  foreach my $group (@$events) {
    foreach my $i (0 .. $#$group - 1) {
      print $in map "@$group[$i,$_]\n", $i+1 .. $#$group;
    }
  }

  close $in or warn "$0: close: $!";

  chomp(my @order = <$out>);
  my %order = map +(shift @order => $_), 0 .. $#order;
  wantarray ? %order : \%order;
}

Because you described the data as sparse, the code above provides tsort with as much information as possible about the events' adjacency matrix.

Having that information, computing a histogram and sorting its components is straightforward:

my $events = [ ... ];

my %order = tsort $events;

my %seen;
do { ++$seen{$_} for @$_ } for @$events;

my @counts;
foreach my $event (sort { $order{$a} <=> $order{$b} } keys %seen) {
  push @counts => [ $event, $seen{$event} ];
  print "[ $counts[-1][0], $counts[-1][1] ]\n";
}

For the input in your question you provided, the output is

[ do, 3 ]
[ la, 1 ]
[ re, 3 ]
[ so, 1 ]
[ mi, 2 ]
[ fa, 2 ]
[ ti, 2 ]
[ za, 2 ]

This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by $events: we know only that they must both come after do.

like image 127
Greg Bacon Avatar answered Sep 18 '22 15:09

Greg Bacon