I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:
[[ do, re, fa, ti ],
[ do, re, mi ],
[ do, la, ti, za ],
[ mi, fa ],
[ re, so, za ]]
The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.
I'd like to be able to take that input and generate a sorted list of all events:
[ do, re, mi, fa, so, la, ti, za ]
or better yet, some information about each event, like a count:
[ [do, 3], [re, 3], [mi, 2],
[fa, 2], [so, 1], [la, 1],
[ti, 1], [za, 2] ]
Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.
I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.
You can use tsort
to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in reading tsort
's original use, which is similar in structure to your problem.
Note that tsort
requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.
#! /usr/bin/perl
use warnings;
use strict;
use IPC::Open2;
sub tsort {
my($events) = @_;
my $pid = open2 my $out, my $in, "tsort";
foreach my $group (@$events) {
foreach my $i (0 .. $#$group - 1) {
print $in map "@$group[$i,$_]\n", $i+1 .. $#$group;
}
}
close $in or warn "$0: close: $!";
chomp(my @order = <$out>);
my %order = map +(shift @order => $_), 0 .. $#order;
wantarray ? %order : \%order;
}
Because you described the data as sparse, the code above provides tsort
with as much information as possible about the events' adjacency matrix.
Having that information, computing a histogram and sorting its components is straightforward:
my $events = [ ... ];
my %order = tsort $events;
my %seen;
do { ++$seen{$_} for @$_ } for @$events;
my @counts;
foreach my $event (sort { $order{$a} <=> $order{$b} } keys %seen) {
push @counts => [ $event, $seen{$event} ];
print "[ $counts[-1][0], $counts[-1][1] ]\n";
}
For the input in your question you provided, the output is
[ do, 3 ] [ la, 1 ] [ re, 3 ] [ so, 1 ] [ mi, 2 ] [ fa, 2 ] [ ti, 2 ] [ za, 2 ]
This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by $events
: we know only that they must both come after do.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With