Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a hierarchy file using perl

Tags:

perl

My task is to create a parent-child hierarchy file using perl.

Sample Input file (tab delimited). The records would be arranged in the file in random order and the "parent" may appear after the "child".

 S5 S3
 S5 S8
 ROOT   S1
 S1 S7
 S2 S5
 S3 S4
 S1 S2
 S4 77
 S2 S9
 S3 88

Sample Output File (tab delimited)

ROOT    S1  S2  S5  S3  S4  77
ROOT    S1  S2  S5  S3  88
ROOT    S1  S7
ROOT    S1  S2  S5  S8
ROOT    S1  S2  S9

The code which produces the above output file

use strict;

# usage: perl parent_child_generator.pl input.txt output.txt

my $input0=$ARGV[0] or die "must provide input.txt as the first argument\n";
my $output1=$ARGV[1] or die "must provide output.txt as the second argument\n";

open(IN0,"<",$input0) || die "Cannot open $input0 for reading: $!";
open(OUT1,">",$output1) || die "Cannot open $output1 for writing: $!";

sub trim
{
    my $string=shift;
$string=~s/\r?\n$//;
$string=~s/^\s+//;
$string=~s/\s+$//;
return $string;
}

sub connectByPrior
{
my $in_child=$_[0];
my %in_hash=%{$_[1]};
my @anscestor_arr;

for (sort keys %in_hash)
{
    my $key=$_;
    my @key_arr=split(/\t/,$key);
    my $parent=$key_arr[0];
    my $child=$key_arr[1];

    if ($in_child eq $child)
    {
        push (@anscestor_arr,$parent);
        @anscestor_arr=(@{connectByPrior($parent,\%in_hash)},@anscestor_arr);
        last;
    }
}
return \@anscestor_arr;
}

my %parent_hash;
my %child_hash;
my %unsorted_hash;
while(<IN0>)
{
my @cols=split(/\t/);
for (my $i=0; $i < scalar(@cols); $i++)
{
    $cols[$i]= trim($cols[$i]);
}

my $parent=$cols[0];
my $child=$cols[1];
my $parent_child="$parent\t$child";

$parent_hash{$parent}=1;
$child_hash{$child}=1;
$unsorted_hash{$parent_child}=1;
 }
 close(IN0);

my @lev0_arr;
for (sort keys %child_hash)
{
my $rec=$_;
if (!exists($parent_hash{$rec}))
{
    push (@lev0_arr,$rec);
}
}

for (@lev0_arr)
{
my $child=$_;
my @anscestor_arr=@{connectByPrior($child,\%unsorted_hash)};
push (@anscestor_arr,$child);
print OUT1 join("\t",@anscestor_arr)."\n";
}

Question: The code works fine if input file is not too big. The actual input file contains more than 200k lines and the code is taking too long time to process the output. What improvements / changes do you suggest so that it does not take too long to process?

like image 213
sachin Avatar asked Oct 05 '10 09:10

sachin


2 Answers

You seem to be trying to construct and pretty-print a directed graph:

#!/usr/bin/perl

use strict; use warnings;
use Graph::Directed;
use Graph::TransitiveClosure::Matrix;

 my $g = Graph::Directed->new;

while ( my $line = <DATA> ) {
    next unless my ($x, $y) = split ' ', $line;
    $g->add_edge($x, $y);
}

my @start = $g->source_vertices;
my @end   = $g->sink_vertices;

my $tcm = Graph::TransitiveClosure::Matrix->new( $g,
    path_vertices => 1,
);

for my $s ( @start ) {
    for my $e ( @end ) {
        next unless $tcm->is_reachable($s, $e);
        print join("\t", $tcm->path_vertices($s, $e)), "\n";
    }
}

__DATA__
S5 S3
S5 S8
ROOT   S1
S1 S7
S2 S5
S3 S4
S1 S2
S4 77
S2 S9
S3 88

Output:

ROOT    S1      S2      S9
ROOT    S1      S2      S5      S8
ROOT    S1      S2      S5      S3      S4      77
ROOT    S1      S2      S5      S3      88
ROOT    S1      S7

I am not sure if the memory overhead of using Graph and computing a transitive closure matrix is going to be prohibitive in your case.

like image 95
Sinan Ünür Avatar answered Oct 02 '22 01:10

Sinan Ünür


First thing that comes to mind, although it's completely unrelated to your actual question, is that you should at least consider making your interface more flexible. If you read from <> and print to STDOUT, you'll be able to take input from either STDIN or an arbitrarily-sized list of files given on the command line, while output can either be viewed on the console or redirected into a file with only a minor change in calling convention:

parent_child_generator.pl input1.txt input2.txt input3.txt > output.txt

Another minor point is that the $string=~s/\r?\n$//; in trim is unnecessary[1]. $string=~s/\s+$//; will take care of it:

$ perl -e 'my $foo = "test\r\n"; print "--$foo--\n"; $foo =~ s/\s+$//; print "--$foo--\n";'
--test
--
--test--

Getting to your performance issue (finally...), the core problem is that you're calling connectByPrior for every element in @lev0_arr[2] and connectByPrior not only loops over %unsorted_hash[3] each time it's called, but, within that loop, it calls itself recursively! At a first approximation, that's somewhere between O(n^2 log n) and O(n^3), depending on the shape of your trees, which is simply horrible. You need to avoid touching every piece of data multiple times for every other piece of data you receive.

So, how would I go about doing that? My first thought is to use one hash to keep track of my root nodes (all those which have nothing linking to them) and one hash of hashes (HoH) to keep track of all links. As each input line is seen, split it into parent and child, as you're doing. If the parent does not already have an entry in the link HoH, add it to the root hash. If the child is in the root hash, remove it. If the child is not in the links HoH, add an empty hashref for it (so we'll know in the future that it's not a root). Finally, add an entry in the link HoH indicating that the parent links to the child.

Output is then just a matter of iterating over the root hash (your list of starting points) and, for each node found there, recursively printing that node's children.

Like so:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my %root;
my %link;

while (<>) {
  my ($parent, $child) = split /\t/, $_, 2;
  next unless defined $parent and defined $child;
  $_ = trim($_) for ($parent, $child);

  $root{$parent} = 1 unless exists $link{$parent};
  delete $root{$child};
  $link{$child} ||= {};
  $link{$parent}{$child} = 1;
}

print_links($_) for sort keys %root;

exit;

sub trim {
  my $string=shift;
  $string=~s/^\s+//;
  $string=~s/\s+$//;
  return $string;
}

sub print_links {
  my @path = @_;

  my %children = %{$link{$path[-1]}};
  if (%children) {
    print_links(@path, $_) for sort keys %children;
  } else {
    say join "\t", @path;
  }
}

Given your example input, this produces the output:

ROOT    S1      S2      S5      S3      88
ROOT    S1      S2      S5      S3      S4      77
ROOT    S1      S2      S5      S8
ROOT    S1      S2      S9
ROOT    S1      S7

Since this version only touches each link once for input and once for output, it should scale more-or-less linearly as the amount of input data increases.

(Of course, Sinan is right to suggest that you should go to CPAN if you're really looking to get things done, but I was having fun with it.)

Edit: Code should test whether $parent and $child are defined, not whether they're true, per Sinan's comment.

[1] You should normally use chomp to remove linebreaks instead of a regex anyhow, but I'm giving you the benefit of the doubt and assuming that you may be processing input containing one style of line breaks in an environment which uses the other style.

[2] ...which contains all the leaf nodes, so it's going to get pretty big with 200k input lines unless you have extremely narrow, deep trees.

[3] ...which contains every input line, trimmed of extraneous whitespace.

like image 40
Dave Sherohman Avatar answered Oct 02 '22 01:10

Dave Sherohman