My task is to create a parent-child hierarchy file using perl.
Sample Input file (tab delimited). The records would be arranged in the file in random order and the "parent" may appear after the "child".
S5 S3
S5 S8
ROOT S1
S1 S7
S2 S5
S3 S4
S1 S2
S4 77
S2 S9
S3 88
Sample Output File (tab delimited)
ROOT S1 S2 S5 S3 S4 77
ROOT S1 S2 S5 S3 88
ROOT S1 S7
ROOT S1 S2 S5 S8
ROOT S1 S2 S9
The code which produces the above output file
use strict;
# usage: perl parent_child_generator.pl input.txt output.txt
my $input0=$ARGV[0] or die "must provide input.txt as the first argument\n";
my $output1=$ARGV[1] or die "must provide output.txt as the second argument\n";
open(IN0,"<",$input0) || die "Cannot open $input0 for reading: $!";
open(OUT1,">",$output1) || die "Cannot open $output1 for writing: $!";
sub trim
{
my $string=shift;
$string=~s/\r?\n$//;
$string=~s/^\s+//;
$string=~s/\s+$//;
return $string;
}
sub connectByPrior
{
my $in_child=$_[0];
my %in_hash=%{$_[1]};
my @anscestor_arr;
for (sort keys %in_hash)
{
my $key=$_;
my @key_arr=split(/\t/,$key);
my $parent=$key_arr[0];
my $child=$key_arr[1];
if ($in_child eq $child)
{
push (@anscestor_arr,$parent);
@anscestor_arr=(@{connectByPrior($parent,\%in_hash)},@anscestor_arr);
last;
}
}
return \@anscestor_arr;
}
my %parent_hash;
my %child_hash;
my %unsorted_hash;
while(<IN0>)
{
my @cols=split(/\t/);
for (my $i=0; $i < scalar(@cols); $i++)
{
$cols[$i]= trim($cols[$i]);
}
my $parent=$cols[0];
my $child=$cols[1];
my $parent_child="$parent\t$child";
$parent_hash{$parent}=1;
$child_hash{$child}=1;
$unsorted_hash{$parent_child}=1;
}
close(IN0);
my @lev0_arr;
for (sort keys %child_hash)
{
my $rec=$_;
if (!exists($parent_hash{$rec}))
{
push (@lev0_arr,$rec);
}
}
for (@lev0_arr)
{
my $child=$_;
my @anscestor_arr=@{connectByPrior($child,\%unsorted_hash)};
push (@anscestor_arr,$child);
print OUT1 join("\t",@anscestor_arr)."\n";
}
Question: The code works fine if input file is not too big. The actual input file contains more than 200k lines and the code is taking too long time to process the output. What improvements / changes do you suggest so that it does not take too long to process?
You seem to be trying to construct and pretty-print a directed graph:
#!/usr/bin/perl
use strict; use warnings;
use Graph::Directed;
use Graph::TransitiveClosure::Matrix;
my $g = Graph::Directed->new;
while ( my $line = <DATA> ) {
next unless my ($x, $y) = split ' ', $line;
$g->add_edge($x, $y);
}
my @start = $g->source_vertices;
my @end = $g->sink_vertices;
my $tcm = Graph::TransitiveClosure::Matrix->new( $g,
path_vertices => 1,
);
for my $s ( @start ) {
for my $e ( @end ) {
next unless $tcm->is_reachable($s, $e);
print join("\t", $tcm->path_vertices($s, $e)), "\n";
}
}
__DATA__
S5 S3
S5 S8
ROOT S1
S1 S7
S2 S5
S3 S4
S1 S2
S4 77
S2 S9
S3 88
Output:
ROOT S1 S2 S9 ROOT S1 S2 S5 S8 ROOT S1 S2 S5 S3 S4 77 ROOT S1 S2 S5 S3 88 ROOT S1 S7
I am not sure if the memory overhead of using Graph and computing a transitive closure matrix is going to be prohibitive in your case.
First thing that comes to mind, although it's completely unrelated to your actual question, is that you should at least consider making your interface more flexible. If you read from <>
and print to STDOUT
, you'll be able to take input from either STDIN
or an arbitrarily-sized list of files given on the command line, while output can either be viewed on the console or redirected into a file with only a minor change in calling convention:
parent_child_generator.pl input1.txt input2.txt input3.txt > output.txt
Another minor point is that the $string=~s/\r?\n$//;
in trim
is unnecessary[1]. $string=~s/\s+$//;
will take care of it:
$ perl -e 'my $foo = "test\r\n"; print "--$foo--\n"; $foo =~ s/\s+$//; print "--$foo--\n";'
--test
--
--test--
Getting to your performance issue (finally...), the core problem is that you're calling connectByPrior
for every element in @lev0_arr
[2] and connectByPrior
not only loops over %unsorted_hash
[3] each time it's called, but, within that loop, it calls itself recursively! At a first approximation, that's somewhere between O(n^2 log n) and O(n^3), depending on the shape of your trees, which is simply horrible. You need to avoid touching every piece of data multiple times for every other piece of data you receive.
So, how would I go about doing that? My first thought is to use one hash to keep track of my root nodes (all those which have nothing linking to them) and one hash of hashes (HoH) to keep track of all links. As each input line is seen, split it into parent and child, as you're doing. If the parent does not already have an entry in the link HoH, add it to the root hash. If the child is in the root hash, remove it. If the child is not in the links HoH, add an empty hashref for it (so we'll know in the future that it's not a root). Finally, add an entry in the link HoH indicating that the parent links to the child.
Output is then just a matter of iterating over the root hash (your list of starting points) and, for each node found there, recursively printing that node's children.
Like so:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %root;
my %link;
while (<>) {
my ($parent, $child) = split /\t/, $_, 2;
next unless defined $parent and defined $child;
$_ = trim($_) for ($parent, $child);
$root{$parent} = 1 unless exists $link{$parent};
delete $root{$child};
$link{$child} ||= {};
$link{$parent}{$child} = 1;
}
print_links($_) for sort keys %root;
exit;
sub trim {
my $string=shift;
$string=~s/^\s+//;
$string=~s/\s+$//;
return $string;
}
sub print_links {
my @path = @_;
my %children = %{$link{$path[-1]}};
if (%children) {
print_links(@path, $_) for sort keys %children;
} else {
say join "\t", @path;
}
}
Given your example input, this produces the output:
ROOT S1 S2 S5 S3 88
ROOT S1 S2 S5 S3 S4 77
ROOT S1 S2 S5 S8
ROOT S1 S2 S9
ROOT S1 S7
Since this version only touches each link once for input and once for output, it should scale more-or-less linearly as the amount of input data increases.
(Of course, Sinan is right to suggest that you should go to CPAN if you're really looking to get things done, but I was having fun with it.)
Edit: Code should test whether $parent
and $child
are defined, not whether they're true, per Sinan's comment.
[1] You should normally use chomp
to remove linebreaks instead of a regex anyhow, but I'm giving you the benefit of the doubt and assuming that you may be processing input containing one style of line breaks in an environment which uses the other style.
[2] ...which contains all the leaf nodes, so it's going to get pretty big with 200k input lines unless you have extremely narrow, deep trees.
[3] ...which contains every input line, trimmed of extraneous whitespace.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With