Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization: Python, Perl, and a C Suffix Tree Library

I've got about 3,500 files that consist of single line character strings. The files vary in size (from about 200b to 1mb). I'm trying to compare each file with each other file and find a common subsequence of length 20 characters between two files. Note that the subsequence is only common between two files during each comparison, and not common among all files.

I've stuggled with this problem a bit, and since I'm not an expert, I've ended up with a bit of an ad-hoc solution. I use itertools.combinations to build a list in Python that ends up with around 6,239,278 combinations. I then pass the files two at a time to a Perl script that acts a wrapper for a suffix tree library written in C called libstree. I've tried to avoid this type of solution but the only comparable C suffix tree wrapper in Python suffers from a memory leak.

So here's my problem. I've timed it, and on my machine, the solution processes about 500 comparisons in 25 seconds. So that means, it'll take around 3 days of continuous processing to complete the task. And then I have to do it all again to look at say 25 characters instead of 20. Please note that I'm way out of my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I'd ask it here and produce my code to see if anyone has any suggestion as to how I could complete this task faster.

Python code:

from itertools import combinations
import glob, subprocess

glist = glob.glob("Data/*.g")
i = 0

for a,b in combinations(glist, 2):
    i += 1
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
    p = p.stdout.read()
    a = a.split("/")
    b = b.split("/")
    a = a[1].split(".")
    b = b[1].split(".")
    print str(i) + ":" + str(a[0]) + " --- " + str(b[0])
    if p != "" and len(p) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(a[0] + " " + b[0] + "\n")

Perl code:

use strict;
use Tree::Suffix;

open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };

open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };

my @g = ($a,$b);

my $st  = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2],-1);

print "$c";
like image 949
mstcamus Avatar asked Oct 07 '22 04:10

mstcamus


1 Answers

Rather than writing Python to call Perl to call C, I'm sure you would be better off dropping the Python code and writing it all in Perl.

If your files are certain to contain exactly one line then you can read the pairs more simply by writing just

my @g = <>;

I believe the program below performs the same function as your Python and Perl code combined, but I cannot test it as I am unable to install libstree at present.

But as ikegami has pointed out, it would be far better to calculate and store the longest common subsequence for each pair of files and put them into categories afterwards. I won't go on to code this as I don't know what information you need - whether it is just subsequence length or if you need the characters and/or the position of the subsequences as well.

use strict;
use warnings;

use Math::Combinatorics;
use Tree::Suffix;

my @glist = glob "Data/*.g";
my $iterator = Math::Combinatorics->new(count => 2, data => \@glist);

open my $fh, '>', 'tmp.list' or die $!;

my $n = 0;
while (my @pair = $iterator->next_combination) {
  $n++;
  @ARGV = @pair;
  my @g = <>;
  my $tree  = Tree::Suffix->new(@g);
  my $lcs = $tree->lcs;
  @pair = map m|/(.+?)\.|, @pair;
  print "$n: $pair[0] --- $pair[1]\n";
  print $fh, "@pair\n" if $lcs and length $lcs >= 20;
}
like image 107
Borodin Avatar answered Oct 12 '22 14:10

Borodin