Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.

My target data volume range is about 20-30 GB in uncompressed flat files. In terms of lines, a file can contain anywhere from 18 million to 28 million lines. It has around 40-50 columns per line.

I do this type of data reconciliation on a daily basis and it can take about ~10 minutes to read from a file and populate the hash. ~20 minutes spent to read both files and to populate hash.

comparison process takes about ~30-50 minutes including iterating over hash, collecting desired result(s), and writing to output file (csv,psv).

All in all it can take anywhere between 30 minutes to 60 minutes on a 32 core dual xeon cpu server with 256gb of RAM, including intermittent server load, to perform the process.

Now I am trying to bring down the total processing time even further.

Here is my current single threaded approach using perl5.

  1. fetch data from 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. Source of data could be either a flat csv or psv file OR a database query Array of Array result, via DBI client. Data is always unsorted to start with.
  2. To be specific, I read the file line by line,split fields, and choose desired indexes for key,value pair and insert into hash.
  3. After collecting data and populating hash with desired key/value pairs,I start to compare and collect results (mainy comparing on what is missing or different in s2 w.r.t s1 and vice-versa).
  4. dump output in an excel file (very costly if no. of lines is large like ~1 million or greater) or in a simple CSV (cheap operation. preferred method).

I was wondering whether if I could somehow do the first step in parallel i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump output?

What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using perl6 but I am not so certain which one can help me here.

I would really appreciate any general guidance on the matter. I hope I explained my problem well but sadly I don't have much to show for what have I tried till now? and reason is that I am just beginning to tackle this one. I am just unable to see past the single threaded approach and need some help.

Thanks.

EDIT

As my existing problem statement has been deemed by the community as 'too broad' - allow me to attempt to highlight my pain points below:

  1. I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
  2. What type of new techniques are available or applicable with perl6 in order to tackle this problem or type of problem.
  3. If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
  4. Is it possible to compare the data (stored in hash) in parallel?

My current p5 comparison logic is shown below for your reference. Hope this helps and not let this question shutdown.

package COMP;

use strict;
use Data::Dumper;


sub comp 
{
  my ($data,$src,$tgt) = @_; 
  my $result = {};

  my $ms    = ($result->{ms} = {});
  my $mt    = ($result->{mt} = {});
  my $diff  = ($result->{diff} = {});

  foreach my $key (keys %{$data->{$src}})
  {
    my $src_val = $data->{$src}{$key};
    my $tgt_val = $data->{$tgt}{$key};

    next if ($src_val eq $tgt_val);

    if (!exists $data->{$tgt}{$key}) {
      push (@{$mt->{$key}}, "$src_val|NULL");
    }
    if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
      push (@{$diff->{$key}}, "$src_val|$tgt_val") 
    }
  }

  foreach my $key (keys %{$data->{$tgt}})
  {
    my $src_val = $data->{$src}{$key};
    my $tgt_val = $data->{$tgt}{$key};

    next if ($src_val eq $tgt_val);

    if (!exists $data->{$src}{$key}) {
      push (@{$ms->{$key}},"NULL|$tgt_val");
    }
  } 

  return $result;
}

1;

If someone would like to try it out, here is the sample output and the test script used.

script output

[User@Host:]$ perl testCOMP.pl 
$VAR1 = {
          'mt' => {
                    'Source' => [
                                  'source|NULL'
                                ]
                  },
          'ms' => {
                    'Target' => [
                                  'NULL|target'
                                ]
                  },
          'diff' => {
                      'Sunday_isit' => [
                                         'Yes|No'
                                       ]
                    }
        };

Test Script

[User@Host:]$  cat testCOMP.pl 
#!/usr/bin/env perl

use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;

my $data2 = {
  f1 => {
    Amitabh => 'Bacchan',
    YellowSun => 'Yes', 
    Sunday_isit => 'Yes',
    Source => 'source',
  },
  f2 => {
    Amitabh => 'Bacchan',
    YellowSun => 'Yes', 
    Sunday_isit => 'No',
    Target => 'target',
  },
};

my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User@Host:]$ 
like image 611
User9102d82 Avatar asked Aug 29 '18 16:08

User9102d82


1 Answers

If you have an existing and working toolchain you don't have to rewrite it all to use Perl6. It's parallelism mechanisms work fine with external processess too. Consider

allnum.pl6

use v6;

my @processes = 
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
        .map( -> $filename {
            [ $filename, run "perl", "num.pl", $filename, :out ];
        })
        .hyper;

say "Lazyness Here!";
my $time = time;
for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";

num.pl

use warnings;
use strict;

my $file = shift @ARGV;
my $start = time;
my $result = 0;

open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;
    $result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";

On my system

C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s

Files were set up like so

C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num5.txt
like image 139
Holli Avatar answered Sep 21 '22 08:09

Holli