
Perl: removing non-ASCII characters is too slow compared to a system call to a C program

Tags: c, ascii, perl

I have Perl code that removes non-ASCII characters and ^M (carriage return). For a 1.5 GB file it takes 14 minutes (64-bit Linux, 16 GB RAM, LANG=C set).

I rewrote the same logic in C and now invoke it from Perl with a system call. That takes only 36 s (same machine, same file).

Why is this Perl code so slow? (I have tried to replicate a small section of the project.) It takes roughly 23x as long as the C program below (14 min versus 36 s).

Perl

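my $newfile = $ARGV[0] . ".cleaned";
open(IN, "<", $ARGV[0]) || die "Could not open file $ARGV[0]: $!";
open(OUT, ">", $newfile) || die "Could not open file $newfile: $!";
# Copy byte by byte, skipping CR (13) and anything above 126.
while (defined($c = getc(IN))) {
    next if ((($i = ord($c)) > 126) || ($i == 13));
    print OUT $c;
}
close IN;
close OUT;   # flush buffered output before renaming
rename($newfile, $ARGV[0]);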

C

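#include <stdio.h>
int main(int argc, char **argv){
  char tmpfile[200];
  FILE *fin,*fout;
  int c;
  if (argc < 2) return 1;   /* need an input file name */
  snprintf(tmpfile,sizeof tmpfile,"%s.cleaned",argv[1]);
  fin = fopen(argv[1], "rb");
  fout = fopen(tmpfile, "wb");
  if (!fin || !fout) { perror("fopen"); return 1; }
  /* Copy byte by byte, skipping CR (13) and anything above 126. */
  while ((c = fgetc(fin)) != EOF) {
    if (c==13 || c > 126) continue;
    fputc(c, fout);
  }
  fclose(fin);
  fclose(fout);
  rename(tmpfile,argv[1]);
  return 0;
}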
Ram asked Jan 08 '15 08:01


1 Answer

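Reading in 1 MiB chunks and deleting the unwanted bytes with tr/// should be faster than reading one byte at a time. Setting $/ to a reference to an integer makes Perl read fixed-size records of that many bytes instead of lines: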

perl -i -pe 'BEGIN{ $/ = \1024**2 } tr|\r\x7F-\xFF||d' file
mpapec answered Oct 03 '22 19:10