
Perl Program to efficiently process 500,000 small files in a directory

Every night I process a large directory that accumulates around 1 million files, half of which are .txt files that I need to move to different directories according to their contents.

Each .txt file is pipe-delimited and contains only 20 records. Record 6 contains the information that determines which directory the file should be moved to.

Example Record:

A|CHNL_ID|4

In this case the file would be moved to /out/4.

The script below processes about 80,000 files per hour.

Are there any recommendations on how I could speed this up?

use File::Copy qw(move);    # move() comes from File::Copy

opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if $txtFile !~ /\.txt$/;    # only .txt files (dot escaped)
    $cnt++;

    local $/;                        # slurp mode: read the whole file at once
    open my $fh, '<', "$dir/$txtFile" or die $!;
    my $data = <$fh>;
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);

    move("$dir/$txtFile", "$outDir/$channel") or die $!;
}
closedir(DIR);
asked Dec 13 '22 by DenairPete


1 Answer

You are being hurt by the sheer number of files in a single directory.

I created 80,000 files and ran your script; it completed in 5.2 seconds. This was on an older laptop with CentOS 7 and Perl v5.16. But with half a million files it takes nearly 7 minutes. So the problem is not the performance of your code per se (though that can also be tightened).

One solution is then simple: run the script from a cron job, say every hour, as the files come in. While you move the .txt files, also move the others elsewhere, so that there is never a large accumulation and the script always runs in seconds. You can move those other files back at the end, if needed. A sample crontab entry is sketched below.
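
For example, a crontab entry along these lines would run it hourly (the script name and path here are an assumption, not from the question):

# run the mover every hour, on the hour; the path is hypothetical
0 * * * * /usr/local/bin/move_txt_files.pl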

Another option is to store these files on a partition with a different filesystem, say ReiserFS. However, this doesn't at all address the main problem of having way too many files in a directory.

Another partial fix is to replace

while ( defined( my $txtFile = readdir DIR ) )

with

while ( my $path = <"$dir/*.txt"> )

which brings the run down to 1m:12s (as opposed to nearly 7 minutes). Don't forget to adjust the file naming, since <> above returns the full path to each file. Again, this doesn't really deal with the underlying problem.
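
Here is a minimal sketch of the adjusted loop, assuming $dir and $outDir are set as in the question:

use File::Copy     qw(move);
use File::Basename qw(basename);

while ( my $path = <"$dir/*.txt"> ) {
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $data = do { local $/; <$fh> };    # slurp the small file whole
    close $fh;

    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i or next;

    # the glob returns a full path, so peel off the bare file name
    move($path, "$outDir/$channel/" . basename($path))
        or die "Can't move $path: $!";
}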

If you had control over how the files are distributed, you would want a directory structure about 3 levels deep, with subdirectories named using the files' MD5 hashes; this results in a very balanced distribution. A sketch of the idea follows.
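
This sketch (not part of the script above) hashes each file name with Digest::MD5 and uses the leading pairs of hex digits as nested directory names; with 256 entries per level, no single directory ever grows large:

use Digest::MD5 qw(md5_hex);
use File::Path  qw(make_path);
use File::Copy  qw(move);

# build a 3-level subdirectory such as "ab/cd/ef" from the MD5
# of the file name, creating the levels as needed
sub hashed_dir {
    my ($name, $base) = @_;
    my @levels = unpack '(A2)3', md5_hex($name);
    my $dir    = join '/', $base, @levels;
    make_path($dir);
    return $dir;
}

# usage: move($file, hashed_dir($file, '/out'));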


For reference, the benchmark file names and their contents were created with

perl -MPath::Tiny -wE'
    path("dir/s".$_.".txt")->spew("A|some_id|$_\n") for 1..500_000
'
answered Dec 29 '22 by zdim