 

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get a file listing, but the directory contains more than 250,000 files. This makes readdir take a long time (over 4 minutes) and use over 80 MB of RAM. Since this is intended to be a recurring job that runs every 5 minutes, that lag time is not acceptable.

More info: Another job fills the directory being scanned (once per day). This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run. The Perl script is to run every 5 minutes and process (if applicable) up to 1000 files. The file-count limit is intended to let downstream processing keep up, since the Perl script pushes data into a database, which triggers a complex workflow.

Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), that would greatly increase the speed of this script?

asked Nov 29 '22 by Walinmichi

2 Answers

What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?

Are you doing something like this:

foreach my $file ( readdir($dir) ) { 
   #do stuff here
}

If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.

The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.

The fix for this is to use a while loop and use readdir in a scalar context.

while ( defined( my $file = readdir $dir ) ) {
    # do stuff
}

Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
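Putting the two ideas together, a minimal sketch of the scalar-context loop with a counter might look like this. The temporary directory, the five test files, and the cap of 3 are stand-ins for the question's real spool directory and its 1000-file limit:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Stand-in for the question's 250,000-file directory: a throwaway
# directory seeded with a handful of files.
my $dir_path = tempdir( CLEANUP => 1 );
for my $i ( 1 .. 5 ) {
    open my $fh, '>', "$dir_path/file$i.txt" or die "Cannot create: $!";
    close $fh;
}

my $limit = 3;    # stand-in for the question's 1000-file cap
my @batch;

opendir( my $dh, $dir_path ) or die "Cannot open $dir_path: $!";
while ( defined( my $file = readdir $dh ) ) {
    next if $file eq '.' || $file eq '..';    # skip the dot entries
    push @batch, $file;                       # process $file here instead
    last if @batch >= $limit;                 # stop once the cap is reached
}
closedir $dh;

print scalar(@batch), "\n";    # prints 3
```

Because readdir is called once per iteration, memory use stays flat no matter how large the directory is, and the `last` guard stops the scan as soon as the per-run quota is filled.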

answered Dec 06 '22 by daotoad


The solution may lie at the other end: in the script that fills the directory...

Why not create a directory tree to store all those files, so that you have many directories, each with a manageable number of files?

Instead of creating "mynicefile.txt", why not create "m/my/mynicefile", or something like that?

Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
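One way to derive such a layout is to hash on the first characters of the filename. This is only a sketch; the two-level scheme and the root path are assumptions, not something from the question:

```perl
use strict;
use warnings;
use File::Spec;

# Map "mynicefile.txt" to "<root>/m/my/mynicefile.txt", in the spirit
# of the answer's "m/my/mynicefile" suggestion.
sub hashed_path {
    my ( $root, $name ) = @_;
    my $level1 = substr( $name, 0, 1 );    # first letter:  "m"
    my $level2 = substr( $name, 0, 2 );    # first two:     "my"
    return File::Spec->catfile( $root, $level1, $level2, $name );
}

my $path = hashed_path( '/data/spool', 'mynicefile.txt' );
print "$path\n";    # /data/spool/m/my/mynicefile.txt
```

With 250,000 roughly evenly distributed names, each leaf directory then holds only a few hundred files, so every readdir call stays fast.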

answered Dec 06 '22 by siukurnin