Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Perl's Glob have a limitation?

Tags:

perl

I am running the following expecting return strings of 5 characters:

while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
  print "$_\n";
}

but it returns only 4 characters:

anbc
anbd
anbe
anbf
anbg
...

However, when I reduce the number of characters in the list:

while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
  print "$_\n";
}

it returns correctly:

aamid
aamie
aamif
aamig
aamih
...

Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?

If it makes any difference, It returns the same result in both perl 5.26 and perl 5.28

like image 652
Gerry Avatar asked Nov 16 '19 06:11

Gerry


2 Answers

The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.

In your first example that's 265 strings (11_881_376), each five chars long. So a list of ~12 million strings, with (naive) total in excess of 56Mb ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So at the order of a 100Mb's, at the very least, right there in one list.

I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.

As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.

However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend to write an algorithm that generates and provides one list element at a time (an "iterator"), and work with that.

There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...

Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)


 I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.

Update   A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.

It did produce the right number of actual 5-char long strings, seemingly correct, on this server.

like image 155
zdim Avatar answered Nov 18 '22 09:11

zdim


Everything has some limitation.

Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:

use v5.10;

use Set::CrossProduct;

my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );

while( my $item = $set->get ) {
    say join '', @$item
    }
like image 35
brian d foy Avatar answered Nov 18 '22 08:11

brian d foy