
glob.iglob results ordered by name?

I need to iterate through a potentially very large (arbitrarily large) directory. From what I understand, the regular glob.glob function stores a list of all the matching filenames in memory, but the glob.iglob function uses an iterator. So using the regular glob.glob function is out of the question, since there may be a lot of files in the directory.

My problem is that iglob iterates through the directory in a seemingly random order. I would like it to iterate through the files in alphabetical order. I cannot get a list of all the filenames at once and just sort them, so I am wondering if there is a way to make iglob iterate through the directory in alphabetical order.

Nick asked Nov 11 '12 at 21:11



2 Answers

No, there isn't, not without reading all the contents of the directory into memory. The operating system provides the filenames in directory order, and would need to read the contents into memory in full as well if it wanted to sort these.

If the set of matching files is small enough to fit into memory, you could sort the results by calling sorted() on the iglob() output:

for filename in sorted(iglob(path)):
    ...  # handle each filename here, in alphabetical order

Note that iglob() already loads all entries of a single directory into a list when it is not recursing into subdirectories (partly because fnmatch.filter() returns a list).
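Since iglob() builds that per-directory list anyway, a scan that really streams names one at a time needs a different call. The following is only a minimal sketch, assuming Python 3.6 or later, where os.scandir() can be used as a context manager; iter_matching is a hypothetical helper name, not part of the glob module. It yields names in directory order, so it does nothing for the sorting problem.

```python
import fnmatch
import os


def iter_matching(directory, pattern):
    """Yield the names in `directory` that match `pattern`, one at a time.

    Names arrive in whatever order the operating system reports them;
    no list of the whole directory is built by this function.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name
```

Usage would look like `for name in iter_matching("/some/dir", "*.txt"): ...`, where the directory path is a placeholder.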

Martijn Pieters answered Sep 21 '22 at 23:09



From the glob module's documentation:

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell.

And if we look at the documentation for os.listdir:

os.listdir(path)

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries '.' and '..' even if they are present in the directory.

So glob.glob does not return the files in alphabetical order. No ordering is promised anywhere in the documentation, and relying on one is a bug. If you want an ordered sequence you must sort the result. It follows that there is no way to make iglob return a sorted result, since it does not even have all the results available at once.

If memory is really a problem then you have two choices:

  1. Drop the "alphabetical order" requirement and just use iglob.
  2. Sort the data using some kind of external sort (a "bucket sort" or merge sort that keeps most of the data on disk and loads it into RAM in chunks). Such techniques are explained in The Art of Computer Programming, Volume 3. This approach will make your program slower and probably much harder to write, but if you really cannot hold all the filenames in RAM then you'll have to save them on disk eventually.
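
The second choice can be sketched as an external merge sort, which is a different technique from a bucket sort but follows the same idea of keeping most names on disk. This is a rough sketch under stated assumptions, not a tuned implementation: iter_sorted, _write_run, _read_run and the chunk_size default are all hypothetical names and values, filenames are assumed not to contain newline characters, and the source iterable must itself be lazy for the memory bound to hold (as noted above, iglob() may already hold one directory's names in a list).

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(names, run_dir):
    """Sort one chunk of names and write it to a temp file, one name per line."""
    fd, run_path = tempfile.mkstemp(dir=run_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8", errors="surrogateescape") as run:
        run.writelines(name + "\n" for name in sorted(names))
    return run_path


def _read_run(run_path):
    """Lazily yield the names stored in one sorted run file."""
    with open(run_path, encoding="utf-8", errors="surrogateescape") as run:
        for line in run:
            yield line.rstrip("\n")


def iter_sorted(names, chunk_size=100_000):
    """Yield `names` in sorted order while holding about `chunk_size` names in RAM.

    Each chunk is sorted in memory and written to disk as a "run";
    heapq.merge then reads the runs back one line at a time.
    """
    names = iter(names)
    with tempfile.TemporaryDirectory() as run_dir:
        run_paths = []
        while True:
            chunk = list(islice(names, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_run(chunk, run_dir))
        # One open file per run: a very large input may need larger chunks
        # or a multi-pass merge to stay under the open-file limit.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
```

It could be driven with `for filename in iter_sorted(iglob(path)): ...`, or with any other iterator of names. The trade-off is the one described in the list: every name is written to disk and read back once, so this is slower than a plain sorted() call whenever the names do fit in memory.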
Bakuriu answered Sep 18 '22 at 23:09
