
Choosing a random file from a directory (with a large number of files) in Python

Tags: python, file

I have a directory with a large number of files (~1 million). I need to choose a random file from this directory. Since there are so many files, os.listdir naturally takes an eternity to finish.

Is there a way I can circumvent this problem? Maybe somehow get to know the number of files in the directory (without listing it) and choose the 'n'th file where n is randomly generated?

The files in the directory are randomly named.

NoneType asked Jul 14 '10 14:07



1 Answer

Alas, I don't think there is a solution to your problem. One, I don't know of a portable API that returns the number of entries in a directory without enumerating them first. Two, I don't think there is an API that returns a directory entry by number rather than by name.

So overall, a program has to enumerate O(n) directory entries to get a single random one. The trivial approach of determining the number of entries and then picking one either requires enough RAM to hold the full listing (os.listdir()) or has to enumerate the directory a second time to reach the randomly chosen item: n + n/2 operations overall, on average.
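
To make the two-pass variant concrete, here is a minimal sketch. It assumes Python 3.5 or later, where os.scandir() yields entries one at a time instead of building a list; the function name pick_random_two_pass is made up for illustration. It counts every directory entry (subdirectories included) and assumes the directory does not change between the two passes.

```python
import os
import random
from itertools import islice


def pick_random_two_pass(directory):
    """Count the entries, then walk the directory again up to a random index."""
    # First pass: count entries without holding them in memory (n steps).
    with os.scandir(directory) as entries:
        count = sum(1 for _ in entries)
    if count == 0:
        raise ValueError("directory is empty")

    # Second pass: skip ahead to a random index (about n/2 steps on average).
    # Assumption: the directory contents are unchanged since the first pass.
    index = random.randrange(count)
    with os.scandir(directory) as entries:
        return next(islice(entries, index, None)).name
```

This keeps memory use flat, but the running time is still linear in the number of entries, as described above.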

There is a slightly better approach, but only slightly: see randomly-selecting-lines-from-files. In short, there is a way to pick a random item from a list or iterator of unknown length while reading one item at a time, ensuring that every item is picked with equal probability. But this won't help with os.listdir(), because it returns a list that already holds all 1M+ entries in memory, so you may as well just ask for its len().
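
For reference, the one-item-at-a-time selection described above (reservoir sampling with a reservoir of size one) might look like the sketch below. It relies on an assumption that goes beyond this answer: Python 3.5 or later, where os.scandir() iterates over directory entries lazily, unlike os.listdir(). The function name pick_random_one_pass is hypothetical.

```python
import os
import random


def pick_random_one_pass(directory):
    """Pick one entry uniformly at random in a single pass."""
    chosen = None
    with os.scandir(directory) as entries:
        for seen, entry in enumerate(entries, start=1):
            # Replace the current pick with probability 1/seen, which leaves
            # every entry selected with probability 1/n once the scan ends.
            if random.randrange(seen) == 0:
                chosen = entry.name
    if chosen is None:
        raise ValueError("directory is empty")
    return chosen
```

Even so, every entry is still visited once, so this saves memory and a second pass rather than escaping the O(n) enumeration.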

Nas Banov answered Oct 25 '22 03:10
