I have a directory with a large number of files (~1 million). I need to choose a random file from this directory. Since there are so many files, os.listdir naturally takes an eternity to finish.
Is there a way I can circumvent this problem? Maybe somehow get to know the number of files in the directory (without listing it) and choose the 'n'th file where n is randomly generated?
The files in the directory are randomly named.
Alas, I don't think there is a solution to your problem. One, I don't know of a portable API that returns the number of entries in a directory (without enumerating them first). Two, I don't think there is an API that returns a directory entry by number rather than by name.
So overall, a program has to enumerate O(n) directory entries to get a single random one. The trivial approach of determining the number of entries and then picking one either requires enough RAM to hold the full listing (os.listdir()) or has to enumerate the directory a second time to find the random(n)-th item: n for the counting pass plus n/2 on average for the second pass, so n + n/2 operations overall.
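A minimal sketch of the two-pass variant described above, for illustration only. It assumes Python 3.6+ and uses os.scandir(), which is not mentioned above but yields entries lazily instead of building a full list. The function name pick_random_two_pass is hypothetical, and the sketch assumes the directory does not change between the two passes.

```python
import os
import random

def pick_random_two_pass(directory):
    """Pick a random entry name by enumerating the directory twice."""
    # First pass: count the entries without holding them in memory.
    with os.scandir(directory) as entries:
        count = sum(1 for _ in entries)
    if count == 0:
        raise ValueError("directory is empty")

    # Choose the index of the entry to return.
    target = random.randrange(count)

    # Second pass: stop as soon as the chosen index is reached
    # (about count/2 entries on average).
    with os.scandir(directory) as entries:
        for index, entry in enumerate(entries):
            if index == target:
                return entry.name

    # Only reachable if entries were removed between the two passes.
    raise RuntimeError("directory changed between passes")
```

This keeps memory use constant, but it is still O(n) in the number of entries, as noted above.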
There is a slightly better approach, but only slightly: see randomly-selecting-lines-from-files. In short, there is a way to pick a random item from a list/iterator of unknown length while reading one item at a time, and to ensure that every item is picked with equal probability. But this won't help with os.listdir(), because it returns a list in memory that already contains all 1M+ entries, so you may as well ask it for len()
...
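A minimal sketch of that single-pass selection (reservoir sampling with a reservoir of one), for illustration only. It assumes Python 3.6+ and swaps os.listdir() for os.scandir(), which is not mentioned above but returns a lazy iterator, so the technique can apply. The function name pick_random_reservoir is hypothetical. Like os.listdir(), this considers every entry, including subdirectories.

```python
import os
import random

def pick_random_reservoir(directory):
    """Pick a random entry name in one pass over the directory."""
    chosen = None
    with os.scandir(directory) as entries:
        for seen, entry in enumerate(entries, start=1):
            # Keep the k-th entry with probability 1/k; after the loop
            # every entry has had the same 1/n chance of being kept.
            if random.randrange(seen) == 0:
                chosen = entry.name
    if chosen is None:
        raise ValueError("directory is empty")
    return chosen
```

This still reads all n entries, so the gain over the two-pass approach is only the saved second pass, not a change in complexity.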