I plan on getting a huge folder of data. The total size of the folder would be approximately 2TB and it would consist of about 2 million files. I will need to perform some processing on those files (mainly removing 99% of them).
I anticipate some issues due to the size of the data. In particular, I would like to know if Python is able to list these files correctly using os.listdir() in a reasonable time.
For instance, I know from experience that in some cases, deleting huge folders like this one on Ubuntu can be painful.
os.scandir was created largely because of issues with using os.listdir on huge directories, so I would expect os.listdir to struggle in the scenario you describe, whereas os.scandir should perform better. It processes the folder with lower memory consumption, since it returns a lazy iterator rather than building the whole list of names up front, and you typically also benefit at least a little from avoiding per-entry stat calls (e.g. to distinguish files from directories), because that information is usually available on each entry from the directory scan itself.
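A minimal sketch of what that could look like for your use case (the path and the `.keep` suffix used to decide which files survive are hypothetical placeholders, not something from your question):

```python
import os

def prune_directory(root="/path/to/huge_folder", keep_suffix=".keep"):
    """Walk one huge directory lazily and delete files that don't match keep_suffix."""
    removed = kept = 0
    # os.scandir yields entries one at a time instead of materializing
    # a 2-million-element list the way os.listdir would.
    with os.scandir(root) as entries:
        for entry in entries:
            # entry.is_file() can usually answer from information returned by the
            # directory scan, avoiding an extra stat() call per entry.
            if entry.is_file() and not entry.name.endswith(keep_suffix):
                os.remove(entry.path)
                removed += 1
            else:
                kept += 1
    return removed, kept
```

If you need attributes beyond the file type (size, mtime), `entry.stat()` caches its result on the entry object, so you still pay at most one stat per file.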