Opening many small files on NTFS is way too slow

I am writing a program that needs to process many small files, say thousands or even millions. I've been testing that part on 500k files. The first step is just to iterate a directory tree which contains around 45k directories (including subdirectories of subdirectories, etc.) and 500k small files. Traversing all directories and files, including getting file sizes and calculating the total size, takes about 6 seconds. Now, if I try to open each file while traversing and close it immediately, it looks like it never stops. In fact, it takes way too long (hours...). Since I'm doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything to the files, although in the final implementation I'll only need to read. However, I didn't see a noticeable improvement in any of the attempts.
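For reference, here is a minimal sketch (my reconstruction, not the actual code) of the kind of traversal-plus-open/close loop described above, using FindFirstFileW/FindNextFileW and CreateFileW. The root path is a placeholder and error handling is trimmed:

    #include <windows.h>
    #include <cwchar>
    #include <cstdio>
    #include <string>

    // Recursively walk `dir`, open each file read-only, and close it at once.
    // This mirrors the test described above; the root path is only an example.
    static void Walk(const std::wstring& dir, unsigned long long& totalSize)
    {
        WIN32_FIND_DATAW fd;
        HANDLE find = FindFirstFileW((dir + L"\\*").c_str(), &fd);
        if (find == INVALID_HANDLE_VALUE) return;

        do {
            if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
                continue;

            std::wstring path = dir + L"\\" + fd.cFileName;
            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
                Walk(path, totalSize);          // recurse into the subdirectory
            } else {
                totalSize += (static_cast<unsigned long long>(fd.nFileSizeHigh) << 32)
                             | fd.nFileSizeLow; // size comes free with enumeration

                // The expensive part: one open/close per file.
                HANDLE h = CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                                       nullptr, OPEN_EXISTING,
                                       FILE_ATTRIBUTE_NORMAL, nullptr);
                if (h != INVALID_HANDLE_VALUE)
                    CloseHandle(h);
            }
        } while (FindNextFileW(find, &fd));

        FindClose(find);
    }

    int main()
    {
        unsigned long long total = 0;
        Walk(L"C:\\test\\repo", total);         // placeholder root directory
        wprintf(L"total size: %llu bytes\n", total);
    }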

Is there a more efficient way to open the files with any of the available functions, whether in C, C++ or the Windows API, or is the only faster approach to read the MFT and read blocks of the disk directly, which I am trying to avoid?

Update: The application I am working on takes backup snapshots with versioning, so it also does incremental backups. The test with 500k files is run on a huge source code repository in order to version it, something like an SCM. So the files are not all in one directory; there are around 45k directories as well (mentioned above).

So the proposed solution to zip the files doesn't help: when the backup runs, that is exactly when all the files are accessed. I'd see no benefit from it, and it would even incur some performance cost.

Amy asked Jan 08 '15
2 Answers

What you are trying to do is intrinsically difficult for any operating system to do efficiently. 45,000 subdirectories require a lot of disk access no matter how you slice it.

Any file over about 1,000 bytes is "big" as far as NTFS is concerned. If there were a way to make most data files less than about 900 bytes, you could realize a major efficiency by having the file data stored inside the MFT. Then it would be no more expensive to obtain the data than it is to obtain the file's timestamps or size.
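If you want to check how much of your data is already MFT-resident, here is a rough sketch. It assumes (based on the documented behavior of FSCTL_GET_RETRIEVAL_POINTERS) that the call fails with ERROR_HANDLE_EOF when a file has no clusters allocated, i.e. its data is resident or the file is empty; treat it as a heuristic, and the path below is just a placeholder:

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    // Heuristic: a file whose data is resident in the MFT has no clusters
    // allocated, and (assumption) FSCTL_GET_RETRIEVAL_POINTERS then fails
    // with ERROR_HANDLE_EOF instead of returning an extent list.
    bool LooksResident(const wchar_t* path)
    {
        HANDLE h = CreateFileW(path, FILE_READ_ATTRIBUTES, FILE_SHARE_READ,
                               nullptr, OPEN_EXISTING, 0, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;

        STARTING_VCN_INPUT_BUFFER in = {};   // start at VCN 0
        RETRIEVAL_POINTERS_BUFFER out = {};  // room for a single extent
        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS, &in, sizeof(in),
                                  &out, sizeof(out), &bytes, nullptr);
        DWORD err = ok ? ERROR_SUCCESS : GetLastError();
        CloseHandle(h);

        // No clusters mapped: the data (if any) lives inside the MFT record.
        return !ok && err == ERROR_HANDLE_EOF;
    }

    int main()
    {
        // Placeholder path, for illustration only.
        wprintf(L"resident? %d\n", LooksResident(L"C:\\test\\repo\\small.txt"));
    }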

I doubt there is any way to tweak the program's parameters, process options, or even the operating system's tuning parameters to make the application perform well. You are faced with a multi-hour operation unless you can rearchitect the application in a radically different way.

One strategy would be to distribute the files across multiple computers (probably thousands of them) and have a sub-application on each one process its local files, feeding the results back to a master application.

Another strategy would be to re-architect all the files into a few larger files, such as big .zip files as suggested by @felicepollano, effectively virtualizing your set of files. Random access to a 4000 GB file is an inherently far more efficient and effective use of resources than accessing 4 million 1 MB files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would accomplish the same thing, and perhaps provide other benefits such as easy searching and a simple archival strategy.
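To make the "virtualize the file set" idea concrete, here is a minimal, hypothetical pack-file sketch (the class names and layout are invented for illustration; a real implementation would persist the index, handle errors, and likely compress):

    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical "pack file": append each small file's bytes to one large data
    // file and remember its offset/length, so a later read is a single seek into
    // an already-open handle instead of a per-file open on NTFS.
    struct PackEntry { std::uint64_t offset; std::uint64_t size; };

    class PackWriter {
    public:
        explicit PackWriter(const std::string& packPath)
            : out_(packPath, std::ios::binary) {}

        void Add(const std::string& name, const std::vector<char>& data) {
            PackEntry e{ static_cast<std::uint64_t>(out_.tellp()),
                         static_cast<std::uint64_t>(data.size()) };
            out_.write(data.data(), static_cast<std::streamsize>(data.size()));
            index_[name] = e;               // in-memory index; persist it as needed
        }

        const std::map<std::string, PackEntry>& Index() const { return index_; }

    private:
        std::ofstream out_;
        std::map<std::string, PackEntry> index_;
    };

    // Reading back: one open handle, one seek per logical "file".
    std::vector<char> ReadFromPack(std::ifstream& pack, const PackEntry& e) {
        std::vector<char> buf(e.size);
        pack.seekg(static_cast<std::streamoff>(e.offset));
        pack.read(buf.data(), static_cast<std::streamsize>(e.size));
        return buf;
    }

Either way, the point is the same as the database suggestion: replace millions of NTFS opens with a handful of opens plus cheap in-file seeks.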

wallyk answered Oct 04 '22

An overhead of 5 to 20 ms per file isn't abnormal for an NTFS volume with that number of files. (On a conventional spinning drive you can't expect to do much better than that anyway, because the overhead is on the same order as the head seek time. From this point on, I'll assume we're dealing with enterprise-class hardware: SSD and/or RAID.)

In my experience, you can significantly increase throughput by parallelizing the requests, i.e., using multiple threads and/or processes. Most of the overhead appears to be per-thread: the system can open ten files at once nearly as quickly as it can open a single file by itself. I'm not sure why this is. You may need to experiment to find the optimum level of parallelism.
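A minimal sketch of that kind of parallel open/close loop, assuming the file list has already been gathered; the thread count and the read-only/share flags are assumptions you would tune:

    #include <windows.h>
    #include <atomic>
    #include <string>
    #include <thread>
    #include <vector>

    // Open and close every file in `paths` using `threadCount` worker threads.
    // The optimum thread count is workload- and hardware-dependent: tune it.
    void OpenAllParallel(const std::vector<std::wstring>& paths, unsigned threadCount)
    {
        std::atomic<size_t> next{0};
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < threadCount; ++t) {
            workers.emplace_back([&] {
                for (size_t i = next++; i < paths.size(); i = next++) {
                    HANDLE h = CreateFileW(paths[i].c_str(), GENERIC_READ,
                                           FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                           FILE_ATTRIBUTE_NORMAL, nullptr);
                    if (h != INVALID_HANDLE_VALUE)
                        CloseHandle(h);         // real code would read the file here
                }
            });
        }
        for (auto& w : workers) w.join();
    }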

The system administrator can also significantly improve performance by copying the contents to a new volume, preferably in approximately the same order that they will be accessed. I had to do this recently, and it reduced backup time (for a volume with about 14 million files) from 85 hours to 18 hours.

You might also try OpenFileById(), which may perform better for files in large directories because it bypasses the need to enumerate the directory tree. However, I've never tried it myself, and it might not have much impact, since the directory is likely to be cached anyway if you've just enumerated it.
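For what it's worth, here is a sketch of how OpenFileById() is called, assuming you already have the 64-bit file ID from directory enumeration (e.g. the FileId field of FILE_ID_BOTH_DIR_INFO) or from another handle; the volume path and the ID below are placeholders:

    #include <windows.h>

    // Open a file by its 64-bit NTFS file ID instead of by path. The ID would
    // normally come from directory enumeration (e.g. FILE_ID_BOTH_DIR_INFO) or
    // from GetFileInformationByHandle on another handle to the same file.
    HANDLE OpenByFileId(HANDLE volumeHint, LONGLONG fileId)
    {
        FILE_ID_DESCRIPTOR desc = {};
        desc.dwSize = sizeof(desc);
        desc.Type = FileIdType;
        desc.FileId.QuadPart = fileId;

        return OpenFileById(volumeHint, &desc, GENERIC_READ, FILE_SHARE_READ,
                            nullptr, FILE_ATTRIBUTE_NORMAL);
    }

    int main()
    {
        // A handle to any file or directory on the volume serves as the hint.
        HANDLE hint = CreateFileW(L"C:\\", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        if (hint == INVALID_HANDLE_VALUE) return 1;

        HANDLE h = OpenByFileId(hint, 0x12345LL);   // placeholder file ID
        if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
        CloseHandle(hint);
        return 0;
    }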

You can also enumerate the files on the disk more quickly by reading them from the MFT, although it sounds as if that isn't a bottleneck for you at the moment.
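If you do want to go that route, a rough sketch of MFT enumeration via FSCTL_ENUM_USN_DATA follows. It requires administrator rights, assumes the older MFT_ENUM_DATA_V0 input layout (newer Windows versions may expect MFT_ENUM_DATA_V1), and simply prints the file names it finds:

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>
    #include <vector>

    // Enumerate every file record on an NTFS volume straight from the MFT,
    // without walking the directory tree. Needs administrator rights.
    int wmain()
    {
        HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                 OPEN_EXISTING, 0, nullptr);
        if (vol == INVALID_HANDLE_VALUE) return 1;

        MFT_ENUM_DATA_V0 med = {};
        med.StartFileReferenceNumber = 0;
        med.LowUsn = 0;
        med.HighUsn = MAXLONGLONG;

        std::vector<BYTE> buf(1 << 16);
        DWORD bytes = 0;

        while (DeviceIoControl(vol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                               buf.data(), (DWORD)buf.size(), &bytes, nullptr)) {
            // The first 8 bytes of the output are the reference number to use
            // as StartFileReferenceNumber for the next call.
            BYTE* p = buf.data() + sizeof(DWORDLONG);
            BYTE* end = buf.data() + bytes;
            while (p < end) {
                USN_RECORD* rec = reinterpret_cast<USN_RECORD*>(p);
                wprintf(L"%.*s\n", (int)(rec->FileNameLength / sizeof(WCHAR)),
                        reinterpret_cast<WCHAR*>(p + rec->FileNameOffset));
                p += rec->RecordLength;
            }
            med.StartFileReferenceNumber = *reinterpret_cast<DWORDLONG*>(buf.data());
        }
        CloseHandle(vol);   // the loop ends with ERROR_HANDLE_EOF when the MFT is done
        return 0;
    }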

Harry Johnston answered Oct 04 '22