Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.

The processing time per record is negligible, and the records can be processed in any order, both within and across different files.

A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.

Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?

One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
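Perhaps something like the FIEMAP ioctl could expose the extent information, if it reports physical offsets; below is a rough, untested sketch of the kind of thing I have in mind (the helper and file paths are just placeholders):

/* Rough sketch: query the first physical extent of a file with the FIEMAP
 * ioctl, so the 10,000 files can be sorted by on-disk position before reading. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Returns the physical byte offset of the file's first extent, or 0 on failure. */
static unsigned long long first_physical_offset(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    /* struct fiemap is followed in memory by fm_extent_count extent records. */
    struct fiemap *fm = calloc(1, sizeof *fm + sizeof(struct fiemap_extent));
    if (!fm) { close(fd); return 0; }

    fm->fm_start        = 0;
    fm->fm_length       = ~0ULL;   /* map the whole file     */
    fm->fm_extent_count = 1;       /* first extent is enough */

    unsigned long long phys = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        phys = fm->fm_extents[0].fe_physical;

    free(fm);
    close(fd);
    return phys;   /* use as the sort key when ordering the reads */
}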

I am of course open to any other ideas.

The filesystem is ext4, but that's negotiable.

asked Mar 23 '12 by NPE


People also ask

Why do many small files take longer to copy?

When one block of free space fills up, the HDD looks for other small blocks of free space to fill with data, which causes the head to move around a lot. That seeking, together with the per-file overhead, is what makes many small files take a long time to copy.

How do I reduce disk seeks?

Open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. That way the disk scheduler knows what you are going to read and can reduce the seeks by itself.
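For example, a minimal sketch of that idea, using posix_fadvise(POSIX_FADV_WILLNEED) as one way to tell the kernel up front which files will be read (the file names are placeholders):

/* Sketch: open everything up front and hint the kernel that the whole contents
 * will be needed, so it can start readahead and order the disk accesses itself. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    enum { NFILES = 16 };             /* placeholder; use 10,000 in the real job */
    int fds[NFILES];
    char name[64];

    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "data/%05d.bin", i);    /* placeholder path */
        fds[i] = open(name, O_RDONLY);
        if (fds[i] >= 0)
            posix_fadvise(fds[i], 0, 0, POSIX_FADV_WILLNEED);  /* len 0 = whole file */
    }

    /* ...then read() each file normally; much of the data should already be in
     * the page cache, fetched in whatever order was cheapest for the disk.     */
    for (int i = 0; i < NFILES; i++)
        if (fds[i] >= 0)
            close(fds[i]);
    return 0;
}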


1 Answer

Perhaps you could schedule all of the reads in quick succession with aio_read. That would put every read into the filesystem's request queue at once, leaving the filesystem implementation free to complete them in an order that minimizes seeks.
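A rough sketch of what that could look like (untested; the file count and names are placeholders, and with 10,000 files you would probably cap how many requests are in flight at once):

/* Build with: cc -O2 read_all.c -lrt   (POSIX AIO lives in librt on older glibc) */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFILES   16              /* placeholder; the real job has 10,000 files */
#define FILESIZE (500 * 1024)    /* each file is ~500KB                        */

int main(void)
{
    static struct aiocb cbs[NFILES];
    static char bufs[NFILES][FILESIZE];

    /* Enqueue one read per file; aio_read() returns immediately, so the whole
     * batch lands in the I/O queue before any data comes back. */
    for (int i = 0; i < NFILES; i++) {
        char name[64];
        snprintf(name, sizeof name, "data/%05d.bin", i);   /* placeholder path */
        int fd = open(name, O_RDONLY);
        if (fd < 0) { perror(name); cbs[i].aio_fildes = -1; continue; }

        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = FILESIZE;
        cbs[i].aio_offset = 0;

        if (aio_read(&cbs[i]) < 0) {            /* enqueue; returns immediately */
            perror("aio_read");
            close(fd);
            cbs[i].aio_fildes = -1;
        }
    }

    /* Reap completions and process the 4KB records in whatever order they land. */
    for (int i = 0; i < NFILES; i++) {
        if (cbs[i].aio_fildes < 0)
            continue;
        const struct aiocb *list[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        ssize_t n = aio_return(&cbs[i]);
        if (n > 0) {
            /* process_records(bufs[i], n);  -- hypothetical per-file callback */
        }
        close(cbs[i].aio_fildes);
    }
    return 0;
}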

answered Oct 15 '22 by Wim Coenen