I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try to get an approximate idea of where the files reside on disk and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
When one block of free space is filled up, the HDD has to look for other small blocks of free space to fill with data, which makes the head move around a lot. That, along with the overhead mentioned in the linked articles, is why many small files take a long time to copy.
Open as many of the files as you can at once and read from all of them in parallel, either with threads or with asynchronous I/O. That way the disk scheduler sees everything you want to read and can reorder the requests to reduce seeks on its own.
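A minimal sketch of the threaded variant, assuming the file paths are passed on the command line; NUM_THREADS and RECORD_SIZE are placeholder values to tune for your setup:

```c
/* Sketch only: worker threads each claim a file and read it sequentially,
 * so many reads are outstanding and the kernel's I/O scheduler can reorder
 * them.  NUM_THREADS and RECORD_SIZE are assumptions, not requirements. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_THREADS 32      /* more threads => more requests in flight */
#define RECORD_SIZE 4096    /* 4 KB records, as in the question */

static char **g_files;      /* file paths, taken from argv here */
static int    g_nfiles;
static int    g_next;       /* index of the next unclaimed file */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

static void process_record(const char *rec) { (void)rec; } /* negligible work */

static void *worker(void *arg)
{
    char buf[RECORD_SIZE];
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&g_lock);            /* claim the next file */
        int i = (g_next < g_nfiles) ? g_next++ : -1;
        pthread_mutex_unlock(&g_lock);
        if (i < 0)
            return NULL;

        int fd = open(g_files[i], O_RDONLY);
        if (fd < 0) { perror(g_files[i]); continue; }
        while (read(fd, buf, sizeof buf) == RECORD_SIZE)
            process_record(buf);                /* order does not matter */
        close(fd);
    }
}

int main(int argc, char **argv)
{
    g_files  = argv + 1;
    g_nfiles = argc - 1;

    pthread_t tid[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

Build with something like `gcc -O2 -pthread` and pass the file names as arguments, or adapt it to read the list of paths from a file.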
Perhaps you could schedule all of the reads in quick succession with aio_read. That would put every read into the filesystem's read queue at once, and the filesystem implementation is then free to complete them in whatever order minimizes seeks.
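A rough sketch of that idea, submitting a whole batch of aio_read requests before waiting on any of them; BATCH and FILE_SIZE are assumptions (a real version would use fstat() to size each buffer):

```c
/* Sketch only: queue one aio_read per file for a batch of files, then wait
 * for the whole batch, so many reads are pending at the same time. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH     64            /* reads kept in flight at a time */
#define FILE_SIZE (500 * 1024)  /* ~500 KB per file, as in the question */

int main(int argc, char **argv)
{
    for (int base = 1; base < argc; base += BATCH) {
        int n = (argc - base < BATCH) ? argc - base : BATCH;
        struct aiocb cbs[BATCH];
        const struct aiocb *list[BATCH];

        /* submit every read in the batch before waiting on anything */
        for (int i = 0; i < n; i++) {
            memset(&cbs[i], 0, sizeof cbs[i]);
            list[i] = NULL;
            cbs[i].aio_fildes = open(argv[base + i], O_RDONLY);
            if (cbs[i].aio_fildes < 0) { perror(argv[base + i]); continue; }
            cbs[i].aio_buf    = malloc(FILE_SIZE);
            cbs[i].aio_nbytes = FILE_SIZE;
            cbs[i].aio_offset = 0;
            if (aio_read(&cbs[i]) != 0) {
                perror("aio_read");
                close(cbs[i].aio_fildes);
                free((void *)cbs[i].aio_buf);
                continue;
            }
            list[i] = &cbs[i];
        }

        /* now collect completions; NULL entries in list are ignored */
        for (int i = 0; i < n; i++) {
            if (list[i] == NULL)
                continue;
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(list, n, NULL);
            if (aio_return(&cbs[i]) > 0) {
                /* walk the 4 KB records in cbs[i].aio_buf here */
            }
            close(cbs[i].aio_fildes);
            free((void *)cbs[i].aio_buf);
        }
    }
    return 0;
}
```

On older glibc you may need to link with -lrt. Worth knowing: glibc implements POSIX AIO with a user-space thread pool (see aio(7)), so in practice this behaves much like the threaded approach above.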