
Fastest way to read many 300-byte chunks randomly by file offset from a 2TB file?

I have some 2TB read-only files (never written to once created) on a RAID 5 (4 x 7.2k @ 3TB) system.

Now I have some threads that want to read portions of such a file. Every thread has an array of chunks it needs. Every chunk is addressed by a file offset (position) and a size to read (mostly about 300 bytes).

What is the fastest way to read this data? I don't care about CPU cycles; (disk) latency is what counts. So, if possible, I want to take advantage of the hard disks' NCQ.

As the files are highly compressed, will be accessed randomly, and I know the exact positions, I have no other way to optimize this.

  • Should I pool the file reading to one thread?
  • Should I keep the file open?
  • Should every thread (maybe about 30) keep every file open simultaneously, and what about new threads that arrive (from the web server)?
  • Will it help if I wait 100ms and sort my reads by file offset (lowest first)?

What is the best way to read the data? Do you have any experience, tips, or hints?

Asked by Chris, Jan 17 '12 16:01



1 Answer

The optimum number of parallel requests depends heavily on factors outside your app (e.g. disk count=4, NCQ depth=?, driver queue depth=? ...), so you might want to use a design that can adapt or be tuned. My recommendation is:

  • Write all your read requests into a queue, together with some metadata that allows the worker to notify the requesting thread
  • Have N threads dequeue from that queue, synchronously read the chunk, and notify the requesting thread
  • Make N changeable at runtime
  • Since CPU is not your concern, your worker threads can calculate a moving average of the latency (and/or the maximum, depending on your needs)
  • Slide N up and down until you hit the sweet spot
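The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than the .NET API mentioned in the question, and it assumes a Unix-like system where `os.pread` is available. The class name `ReadPool`, its methods, and the smoothing factor `alpha` are hypothetical names chosen for this example, not part of any library.

```python
import os
import queue
import threading
import time
from concurrent.futures import Future

_STOP = object()  # sentinel that tells exactly one worker to exit


class ReadPool:
    """N worker threads serving (offset, size) read requests from one queue."""

    def __init__(self, path, workers=4, alpha=0.1):
        self._fd = os.open(path, os.O_RDONLY)  # opened once and kept open
        self._requests = queue.Queue()
        self._lock = threading.Lock()
        self._workers = 0
        self._alpha = alpha        # smoothing factor (an assumption)
        self.avg_latency = None    # moving average of read latency, seconds
        self.resize(workers)

    def submit(self, offset, size):
        """Enqueue a read; the returned Future notifies the requester."""
        fut = Future()
        self._requests.put((offset, size, fut))
        return fut

    def resize(self, n):
        """Change the worker count (the effective queue depth) at runtime."""
        with self._lock:
            delta = n - self._workers
            self._workers = n
        for _ in range(max(delta, 0)):
            threading.Thread(target=self._run, daemon=True).start()
        for _ in range(max(-delta, 0)):
            # Shrinking takes effect once earlier queued requests are served.
            self._requests.put(_STOP)

    def _run(self):
        while True:
            item = self._requests.get()
            if item is _STOP:
                return
            offset, size, fut = item
            start = time.perf_counter()
            try:
                # Blocking positional read; safe on a shared descriptor
                # because it does not move a shared file position.
                data = os.pread(self._fd, size, offset)
            except OSError as exc:
                fut.set_exception(exc)
                continue
            elapsed = time.perf_counter() - start
            with self._lock:
                if self.avg_latency is None:
                    self.avg_latency = elapsed
                else:
                    self.avg_latency += self._alpha * (elapsed - self.avg_latency)
            fut.set_result(data)
```

A requesting thread would call `pool.submit(offset, 300)` for each chunk and then wait on `future.result()`. The sketch never closes the descriptor and uses Python's locking `queue.Queue`, so it shows the structure of the approach rather than its best achievable latency.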

Why sync reads? They have lower latency than async reads. Doesn't the queue waste latency? A good lock-free queue implementation starts at less than 10ns of latency, much less than two thread switches.

Update: Some Q/A

Should the read threads keep the files open? Yes, definitely.

Would you use a FileStream with FileOptions.RandomAccess? Yes.

You write "synchronously read the chunk". Does this mean every single read thread should start reading a chunk from disk as soon as it dequeues an order to read a chunk? Yes, that's what I meant. The queue depth of read requests is managed by the thread count.
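Because the thread count acts as the queue depth, finding the sweet spot amounts to a search over N. One way to automate it is a simple hill climb, sketched below in Python. The function `tune_workers` and its `measure` callback are hypothetical: `measure(n)` is assumed to run a representative batch of reads with n worker threads and return a cost to minimise, such as wall-clock seconds per completed read.

```python
def tune_workers(measure, start=4, lo=1, hi=64):
    """Hill-climb the worker count n within [lo, hi] to minimise measure(n)."""
    n = max(lo, min(hi, start))
    best = measure(n)
    while True:
        # Probe the neighbours of n that are still inside the allowed range.
        candidates = [m for m in (n - 1, n + 1) if lo <= m <= hi]
        if not candidates:
            return n
        costs = {m: measure(m) for m in candidates}
        m = min(costs, key=costs.get)
        if costs[m] >= best:
            return n  # neither neighbour improves on n: local optimum
        n, best = m, costs[m]
```

Real disk measurements are noisy, so in practice each `measure(n)` call would need to average over enough reads for neighbouring values of n to be distinguishable. The search also stops at a local optimum, which may not be the global one.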

Answered by Eugen Rieck, Oct 25 '22 14:10
