Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing locations of on-disk data for sequential access

I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.

Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?

like image 288
sanity Avatar asked Dec 05 '08 14:12

sanity


1 Answers

On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:

  1. You are in a pre-emptive multitasking system. Your application and all it's data can be flushed to disk at any time - user switches task, screen saver kicks in, battery runs out of charge, etc.
  2. You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
  3. Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2Gb - memory used by application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
  4. Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.

To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).

The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.

From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.

The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to to be really helpful.

like image 189
Skizz Avatar answered Oct 19 '22 10:10

Skizz