What goes on behind the curtains during disk I/O?

Tags: c, file-io, buffer

When I seek to some position in a file and write a small amount of data (20 bytes), what goes on behind the scenes?

My understanding

To my knowledge, the smallest unit of data that can be written to or read from a disk is one sector (traditionally 512 bytes, but that standard is now changing). That means that to write 20 bytes I need to read a whole sector, modify some of it in memory and write it back to disk.
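
Spelled out as code, the cycle I have in mind looks something like this (purely illustrative; in reality the kernel or the drive does this on my behalf, and real O_DIRECT I/O would also need aligned buffers):

/* Sketch of the read-modify-write cycle described above, assuming a
   512-byte sector and a write that does not cross a sector boundary.
   Illustrative only. */
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512L

void RewriteWithinSector( int fd, long offset, const void *data, size_t len )
{
    char sector[SECTOR_SIZE];
    long sectorStart = offset - (offset % SECTOR_SIZE);

    pread( fd, sector, SECTOR_SIZE, sectorStart );            /* read   */
    memcpy( sector + (offset - sectorStart), data, len );     /* modify */
    pwrite( fd, sector, SECTOR_SIZE, sectorStart );           /* write  */
}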

This is what I expect to be happening in unbuffered I/O. I also expect buffered I/O to do roughly the same thing, but be clever about its cache. So I would have thought that if I blow locality out the window by doing random seeks and writes, both buffered and unbuffered I/O ought to have similar performance... maybe with unbuffered coming out slightly better.

Then again, I know it's crazy for buffered I/O to only buffer one sector, so I might also expect it to perform terribly.

My application

I am storing values gathered by a SCADA device driver that receives remote telemetry for upwards of a hundred thousand points. There is extra data in the file such that each record is 40 bytes, but only 20 bytes of that needs to be written during an update.
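
The exact layout doesn't matter, but for concreteness it is something along these lines (the field names here are invented; only the sizes match my real records):

#include <stdint.h>

typedef struct              /* the 20 bytes rewritten on every update */
{
    uint32_t seconds;       /* timestamp: whole seconds               */
    uint32_t fraction;      /* timestamp: sub-second part             */
    float    value;         /* latest telemetry value                 */
    uint32_t quality;       /* quality/status flags                   */
    uint32_t sequence;      /* update counter                         */
} Data;                     /* sizeof(Data) == 20                     */

typedef struct              /* one 40-byte record in the file         */
{
    uint32_t pointId;       /* static per-point configuration...      */
    uint32_t type;
    uint8_t  reserved[12];
    Data     live;          /* ...plus the part that actually changes */
} Record;                   /* sizeof(Record) == 40                   */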

Pre-implementation benchmark

To check that I don't need to dream up some brilliantly over-engineered solution, I have run a test using a few million random records written to a file that could contain a total of 200,000 records. Each test seeds the random number generator with the same value to be fair. First I erase the file and pad it to the total length (about 7.6 meg), then loop a few million times, passing a random file offset and some data to one of two test functions:

void WriteOldSchool( void *context, long offset, Data *data )
{
    /* Unbuffered: seek and write directly on the file descriptor. */
    int fd = (int)context;
    lseek( fd, offset, SEEK_SET );
    write( fd, (void*)data, sizeof(Data) );
}

void WriteStandard( void *context, long offset, Data *data )
{
    /* Buffered: the same operation through a stdio stream, flushed each time. */
    FILE *fp = (FILE*)context;
    fseek( fp, offset, SEEK_SET );
    fwrite( (void*)data, sizeof(Data), 1, fp );
    fflush( fp );
}
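
For reference, the surrounding harness looks roughly like this (a sketch, not my exact code): it assumes the Data type and the two functions above; the file name, counts, seed and open modes are placeholders; and error checking is stripped for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

#define RECORD_SIZE  40L         /* bytes per record on disk      */
#define NUM_RECORDS  200000L     /* records in the test file      */
#define NUM_WRITES   3000000L    /* random writes per test run    */
#define SEED         12345       /* same seed for both test runs  */

typedef void (*WriteFunc)( void *context, long offset, Data *data );

static double Now( void )
{
    /* wall-clock seconds (CLOCK_MONOTONIC is POSIX) */
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void PadFile( const char *path )
{
    /* erase the file and pad it out to its full length with zeros */
    FILE *fp = fopen( path, "wb" );
    char zeros[RECORD_SIZE] = {0};
    for ( long i = 0; i < NUM_RECORDS; i++ )
        fwrite( zeros, RECORD_SIZE, 1, fp );
    fclose( fp );
}

static void RunTest( const char *name, void *context, WriteFunc writeFunc )
{
    Data data;
    memset( &data, 0, sizeof(data) );
    srand( SEED );

    double start = Now();
    for ( long i = 0; i < NUM_WRITES; i++ )
    {
        /* pick a random record and write to its byte offset */
        long offset = (rand() % NUM_RECORDS) * RECORD_SIZE;
        writeFunc( context, offset, &data );
    }
    printf( "%s: %.0f records/sec\n", name, NUM_WRITES / (Now() - start) );
}

int main( void )
{
    PadFile( "test.db" );
    int fd = open( "test.db", O_WRONLY );
    RunTest( "old school", (void*)(intptr_t)fd, WriteOldSchool );
    close( fd );

    PadFile( "test.db" );
    FILE *fp = fopen( "test.db", "r+b" );
    RunTest( "standard", (void*)fp, WriteStandard );
    fclose( fp );

    return 0;
}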

Maybe no surprises?

The OldSchool method came out on top - by a lot. It was over 6 times faster (1.48 million versus 232,000 records per second). To make sure I hadn't run into hardware caching, I expanded my database size to 20 million records (file size of 763 meg) and got the same results.

Before you point out the obvious call to fflush, let me say that removing it had no effect. I imagine this is because the cache must be committed when I seek sufficiently far away, which is what I'm doing most of the time.

So, what's going on?

It seems to me that the buffered I/O must be reading (and possibly writing back all of) a large chunk of the file whenever I try to write. Because I am hardly ever taking advantage of its cache, this is extremely wasteful.

In addition (and I don't know the details of hardware caching on disk), if the buffered I/O is trying to write a bunch of sectors when I change only one, that would reduce the effectiveness of the hardware cache.
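
If I wanted to test that theory, one option (an untested sketch, not something I have benchmarked) would be to switch the stream to unbuffered mode with setvbuf and see whether the gap closes:

/* Hypothetical follow-up: disable stdio's buffer so each fwrite maps more
   or less directly onto a write syscall, then re-run WriteStandard. */
FILE *fp = fopen( "test.db", "r+b" );    /* file name is a placeholder */
if ( fp != NULL )
    setvbuf( fp, NULL, _IONBF, 0 );      /* _IONBF = unbuffered mode   */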

Are there any disk experts out there who can comment and explain this better than my experimental findings? =)

asked Nov 01 '12 by paddy


2 Answers

Indeed, at least on my system with GNU libc, it looks like stdio is reading 4kB blocks before writing back the changed portion. Seems bogus to me, but I imagine somebody thought it was a good idea at the time.

I checked by writing a trivial C program to open a file, write a small amount of data once, and exit; then I ran it under strace to see which syscalls it actually triggered (a rough reconstruction of the program follows the trace). Writing at an offset of 10000, I saw these syscalls:

lseek(3, 8192, SEEK_SET)                = 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1808) = 1808
write(3, "hello", 5)                    = 5
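
The test program was along these lines (a reconstruction rather than the exact source; the file name is made up and the file is assumed to already exist):

#include <stdio.h>

int main( void )
{
    FILE *fp = fopen( "testfile", "r+" );
    if ( fp == NULL )
        return 1;
    fseek( fp, 10000, SEEK_SET );      /* offset 10000, as in the trace */
    fwrite( "hello", 1, 5, fp );       /* the 5-byte write seen above   */
    fclose( fp );
    return 0;
}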

Seems that you'll want to stick with the low-level Unix-style I/O for this project, eh?

answered by Jamey Sharp


The C standard library functions perform additional buffering and are generally optimized for streaming reads rather than random I/O. On my system, I don't observe the spurious reads that Jamey Sharp saw; I only see them when the offset is not aligned to a page size. It could be that the C library always tries to keep its I/O buffer aligned to 4 kB or something.

In your case, if you're doing lots of random reads and writes across a reasonably small dataset, you'd likely be best served using pread/pwrite to avoid having to make seeking syscalls, or simply mmap-ing the dataset and writing to it in memory (likely to be the fastest, if your dataset fits in memory).
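
Roughly, the two options look like this (a sketch only: the Data type comes from your question, and the file size, names and error handling are placeholders):

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define FILE_SIZE  (200000L * 40L)   /* placeholder: 200,000 records x 40 bytes */

/* Option 1: pwrite -- one syscall per update, no separate lseek needed */
void UpdateWithPwrite( int fd, long offset, const Data *data )
{
    pwrite( fd, data, sizeof(Data), (off_t)offset );
}

/* Option 2: map the whole file once, then an update is just a memcpy */
void *MapFile( int fd )
{
    return mmap( NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
}

void UpdateInMap( void *map, long offset, const Data *data )
{
    memcpy( (char*)map + offset, data, sizeof(Data) );
    /* dirty pages are written back lazily by the kernel; use
       msync( map, FILE_SIZE, MS_SYNC ) if durability matters right away */
}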

answered by bdonlan