
C program stuck on uninterruptible wait while performing disk I/O on Mac OS X Snow Leopard

One line of background: I'm the developer of Redis, a NoSQL database. One of the new features I'm implementing is Virtual Memory, because Redis keeps all its data in memory. With VM, Redis can move rarely used objects from memory to disk. There are a number of reasons why this works much better than letting the OS do the swapping for us: Redis objects are built from many small objects allocated in non-contiguous places, when Redis serializes them to disk they take 10 times less space than the memory pages where they live, and so forth.

Now I have an alpha implementation that works perfectly on Linux, but not so well on Mac OS X Snow Leopard. From time to time, while Redis is moving a page from memory to disk, the redis process enters the uninterruptible wait state for minutes. I have not been able to debug this fully, but it happens in a call to either fseeko() or fwrite(). After some minutes the call finally returns and Redis continues working without any problem: no crash.

The amount of data transferred is very small, something like 256 bytes, so it should not be a matter of a very large amount of I/O being performed.

But there is an interesting detail about the swap file that is the target of the write operation. It's a big file (26 gigabytes), created by opening a file with fopen() and then enlarging it with ftruncate(). Finally the file is unlink()ed, so that Redis keeps a reference to it but we can be sure the OS will really free the swap file when the Redis process exits.
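
For readers who want to reproduce the setup, here is a minimal sketch of the sequence described above: fopen(), ftruncate(), unlink(), then a small write at some offset. It is not the actual Redis code; the names `swap_open` and `swap_write_page` are hypothetical, and the error handling is only illustrative.

```c
#define _XOPEN_SOURCE 700  /* expose fseeko(), fileno(), ftruncate() under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the Redis implementation): create a swap file
 * of `size` bytes at `path`, then unlink it so that the OS reclaims the
 * space once the process exits. Returns the open stream, or NULL on error. */
FILE *swap_open(const char *path, off_t size) {
    assert(path != NULL && size >= 0);
    FILE *fp = fopen(path, "w+b");
    if (fp == NULL) return NULL;
    /* Set the file length without writing any data. */
    if (ftruncate(fileno(fp), size) == -1) {
        fclose(fp);
        unlink(path);
        return NULL;
    }
    /* Drop the directory entry; the open stream keeps the file alive. */
    if (unlink(path) == -1) {
        fclose(fp);
        return NULL;
    }
    return fp;
}

/* Hypothetical helper: write one small page at `offset`. This is the
 * fseeko()/fwrite() pair that the question reports as stalling. */
int swap_write_page(FILE *fp, off_t offset, const void *buf, size_t len) {
    if (fseeko(fp, offset, SEEK_SET) == -1) return -1;
    if (fwrite(buf, 1, len, fp) != len) return -1;
    /* Push the stdio buffer to the kernel so the write really happens here. */
    return fflush(fp) == 0 ? 0 : -1;
}
```

Calling `swap_open()` with a 26 GB size and then `swap_write_page()` with a 256-byte buffer at a high offset should approximate the scenario in the question.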

OK, that's all, but I'm here for any further detail. BTW, you can find the actual code in the Redis git repository, but it's not trivial to understand in five minutes, given that it's a fairly complex system.

Thank you very much for any help.

antirez asked Jan 07 '10 00:01


2 Answers

As I understand it, HFS+ has very poor support for sparse files. So it may be that your write is triggering a file expansion that is initializing/materializing a large fraction of the file.

For example, I know mmap'ing a new large empty file and then writing at a few random locations produces a very large file on disk with HFS+. It's quite annoying since mmap and sparse files are an extremely convenient way of working with data, and virtually every other platform/filesystem out there handles this gracefully.

Is the swap file written to linearly, meaning we either replace an existing block or write a new block at the end and increment a free-space pointer? If so, perhaps making more frequent, smaller ftruncate calls to expand the file would result in shorter pauses.
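
A rough sketch of that idea, assuming the file is written mostly linearly: grow the file in fixed-size chunks only when a write would pass the current end. The `swap_file` struct, the function names and the chunk size are all hypothetical, and whether this really shortens the pauses on HFS+ is a guess that would need measuring.

```c
#define _XOPEN_SOURCE 700  /* expose pwrite() and ftruncate() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for a swap file that grows on demand. */
typedef struct {
    int fd;
    off_t allocated;  /* current file length, as set by ftruncate() */
    off_t chunk;      /* growth step in bytes */
} swap_file;

/* Create an empty, already-unlinked swap file. */
int swap_file_create(swap_file *sf, const char *path, off_t chunk) {
    assert(sf != NULL && path != NULL && chunk > 0);
    sf->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (sf->fd == -1) return -1;
    unlink(path);  /* the descriptor keeps the file alive until close/exit */
    sf->allocated = 0;
    sf->chunk = chunk;
    return 0;
}

/* Make sure the file is long enough to hold [offset, offset + len),
 * extending it to the next multiple of `chunk` if it is not. */
int swap_reserve(swap_file *sf, off_t offset, off_t len) {
    off_t needed = offset + len;
    if (needed <= sf->allocated) return 0;
    off_t new_size = ((needed + sf->chunk - 1) / sf->chunk) * sf->chunk;
    if (ftruncate(sf->fd, new_size) == -1) return -1;
    sf->allocated = new_size;
    return 0;
}

/* Write a page at `offset`, growing the file first when necessary. */
int swap_write_at(swap_file *sf, off_t offset, const void *buf, size_t len) {
    if (swap_reserve(sf, offset, (off_t)len) == -1) return -1;
    return pwrite(sf->fd, buf, len, offset) == (ssize_t)len ? 0 : -1;
}
```

With this layout each expansion touches at most one chunk beyond the old end of the file, so any zero-filling the filesystem performs is bounded by the chunk size rather than by the full 26 GB.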

As an aside, I'm curious why redis VM doesn't use mmap and then just move blocks around in an attempt to concentrate hot blocks into hot pages.

Jason Watkins answered Sep 18 '22 14:09


antirez, I'm not sure I'll be much help since my Apple experience is limited to the Apple ][, but I'll give it a shot.

First thing is a question. I would have thought that, for virtual memory, speed of operation would be a more important measure than disk space (especially for a NoSQL DB where speed is the whole point, otherwise you'd be using SQL, no?). But, if your swap file is 26G, maybe not :-)

Some things to try (if possible).

  1. Try to actually isolate the problem to the seek or write. I have a hard time believing a seek could take that long since, at worst, it should be a buffer pointer change. Still, I didn't write OSX so I can't be sure.
  2. Try adjusting the size of the swap file to see if that's what is causing the problem.
  3. Do you ever dynamically expand the swap file (as opposed to pre-allocation)? If you do, that may be what is causing the problem.
  4. Do you always write as low in the file as you can? Creating a 26G file may not actually fill it with data; but if you create it and then write to the last byte, the OS may have to zero out all the bytes before that point (in other words, it may have deferred the initialization until then).
  5. What happens if you just pre-allocate the entire file (write to every byte) and not unlink it? In other words, leave the file there between runs of your program (creating it if it doesn't already exist of course). Then in your startup code for Redis, just initialize the file (pointers and such). This may get rid of any problems like those in point 4 above.
  6. Ask on the various BSD sites as well. I'm not sure how much Apple changed under the covers but OSX is just BSD at the lowest level (Pax ducks for cover).
  7. Also consider asking on the Apple sites (if you haven't already done so).
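
For point 1, one way to tell the seek from the write is to time each call separately, as in the sketch below. The `io_timing` struct and the function names are hypothetical. Two caveats: with buffered stdio a 256-byte fwrite() usually only copies into the stream buffer, and POSIX requires fseek()/fseeko() to write out pending buffered data, so a slow "seek" may really be the previous write being flushed. Timing an explicit fflush() as a third step helps separate the cases. gettimeofday() is used because it is available on older OS X releases; it is wall-clock time, which is good enough for stalls that last minutes.

```c
#define _XOPEN_SOURCE 700  /* expose fseeko() and gettimeofday() under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>

/* Seconds spent in each step of one page write. */
typedef struct {
    double seek_s;
    double write_s;
    double flush_s;
} io_timing;

/* Current wall-clock time in seconds. */
static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: seek, write and flush one page, recording how long
 * each step took. Returns 0 on success, -1 on any I/O error. */
int timed_page_write(FILE *fp, off_t offset, const void *buf, size_t len,
                     io_timing *t) {
    assert(fp != NULL && t != NULL);
    double t0 = now_seconds();
    if (fseeko(fp, offset, SEEK_SET) == -1) return -1;
    double t1 = now_seconds();
    if (fwrite(buf, 1, len, fp) != len) return -1;
    double t2 = now_seconds();
    if (fflush(fp) != 0) return -1;
    double t3 = now_seconds();
    t->seek_s = t1 - t0;
    t->write_s = t2 - t1;
    t->flush_s = t3 - t2;
    return 0;
}
```

Logging the three values whenever their sum exceeds, say, one second would show which call is actually blocked when the process sits in uninterruptible wait.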

Well, that's my small contribution, hopefully it'll help. Good luck with your project.

paxdiablo answered Sep 17 '22 14:09