I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.
Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX
Any hints on accomplishing this, or notes in general?
Thanks!
To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.
Following up (2 years later): Just for posterity, I have installed the SIGINT handler and it worked great. This does not protect me against power failure, but that is a risk I can handle. Code at https://code.google.com/p/pawnsbfs/source/browse/trunk/hsort.c and https://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c
Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:
1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.
2) Copy A to the scratch space, record that you've done so, flush.
3) Copy B over A, then record in the scratch space that you have done so, flush
4) Copy from the scratch space over B.
5) Remove the record.
This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.
In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process, but it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts then you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not. With a UPS or if you think power cuts are so rare you don't mind losing data, that's fine.
On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.
Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.
Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.
Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t
to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.
Install a handler for SIGINT
that just sets a "process should exit soon" flag.
In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.
The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);
.
As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.
Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.
Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With