Cost of a page fault trap

Tags: c, memory-management, linux, x86-64, intel

I have an application that periodically (every 1 or 2 seconds) takes checkpoints by forking itself. A checkpoint is thus a fork of the original process that stays idle until it is asked to take over when an error occurs in the original process.

My question is how costly the copy-on-write mechanism of fork is. What is the cost of the page fault trap that occurs whenever the original process writes to a memory page for the first time after taking a checkpoint? The copy-on-write mechanism ensures that the original process gets a different physical page than the checkpoint.

In my opinion, the page fault trap overhead could be quite high: an interrupt occurs, and we go from user space to kernel space and then back to user space. How many CPU cycles can I lose to such a page fault trap? Assume that RAM is big enough that we never need to swap to the hard disk.

I know that it's difficult to imagine a checkpointing scheme more efficient than this, so you could ask why I'm worrying about page fault trap overhead, but I'm asking just to get an idea of how much this scheme will cost.

236

asked Apr 19 '12 07:04

pythonic

1 Answer

You can do the rough math for an educated guess yourself. Assuming no disk access (which would cost ~10 billion cycles), you have to account for:

- 160 cycles for the trap and returning (approximately, on x86_64)
- validity checks, quota, accounting, and whatnot (unknown, probably a few hundred to a thousand cycles)
- aligned memcpy of 4096 bytes, something around 500-800 cycles
- TLB invalidation (adds 10-100 cycles on first access)
- either eviction of other cached data or one guaranteed cache miss (80-400 cycles), depending on the implementation of the memcpy. Which of the two is better depends a lot on your access pattern.
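Adding up the items above is simple arithmetic, sketched here in C. The per-item figures are the ones listed; the 300 and 1000 bounds for the "few hundred to a thousand" item are my own assumption, and the function names are made up for illustration. The second function turns a per-fault cost into a fraction of the checkpoint interval, given a clock rate and a number of dirtied pages that you would have to supply for your own workload.

```c
#include <assert.h>
#include <stddef.h>

struct cycle_range { unsigned lo, hi; };   /* CPU cycles */

static const struct cycle_range components[] = {
    {160,  160},   /* trap and return */
    {300, 1000},   /* validity checks, quota, accounting (assumed bounds) */
    {500,  800},   /* aligned memcpy of 4096 bytes */
    { 10,  100},   /* TLB invalidation */
    { 80,  400},   /* cache eviction or one cache miss */
};

/* Sum the lower and upper bounds of all components. */
struct cycle_range cow_fault_budget(void)
{
    struct cycle_range total = {0, 0};
    for (size_t i = 0; i < sizeof components / sizeof components[0]; i++) {
        total.lo += components[i].lo;
        total.hi += components[i].hi;
    }
    return total;
}

/* Fraction of one checkpoint interval spent handling COW faults. */
double cow_overhead_fraction(double cycles_per_fault, double dirty_pages,
                             double cpu_hz, double interval_s)
{
    assert(cpu_hz > 0.0 && interval_s > 0.0);
    return cycles_per_fault * dirty_pages / (cpu_hz * interval_s);
}
```

With these bounds the sum comes out at 1050 to 2460 cycles per fault. As a purely hypothetical example, 2000 cycles per fault with 10,000 pages dirtied per 1-second interval on a 3 GHz core is about 0.67% of the interval.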

So all in all, we're talking about something around 2000 cycles, with some of the effects (e.g. TLB and cache effects) spread out over later accesses and not immediately visible. Omondi and Sedukhin reported 1700 cycles on a P-III back in 2003, which is consistent with this estimate.
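If you would rather measure than estimate, a micro-benchmark along these lines gives a ballpark figure on your own machine. It is a sketch under assumptions, not a rigorous benchmark: `measure_cow` and `struct cow_timing` are names I made up, the idle child merely stands in for the checkpoint, and the result will vary with the kernel, the CPU, and settings such as transparent huge pages.

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct cow_timing {
    double warm_ns;   /* ns per page, pages private: no fault expected */
    double cow_ns;    /* ns per page, pages shared with a forked child */
};

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Write one byte to each page; return the average time per page. */
static double touch_pages_ns(volatile char *buf, size_t npages, size_t page)
{
    double t0 = now_ns();
    for (size_t i = 0; i < npages; i++)
        buf[i * page] = (char)i;
    return (now_ns() - t0) / (double)npages;
}

/* Returns 0 on success, -1 if mmap() or fork() fails. */
int measure_cow(size_t npages, struct cow_timing *out)
{
    assert(npages > 0 && out != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 1, len);                  /* make every page resident */
    out->warm_ns = touch_pages_ns(buf, npages, page);

    pid_t child = fork();                 /* pages are now shared, read-only */
    if (child < 0) {
        munmap(buf, len);
        return -1;
    }
    if (child == 0) {                     /* idle "checkpoint" */
        pause();
        _exit(0);
    }

    /* The first write to each page now faults and copies the page. */
    out->cow_ns = touch_pages_ns(buf, npages, page);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    munmap(buf, len);
    return 0;
}
```

Subtract `warm_ns` from `cow_ns` to get the per-page cost of the fault and multiply by your clock rate in GHz to convert to cycles. Because the loop touches pages sequentially, the cache and TLB effects mentioned above may show up differently than with your real access pattern.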

Note that if the page has never been written to before, things are slightly different, according to a comment by L. Torvalds back in 2000: a copy-on-write miss on a zero page pulls another zero page from the pool instead of copying zeroes. That's pretty much a guaranteed cache miss too, though.

107

answered Sep 28 '22 23:09

Damon
