Software memory bit-flip detection for platforms without ECC

Most available (cheap) desktop x86 platforms still have no support for ECC (Error Checking & Correction) memory. But the rate of memory bit-flip errors is still growing (see a not-so-great SO thread; the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; and Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware under a data-intensive load (8 GB/s of reads) this means that a single bit flip may occur several times a minute (the vendors' 10^-12 BER from CERN07) or once in two days (the 10^-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
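The arithmetic behind these estimates can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, "8 GB/s" is taken as 8e9 bytes per second, and "8GB of RAM" as 8 GiB (65536 Mbit).

```python
BITS_PER_BYTE = 8

def seconds_between_flips(bytes_per_second, bit_error_rate):
    """Mean time between bit errors for a given read throughput and BER."""
    bits_per_second = bytes_per_second * BITS_PER_BYTE
    return 1.0 / (bits_per_second * bit_error_rate)

def errors_per_hour_from_fit(fit_per_mbit, ram_gib):
    """FIT counts failures per 10^9 hours; scale it by the RAM size in Mbit."""
    mbit = ram_gib * 1024 * BITS_PER_BYTE
    return fit_per_mbit * mbit / 1e9

def errors_per_hour_from_yearly(errors_per_gb_per_year, ram_gb):
    """Convert a 'per GB per year' rate to errors per hour for the whole RAM."""
    return errors_per_gb_per_year * ram_gb / (365 * 24)

if __name__ == "__main__":
    print(seconds_between_flips(8e9, 1e-12))          # seconds, vendor BER
    print(seconds_between_flips(8e9, 1e-16) / 86400)  # days, lower BER
    print(errors_per_hour_from_fit(25000, 8), errors_per_hour_from_fit(75000, 8))
    print(errors_per_hour_from_yearly(2000, 8), errors_per_hour_from_yearly(6000, 8))
```

Under these assumptions the 10^-12 BER works out to one flip roughly every 16 seconds and the 10^-16 BER to one flip in a little under two days; both Google09 figures land in the range of about 1.6 to 5.5 errors per hour.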

So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, could one patch the Linux kernel and/or the system compiler to checksum every memory page, and then detect silent memory corruption (bit flips) by regularly recomputing the checksums?
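To make the idea concrete, here is a user-space toy model of page checksumming, not a kernel patch. A `bytearray` stands in for physical memory, `zlib.crc32` is an arbitrary choice of checksum, and the names `page_checksums` and `scrub` are hypothetical. It assumes that no legitimate write happens between recording the checksums and scrubbing.

```python
import zlib

PAGE_SIZE = 4096  # typical x86 page size, assumed here

def page_checksums(memory):
    """Return one CRC32 per PAGE_SIZE-sized chunk of the buffer."""
    return [zlib.crc32(memory[off:off + PAGE_SIZE])
            for off in range(0, len(memory), PAGE_SIZE)]

def scrub(memory, recorded):
    """Recompute all checksums and return the indexes of pages that changed."""
    current = page_checksums(memory)
    return [i for i, (old, new) in enumerate(zip(recorded, current))
            if old != new]

if __name__ == "__main__":
    memory = bytearray(4 * PAGE_SIZE)      # four simulated pages
    recorded = page_checksums(memory)
    memory[2 * PAGE_SIZE + 17] ^= 0x04     # simulate a single bit flip
    print(scrub(memory, recorded))         # indexes of corrupted pages
```

A real implementation would also have to know which pages were written on purpose since the last scrub, which is the hard part of the question.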

For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?

I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.

I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.

asked May 11 '14 00:05

osgx

2 Answers

The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect whether a machine has ECC modules and complain (or print a warning) when it doesn't.

http://www.cyberciti.biz/faq/ecc-memory-modules/
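As a rough sketch of such a check: on Linux, `dmidecode --type memory` (run as root) reports an "Error Correction Type" field for each physical memory array. The helper below only parses text that has already been captured; the function names and the sample text are made up for illustration, and real output contains many more fields.

```python
import re

def ecc_types(dmidecode_text):
    """Extract every 'Error Correction Type' value from dmidecode output."""
    return re.findall(r"^\s*Error Correction Type:\s*(.+?)\s*$",
                      dmidecode_text, flags=re.MULTILINE)

def has_ecc(dmidecode_text):
    """True if any memory array reports an ECC-style correction type."""
    return any("ecc" in value.lower() for value in ecc_types(dmidecode_text))

if __name__ == "__main__":
    # Illustrative, abbreviated sample; not captured from a real machine.
    sample = (
        "Physical Memory Array\n"
        "\tLocation: System Board Or Motherboard\n"
        "\tError Correction Type: None\n"
    )
    if not has_ecc(sample):
        print("warning: no ECC memory detected")
```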

"For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?"

Er, you will never "see" the bit flips on the bus. They are literally caused by a particle hitting RAM and flipping a bit. Only much later can you notice that you read out something different from what you wrote. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
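The shadow-copy idea can be modelled in a few lines. This is a toy illustration with hypothetical names (`ShadowedMemory`, `BitFlipDetected`), not something that could intercept real bus traffic; it also shows why the approach doubles the memory requirement.

```python
class BitFlipDetected(Exception):
    """Raised when a read disagrees with the shadow copy."""

class ShadowedMemory:
    def __init__(self, size):
        self.primary = bytearray(size)  # the "real" RAM
        self.shadow = bytearray(size)   # duplicate copy, same size again

    def write(self, addr, value):
        # Every intended write goes to both copies.
        self.primary[addr] = value
        self.shadow[addr] = value

    def read(self, addr):
        # A mismatch means one of the two copies changed without a write.
        value = self.primary[addr]
        if value != self.shadow[addr]:
            raise BitFlipDetected(f"mismatch at address {addr}")
        return value

if __name__ == "__main__":
    mem = ShadowedMemory(16)
    mem.write(3, 0x5A)
    mem.primary[3] ^= 0x01  # simulate a particle flipping one bit
    try:
        mem.read(3)
    except BitFlipDetected as exc:
        print("detected:", exc)
```

Note that a mismatch only says that one copy changed; it cannot tell which of the two is the correct one.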

"try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?"

The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.

Recomputing checksums only works when you are NOT writing to the memory. That might be "good enough", but you'll need to figure out which pages are not being written to.

To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that the block hasn't degraded in RAM). Only then is it safe to do the write and update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
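The verify-then-write-then-update sequence could look like the following sketch. The class and exception names are hypothetical and `zlib.crc32` is just one possible checksum; the point is that every write pays for two full passes over the block, which is where the slowdown comes from.

```python
import zlib

class CorruptionDetected(Exception):
    """Raised when a block no longer matches its recorded checksum."""

class CheckedBlock:
    def __init__(self, size):
        self.data = bytearray(size)
        self.checksum = zlib.crc32(self.data)

    def verify(self):
        # Pass 1: make sure the block has not degraded since the last write.
        if zlib.crc32(self.data) != self.checksum:
            raise CorruptionDetected("block changed without a write")

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self.data):
            raise IndexError("write outside the block")
        self.verify()
        self.data[offset:offset + len(payload)] = payload
        # Pass 2: record the checksum of the new contents.
        self.checksum = zlib.crc32(self.data)

if __name__ == "__main__":
    block = CheckedBlock(4096)
    block.write(0, b"hello")
    block.data[100] ^= 0x80  # simulate a bit flip between writes
    try:
        block.write(8, b"world")
    except CorruptionDetected as exc:
        print("detected:", exc)
```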

"I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive."

Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
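As a minimal sketch of the "run it twice and compare" approach on a single box: the helper name `run_redundantly` is made up, and it assumes that `compute` is deterministic, so any difference between runs points to a fault rather than to the program itself.

```python
class ResultMismatch(Exception):
    """Raised when repeated runs of the same computation disagree."""

def run_redundantly(compute, *args, runs=2):
    """Run compute(*args) several times and return the agreed result."""
    results = [compute(*args) for _ in range(runs)]
    if any(result != results[0] for result in results[1:]):
        raise ResultMismatch(results)
    return results[0]

if __name__ == "__main__":
    print(run_redundantly(sum, [1, 2, 3]))
```

This compares outputs only, so it reports that something went wrong without saying where. Running on two separate boxes follows the same pattern, with the comparison done after both results have been collected.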

BraveNewCurrency

The answer to the question is yes, and the proof is SoftECC, the software posted in the comments!

Just a note that SoftECC is a kernel-level solution. If a user-land app were used on top of it, that would be a third stage of redundancy, which does not seem necessary.

answered Sep 28 '22 19:09

vitorafsr

Tags: memory, linux-kernel, error-detection
