Possible to detect bit errors in memory in software?

A friend and I were curious as to whether you could detect levels of ionizing radiation by looking at rates of single-bit errors in memory. I did a little research, and it seems most errors are caught and fixed at the hardware level. Would there be any way to detect errors in software (say, in C code on a PC)?

Jack Rogers asked Aug 16 '11 23:08



2 Answers

I'm sure it depends on the architecture you're running on, but on a machine with ECC memory I'm pretty certain you won't be detecting any single-bit errors any time soon. A memory controller with ECC protection (standard on servers, though many consumer PCs ship without it) safeguards against the rare bit problems RAM chips have. DDR RAM, for example, is VERY reliable compared to crap media like flash memory, which will be spec'd to REQUIRE X number of bits of ECC protection (somewhere between 8 and 16 or so) before the manufacturer guarantees functionality. As long as the number of bit errors stays under the correctable limit, the bad bits will be corrected, and probably go unreported, before the data even reaches the CPU or software level.

Silent (Unreported) data corruption from something as simple as a single bit error is considered a huge "no-no" in the storage industry, so your memory manufacturer has probably done their darndest to prevent your application from seeing it, much less making you deal with it!

In any case, one common way to detect problems in any sort of memory is to run simple write-compare loops over the address space. Write 0's to all your memory and read them back to detect data lines stuck at '1'; write-read-compare F's to detect data lines stuck at '0'; and run a data ramp to help detect addressing problems. The width of the data ramp should match the address size (i.e. 0x00, 0x01, 0x02... or 0x0000, 0x0001, 0x0002, etc.). You can do these kinds of things with storage performance benchmarking tools like Iometer or similar, although it may be just as easy to write them yourself.
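
For what it's worth, here is a rough sketch in C of the three passes described above (all zeros, all F's, data ramp) run over a buffer you hand it. The function names are made up for illustration. Note the assumptions: on a PC under a normal OS this only exercises whatever pages your process was given, and reads may well be served from the CPU cache rather than DRAM, so treat it as a demonstration of the pattern and not a real memory tester.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the same value to every word, read it back, count mismatches.
   'volatile' keeps the compiler from optimizing the read-back away;
   it does NOT bypass the CPU cache. */
static size_t fill_and_check(volatile uint32_t *buf, size_t n, uint32_t value)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = value;
    for (size_t i = 0; i < n; i++)
        if (buf[i] != value)
            bad++;
    return bad;
}

/* Data ramp: each word holds its own index, so two addresses that
   alias the same cell show up as a mismatch on read-back. */
static size_t ramp_and_check(volatile uint32_t *buf, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint32_t)i;
    for (size_t i = 0; i < n; i++)
        if (buf[i] != (uint32_t)i)
            bad++;
    return bad;
}

/* Hypothetical entry point: runs all three passes and returns the
   total number of mismatching words (0 if nothing was detected). */
size_t memtest_buffer(volatile uint32_t *buf, size_t n)
{
    return fill_and_check(buf, n, 0x00000000u)  /* bits stuck at 1 */
         + fill_and_check(buf, n, 0xFFFFFFFFu)  /* bits stuck at 0 */
         + ramp_and_check(buf, n);              /* addressing faults */
}
```

You would call `memtest_buffer(buf, n)` on a buffer from `malloc` or a static array; any nonzero return means at least one word did not read back as written.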

Lncn answered Oct 03 '22 01:10


Realistically, unless you're going to dedicate a lot of time to the problem, you might as well quit before you start. Even if you do detect an error, chances are pretty fair it's due to something like a power problem, not ionizing radiation (and you normally won't have any way to tell which you've encountered).

If you do decide to go ahead anyway, the obvious way to test is to allocate some memory, write values to it, and read them back. You want to follow sufficiently predictable patterns that you can figure out what the expected value is without reading from other memory (at least if you want to be able to isolate the error, and not just identify that something bad has happened).
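
As a minimal sketch of that idea in C: the expected value of each word is computed from its index alone, so a scan can say which word changed and how many bits flipped without consulting a second copy in memory. The function names and the multiplier constant are arbitrary choices for illustration, not anything prescribed above; any deterministic function of the index would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Expected contents of word i, derived from the index alone
   (hypothetical pattern: index times an arbitrary odd constant). */
static uint64_t expected_word(size_t i)
{
    return (uint64_t)i * 0x9E3779B97F4A7C15ull;
}

/* Fill the buffer with the predictable pattern. */
void pattern_fill(volatile uint64_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = expected_word(i);
}

/* Portable count of set bits. */
static unsigned popcount64(uint64_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Re-read the buffer and return the number of flipped bits.
   *first_bad gets the index of the first bad word, or n if none.
   Bad words are rewritten so each flip is counted only once. */
size_t pattern_scan(volatile uint64_t *buf, size_t n, size_t *first_bad)
{
    size_t flipped = 0;
    *first_bad = n;
    for (size_t i = 0; i < n; i++) {
        uint64_t diff = buf[i] ^ expected_word(i);
        if (diff != 0) {
            if (*first_bad == n)
                *first_bad = i;
            flipped += popcount64(diff);
            buf[i] = expected_word(i);
        }
    }
    return flipped;
}
```

The intended use is to call `pattern_fill` once, then call `pattern_scan` periodically (say, once a minute) and log the count along with a timestamp.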

If you really want to differentiate between ionizing radiation and other errors, it should at least be theoretically possible. Run your test on a number of computers at different altitudes simultaneously, and see if you see a higher rate at higher altitude.

Jerry Coffin answered Oct 03 '22 01:10