
How to prevent "partial write" data corruption during power loss?

In an embedded environment (using MSP430), I have seen some data corruption caused by partial writes to non-volatile memory. This seems to be caused by power loss during a write (to either FRAM or info segments).

I am validating data stored in these locations with a CRC.

My question is, what is the correct way to prevent this "partial write" corruption? Currently, I have modified my code to write to two separate FRAM locations. So, if one write is interrupted causing an invalid CRC, the other location should remain valid. Is this a common practice? Do I need to implement this double write behavior for any non-volatile memory?

schumacher574 asked Jan 10 '14 18:01


2 Answers

A simple solution is to maintain two versions of the data (in separate pages for flash memory): the current version and the previous version. Each version has a header comprising a sequence number and a word that validates the sequence number, for example simply the 1's complement of the sequence number:

---------
|  seq  |
---------
| ~seq  |
---------
|       |
| data  |
|       |
---------

The critical thing is that when the data is written the seq and ~seq words are written last.

On start-up you read the data that has the highest valid sequence number (accounting for wrap-around perhaps - especially for short sequence words). When you write the data, you overwrite and validate the oldest block.
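As a rough illustration of the scheme above, here is a minimal sketch in C. The two "non-volatile" blocks are simulated with an ordinary array, and every name in it (slot_t, nv_slot, store, load, DATA_SIZE) is hypothetical. It also invalidates the destination header before touching the payload; that step is my assumption (a flash page erase would have the same effect) so that a stale valid header can never sit on top of half-written data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 8u   /* hypothetical payload size */

/* Layout from the diagram: seq, ~seq, then the payload. */
typedef struct {
    uint16_t seq;
    uint16_t seq_inv;
    uint8_t  data[DATA_SIZE];
} slot_t;

/* Stand-in for two non-volatile blocks. On a real target these would live
   in FRAM or in separate flash pages, and the accesses would need to be
   volatile or fenced so the compiler keeps the write order (assumption). */
static slot_t nv_slot[2];

/* A header is valid when the second word is the 1's complement of the first. */
static int slot_valid(const slot_t *s)
{
    return s->seq == (uint16_t)~s->seq_inv;
}

/* Index of the newest valid slot, or -1 if neither slot is valid. */
static int newest_slot(void)
{
    int v0 = slot_valid(&nv_slot[0]);
    int v1 = slot_valid(&nv_slot[1]);
    if (v0 && v1) {
        /* Wrap-around aware: slot 1 is newer if it is ahead by < 0x8000. */
        uint16_t d = (uint16_t)(nv_slot[1].seq - nv_slot[0].seq);
        return (d != 0u && d < 0x8000u) ? 1 : 0;
    }
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Overwrite the older (or invalid) slot; the header words go in last. */
static void store(const uint8_t *data)
{
    assert(data != NULL);
    int cur = newest_slot();
    int dst = (cur == 0) ? 1 : 0;
    uint16_t seq = (cur < 0) ? 1u : (uint16_t)(nv_slot[cur].seq + 1u);

    nv_slot[dst].seq = 0u;                 /* invalidate: 0 is not ~0 */
    nv_slot[dst].seq_inv = 0u;
    memcpy(nv_slot[dst].data, data, DATA_SIZE);
    nv_slot[dst].seq = seq;                /* header written last */
    nv_slot[dst].seq_inv = (uint16_t)~seq;
}

/* Copy the newest valid payload into out; returns -1 if none exists. */
static int load(uint8_t *out)
{
    int cur = newest_slot();
    if (cur < 0) return -1;
    memcpy(out, nv_slot[cur].data, DATA_SIZE);
    return 0;
}
```

If power is lost anywhere inside store(), the destination slot fails the seq/~seq check and load() falls back to the other slot, which was not touched.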

The solution you are already using is valid so long as the CRC is written last, but it lacks simplicity and imposes a CRC calculation overhead that may not be necessary or desirable.

On FRAM you have no concern about endurance, but it is an issue for Flash memory and EEPROM. In that case I use a write-back cache: the data is maintained in RAM, and whenever it is modified a timer is started (or restarted if it is already running). When the timer expires, the data is written. This prevents bursts of writes from thrashing the memory, and it is useful even on FRAM since it minimises the software overhead of data writes.
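The write-back cache could look roughly like the sketch below. It is only an illustration: the names (ram_cache, nv_copy, cache_update, cache_tick), the cache size, and the 100-tick delay are all made up, the non-volatile memory is simulated with a plain array, and cache_tick() is assumed to be called from some periodic timer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE        8u
#define FLUSH_DELAY_TICKS 100u   /* hypothetical delay, in timer ticks */

static uint8_t  ram_cache[CACHE_SIZE];  /* working copy in RAM */
static uint8_t  nv_copy[CACHE_SIZE];    /* stand-in for the non-volatile copy */
static uint32_t ticks_left;             /* 0 means the timer is stopped */
static unsigned nv_write_count;         /* counts simulated NVM writes */

/* Modify the RAM copy and start (or restart) the flush timer. */
void cache_update(unsigned offset, uint8_t value)
{
    assert(offset < CACHE_SIZE);
    ram_cache[offset] = value;
    ticks_left = FLUSH_DELAY_TICKS;
}

/* Call once per timer tick; writes the cache out when the timer expires. */
void cache_tick(void)
{
    if (ticks_left != 0u && --ticks_left == 0u) {
        memcpy(nv_copy, ram_cache, CACHE_SIZE);
        nv_write_count++;
    }
}
```

A burst of cache_update() calls that are closer together than FLUSH_DELAY_TICKS results in a single write once the burst has gone quiet. The trade-off is that changes still sitting in RAM are lost if power fails before the timer expires.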

Clifford answered Oct 17 '22 15:10


Our engineering team takes a two-pronged approach to this problem: solve it in hardware and software!

The first is a diode and capacitor arrangement to provide a few milliseconds of power during a brown-out. If we notice we've lost external power, we prevent the code from starting any non-volatile writes.

Second, our data is particularly critical for operation and updates often, and we don't want to wear out our non-volatile flash storage (it only supports so many writes). So we actually store the data 16 times in flash and protect each record with a CRC. On boot, we find the newest valid write and then start our erase/write cycles.
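One possible reading of the 16-record scheme is a rotating log, sketched below. This is not our production code. The record layout, the sequence counter used to decide which record is newest, the CRC-16 variant (CCITT-FALSE, polynomial 0x1021, initial value 0xFFFF), and all names are assumptions made for illustration, and the flash is simulated with an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16u
#define PAYLOAD   8u    /* hypothetical payload size */

typedef struct {
    uint32_t seq;               /* assumed counter; wrap-around ignored here */
    uint8_t  payload[PAYLOAD];
    uint16_t crc;               /* covers seq and payload */
} record_t;

static record_t nv_log[NUM_SLOTS];   /* stand-in for 16 flash records */

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc ^= (uint16_t)(*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static int record_valid(const record_t *r)
{
    return r->crc == crc16((const uint8_t *)r, offsetof(record_t, crc));
}

/* Index of the valid record with the highest seq, or -1 if none is valid. */
static int find_newest(void)
{
    int best = -1;
    for (unsigned i = 0; i < NUM_SLOTS; i++)
        if (record_valid(&nv_log[i]) &&
            (best < 0 || nv_log[i].seq > nv_log[best].seq))
            best = (int)i;
    return best;
}

/* Write the next record into the slot after the newest valid one. */
static void log_write(const uint8_t *payload)
{
    assert(payload != NULL);
    int cur = find_newest();
    unsigned dst = (cur < 0) ? 0u : ((unsigned)cur + 1u) % NUM_SLOTS;
    record_t r;
    memset(&r, 0, sizeof r);
    r.seq = (cur < 0) ? 1u : nv_log[cur].seq + 1u;
    memcpy(r.payload, payload, PAYLOAD);
    r.crc = crc16((const uint8_t *)&r, offsetof(record_t, crc));
    nv_log[dst] = r;   /* on real flash: erase the slot, then program it */
}
```

Rotating through the slots spreads the erase/write cycles across 16 locations, and a record that was interrupted mid-write fails its CRC, so find_newest() falls back to the previous good one.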

We've never seen data corruption since implementing our frankly paranoid system.

Update:

I should note that our flash is external to our CPU, so the CRC helps validate the data if there is a communication glitch between the CPU and flash chip. Furthermore, if we experience several glitches in a row, the multiple writes protect against data loss.

DrRobotNinja answered Oct 17 '22 17:10