 

What causes Silent Data Corruption on HDDs?

Some landmark studies a number of years ago showed that silent corruption in large data sets is far more widespread than previously anticipated (and today I suppose you'd say it's more widespread than commonly realized).

Assume that the application and OS wrote a sector and everything had time to flush, with no crash, abnormal shutdown, or software bug that would cause wrong data to be saved.

Later, a sector is read back in, and there is no read error from the HDD. But it contains the wrong data.

Since HDD data encoding includes error correction codes, I would assume that any mysterious state change to a bit would generally be noticed by the checking. Even if the check is not very strong and some errors slip through, there would still be vastly more detected errors telling you something is wrong with the drive. But that doesn't happen: apparently, data is found to be wrong with no symptoms.

How can that happen?

My experience on a desktop PC is that sometimes files that were once good are later found to be bad, but perhaps that is due to unnoticed problems during writing, either in relocating sectors or in the file system's tracking of the data. The point is, errors may be introduced at write time, with data corrupted inside the HDD (or RAID hardware) so that the wrong data is written with error correction codes that match it. If that is the (only) cause, then a single verify should be enough to show that the data was written correctly.

Or does data go bad after it has been seen to be OK on the disk? That is, verify once and all is fine; verify later and an error is found, even though that sector has not been written in the interim. I think this is what is meant, since write-time errors would be easy to deal with through improved flush-and-verify checking.

So how can that happen without tripping the error correction codes that go with the data?

asked Feb 10 '23 by JDługosz

2 Answers

Some ways silent data corruption could happen:

  • Corruption in memory before the data is written (in this case even filesystem level checksums will not help you if the checksum is calculated after the corruption)
  • Errors in the SATA cables that by chance match the checksum
  • Bit flips in disk drive cache memory (not sure if those are checksummed, probably depends on make & model)
  • Bug in drive firmware that corrupts the data before writing (with the checksum matching the corrupted data)
  • Corruption of the block on the disk platter that by chance matches the checksum
  • A read that returns corrupted data to the drive controller that by chance matches the checksum
  • Bugs in firmware that corrupt the data after verifying the checksum
  • Corruption in main memory after the data has been moved there
  • Bugs in the software that processes the data (although this is usually not considered part of this list, but classified as an ordinary software bug)

Corruption that by chance matches its error code is by itself very unlikely, but the large amount of data and the birthday paradox ensure that it does happen. Today's drives have internal read errors all the time and rely heavily on checksums to catch them. When that happens they simply re-read the sector until they get a good read, and if a sector becomes too bad they silently swap it for a spare sector. SATA controllers probably also silently re-send data if a checksum error occurs while data is transferred over the SATA cable.
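
To put rough numbers on the "large amount of data" effect, here is a back-of-the-envelope sketch in Python; the checksum widths and the event count are illustrative assumptions, not figures from any drive spec:

    # If each corruption event slips past a k-bit check with probability 2**-k,
    # how likely is it that at least one of N independent events goes undetected?
    import math

    def p_any_undetected(events: int, check_bits: int) -> float:
        p_escape = 2.0 ** -check_bits  # one corrupted block still matching its check
        # P(at least one escape) = 1 - (1 - p)**N, computed stably via log1p/expm1
        return -math.expm1(events * math.log1p(-p_escape))

    N = 10 ** 12  # purely illustrative number of corrupted-then-checked blocks
    for bits in (16, 32, 64):
        print(f"{bits}-bit check, N={N:.0e}: P(any silent escape) ≈ "
              f"{p_any_undetected(N, bits):.3g}")

The same numbers also illustrate the next point: going from a 32-bit to a 64-bit check drops the aggregate escape probability from near-certainty to negligible, at the cost of more overhead.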

The chance of a random corruption still matching the checksum can be made arbitrarily small by using a longer checksum, but that involves more storage and processing overhead. And in the case of standardised protocols such as SATA you can't just change the checksum size without breaking compatibility. And no protocol or disk level checksumming will save you from firmware bugs, or other software bugs for that matter.

The big advantage of filesystem level checksums like in ZFS/Btrfs is that they can catch all of these errors except main memory corruption (use ECC memory to protect against that) and software bugs. And they can use a larger checksum block size than a single disk block, to reduce the storage overhead of longer checksums.
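
As a concrete illustration of that end-to-end idea, here is a minimal userspace sketch in Python that does what ZFS/Btrfs do per block, just at file granularity; the sidecar ".sha256" naming is an arbitrary choice for the example:

    # Record a checksum when the data is written, recompute it on every later
    # read/scrub, and treat any mismatch as (silent) corruption.
    import hashlib
    from pathlib import Path

    def file_digest(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                h.update(chunk)
        return h.hexdigest()

    def record_checksum(path: Path) -> None:
        Path(str(path) + ".sha256").write_text(file_digest(path))

    def verify_checksum(path: Path) -> bool:
        expected = Path(str(path) + ".sha256").read_text().strip()
        return file_digest(path) == expected

    # record_checksum(Path("archive.tar")) right after writing the file,
    # verify_checksum(Path("archive.tar")) during a later scrub.

Note that, just like the filesystem-level checksums, this only detects corruption that happens after the digest was computed; it cannot catch data that was already corrupted in memory.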

answered Mar 20 '23 by JanKanis


See http://en.wikipedia.org/wiki/Silent_data_corruption#Silent_data_corruption, which provides ample explanation. I would also like to mention the birthday paradox, which explains why the probability of an error is higher than intuitively expected. See http://en.wikipedia.org/wiki/Birthday_paradox.
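
To see how quickly the birthday effect kicks in, here is a small Python sketch using the standard approximation P ≈ 1 - exp(-n(n-1)/2m) for n random values drawn from m possibilities; the numbers below are only illustrative:

    import math

    def birthday_collision_prob(n: int, bits: int) -> float:
        # Probability that at least two of n uniformly random `bits`-bit
        # values collide: 1 - exp(-n*(n-1) / (2 * 2**bits))
        return -math.expm1(-n * (n - 1) / (2 * 2.0 ** bits))

    # Around 77,000 random 32-bit values already make a collision a coin flip:
    print(birthday_collision_prob(77_000, 32))  # ≈ 0.5
    print(birthday_collision_prob(1_000, 32))   # still small for small n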

Upon writing a sector, a CRC is calculated and written to disk along with the data. Upon reading, the data is read along with the CRC; the CRC is recalculated from the data just read and then compared with the CRC read from the disk.
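
In code form, the cycle looks roughly like this; the sketch uses CRC-32 from zlib purely for illustration, real drives use much stronger ECC:

    import zlib

    def write_sector(data: bytes) -> tuple[bytes, int]:
        """On write: store the data together with a CRC computed from it."""
        return data, zlib.crc32(data)

    def read_sector(stored: tuple[bytes, int]) -> bytes:
        """On read: recompute the CRC and compare it with the stored one."""
        data, stored_crc = stored
        if zlib.crc32(data) != stored_crc:
            raise IOError("CRC mismatch: read error reported to the host")
        # The check passed -- but a corruption that by chance still matches the
        # stored CRC would also pass, and that is the silent case.
        return data

    sector = write_sector(b"some user data" * 32)
    assert read_sector(sector) == b"some user data" * 32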

What could go wrong at the disk level but would be detected:

  • One or more data bits did not get written correctly.
  • One or more CRC bits did not get written properly to disk.
  • Both were correctly written to disk but damaged later on.
  • Both were written correctly but the controller went bad or is buggy.

What could go wrong on the disk but would go undetected (silent error):

  • Data or CRC is corrupted, either because it was badly written to disk or upon reading due to a defective sector, yet (although with low probability) the calculated CRC still matches the CRC read from the device. That's where the birthday paradox comes into play.

Past the disk:

  • Data is read correctly from the disk by the controller but is incorrectly transmitted to memory through the SATA cable. I assume SATA has some type of error correction, but again, the same reasoning applies at this stage (see the sketch below).
  • The data made it from the disk through the controller and the SATA cable, but a memory bit got inverted.
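
Here is a toy model of that retry behaviour in Python; the bit-flipping "cable" and the retry budget are made up for illustration, but it shows why only errors that pass the check ever reach the host:

    import random
    import zlib

    def flaky_transfer(data: bytes, error_rate: float = 0.3) -> bytes:
        """Pretend transfer that occasionally flips one bit (illustrative only)."""
        if random.random() < error_rate:
            buf = bytearray(data)
            buf[random.randrange(len(buf))] ^= 1 << random.randrange(8)
            return bytes(buf)
        return data

    def transfer_with_retries(data: bytes, max_retries: int = 8) -> bytes:
        crc = zlib.crc32(data)
        for _ in range(max_retries):
            received = flaky_transfer(data)
            if zlib.crc32(received) == crc:  # detected errors just trigger a retransmit
                return received
        raise IOError("persistent transfer errors reported to the host")

    print(transfer_with_retries(b"x" * 512) == b"x" * 512)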

answered Mar 20 '23 by Tarik