
Wrong bytes are sometimes written to disk. Hardware problems?

I have written a UDP-based transfer protocol using C++11 (VS2013). It's blazing fast - and works great 99.9% of the time.


But I have observed a few times that the wrong bytes are written to disk (Samsung 250 GB SSD 850 EVO) - or at least it seems so.

Here's basically what sometimes happens when I transfer a 6 GB test file:

  1. The file is split up into smaller UDP datapackages - 64K in size. (The network layer disassembles and reassembles the UDP datagrams into a larger package.)
  2. Client sends the datapackage (UDP) to the server - the payload is encrypted using AES256 (OpenSSL) and contains data + metadata. The payload also contains a SHA256 hash of the entire payload - as an extra integrity check on top of the UDP checksum.
  3. Server receives the datapackage, sends an "ACK" package back to the Client and then calculates the SHA256 hash. The hash is identical to the Client's hash - all is good (see the sketch after this list).
  4. Server then writes the data of the package to disk (using fwrite instead of streams due to the huge performance difference). The server only processes one package at a time - and each filepointer has a mutex guard which protects it from being closed by another worker thread that closes filepointers that have been inactive for 10 secs.
  5. Client receives UDP "ACK" packages and re-sends packages that have not been ACKed (meaning they didn't make it). The rate of incoming ACK packages controls the sending speed of the client (aka congestion control/throttling). The order of packages received on the server does not matter, since each package contains a Position value (where in the file the data should be written).
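
A minimal sketch of the per-package integrity check from steps 2-3, assuming - purely as an illustration - that the 32-byte SHA256 digest is appended to the end of the decrypted payload (the actual payload layout isn't shown here):

#include <openssl/sha.h> //SHA256(), SHA256_DIGEST_LENGTH
#include <cstring>       //std::memcmp

//Hypothetical layout: [data + metadata][32-byte SHA256 of everything before it]
bool VerifyPayloadHash(const unsigned char* payload, size_t payloadLength)
{
    if (payloadLength < SHA256_DIGEST_LENGTH)
        return false;

    const size_t hashedLength = payloadLength - SHA256_DIGEST_LENGTH;
    const unsigned char* storedHash = payload + hashedLength;

    //Recompute the digest over the hashed portion of the payload
    unsigned char computedHash[SHA256_DIGEST_LENGTH];
    SHA256(payload, hashedLength, computedHash);

    //Identical digests => the payload survived transport intact
    return std::memcmp(storedHash, computedHash, SHA256_DIGEST_LENGTH) == 0;
}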

After the entire file is transferred, I do a full SHA256 hash of the 6 GB file on both the server and the client - but to my terror I have observed twice in the last few days that the hashes are NOT the same (across roughly 20 test transfers).
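
For reference, the full-file hash can be computed incrementally, so the 6 GB file never has to be loaded into memory at once. A minimal sketch using OpenSSL's streaming SHA256_* API - the wide-char open and the 1 MB chunk size are just assumptions to match the rest of the code:

#include <openssl/sha.h> //SHA256_Init/Update/Final, SHA256_DIGEST_LENGTH
#include <cstdio>
#include <vector>

//Hashes an arbitrarily large file in chunks and writes the 32-byte digest to 'digest'.
bool HashFile(const wchar_t* path, unsigned char digest[SHA256_DIGEST_LENGTH])
{
    FILE* file = nullptr;
    if (_wfopen_s(&file, path, L"rb") != 0 || file == nullptr)
        return false;

    SHA256_CTX ctx;
    SHA256_Init(&ctx);

    std::vector<unsigned char> buffer(1024 * 1024); //1 MB read chunks (arbitrary)
    size_t bytesRead = 0;
    while ((bytesRead = fread(buffer.data(), 1, buffer.size(), file)) > 0)
        SHA256_Update(&ctx, buffer.data(), bytesRead);

    SHA256_Final(digest, &ctx);
    fclose(file);
    return true;
}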

After comparing the files in Beyond Compare, I usually find that there are one or two bits (in a 6 GB file) that are wrong on the server side.

(Screenshot: Beyond Compare diff of the two files.)

Server code - invoked after DataPackage hash has been verified

#include <cstdio>                       //_wfopen_s, fsetpos, fwrite, fflush
#include <mutex>                        //std::mutex
#include <boost/thread/lock_guard.hpp>  //boost::lock_guard

//filePointer, filePointerMutex, AbsoluteFilePathAndName and U() are
//class members / helpers defined elsewhere in the server code.
void WriteToFile(long long position, unsigned char * data, int lengthOfData){

    boost::lock_guard<std::mutex> guard(filePointerMutex);

    //Open if required
    if (filePointer == nullptr){
        _wfopen_s(&filePointer, (U("\\\\?\\") + AbsoluteFilePathAndName).c_str(), L"wb");
    }

    //Seek
    fsetpos(filePointer, &position);

    //Write - not checking the result of the fwrite operation - should I?
    fwrite(data, sizeof(unsigned char), lengthOfData, filePointer);

    //Flush
    fflush(filePointer);

    //A separate worker thread is closing all stale filehandles 
    //(and setting filePointer to NULLPTR). This isn't invoked until 10 secs
    //after the file has been transferred anyways - so shouldn't matter
}
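
Regarding the comment above about checking fwrite: the seek/write/flush results can be checked cheaply. A sketch of a checked variant of that sequence - the names mirror the snippet above, but this is not the actual server code:

#include <cstdio>

//Returns true only if the seek, the full-length write and the flush all succeed.
bool WriteChunkChecked(FILE* file, long long position, const unsigned char* data, int lengthOfData)
{
    fpos_t pos = position; //fpos_t is a 64-bit integer on MSVC, mirroring the original fsetpos call

    if (fsetpos(file, &pos) != 0)
        return false; //seek failed

    size_t written = fwrite(data, sizeof(unsigned char), lengthOfData, file);
    if (written != static_cast<size_t>(lengthOfData))
        return false; //short write - fewer elements written than requested

    return fflush(file) == 0; //flush to the OS buffers succeeded
}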

So to sum up:

  • The char * was correct in memory on the server - otherwise the server's SHA256 hash would have failed, right? (A hash collision with SHA256 is extremely unlikely.)
  • The corruption seems to happen when writing to disk. About 95,000 of these 64K packages are written to disk when sending a 6 GB file, and it only happens once or twice (when it happens at all) - so it is a rare phenomenon.

How can this happen? Is my hardware (bad RAM/disk) to blame for this?

Do I actually need to read from disk after writing, and do e.g. a memcmp, in order to be 100% sure that the correct bytes are written to disk? (Oh boy - what a performance hit that will be...)

asked Aug 18 '16 by Njål Arne Gjermundshaug

1 Answer

On my local PC it turned out to be a RAM issue - found out by running memtest86.

Nevertheless, I modified the code for our software that runs on our production servers, making it read from disk to verify that the correct bytes were in fact written. These servers write about 10 TB to disk every day - and after a week of running the new code, the error happened once. The software corrects this by writing and verifying again - but it's still interesting to see that it actually happened.
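
A rough sketch of what such a write-then-verify loop could look like - the handles, chunk size and retry policy here are illustrative assumptions, not the production code:

#include <cstdio>
#include <cstring>
#include <vector>

//Writes a chunk at 'position', reads the same range back and retries a few
//times if the bytes on disk do not match the in-memory buffer.
bool WriteAndVerify(FILE* writeHandle, FILE* readHandle,
                    long long position, const unsigned char* data, int length,
                    int maxAttempts = 3)
{
    std::vector<unsigned char> readBack(static_cast<size_t>(length));
    fpos_t pos = position; //64-bit integer on MSVC

    for (int attempt = 0; attempt < maxAttempts; ++attempt)
    {
        //Write and flush the chunk at its position in the file
        fsetpos(writeHandle, &pos);
        fwrite(data, 1, length, writeHandle);
        fflush(writeHandle);

        //Read the same range back and compare against what was sent
        fsetpos(readHandle, &pos);
        size_t bytesRead = fread(readBack.data(), 1, length, readHandle);
        if (bytesRead == static_cast<size_t>(length) &&
            std::memcmp(readBack.data(), data, length) == 0)
            return true; //bytes on disk match
    }
    return false; //persistent mismatch - a strong hint of a hardware problem
}

Note that fflush only pushes the data to the C runtime/OS buffers, so an immediate read-back may well be served from the OS cache rather than the physical flash - this mainly catches corruption introduced before or while the data sits in RAM.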

1 bit out of 560,000,000,000,000 bits was written to disk incorrectly (a week at roughly 10 TB/day is about 70 TB, i.e. 5.6 × 10^14 bits). Amazing.

I will probably run memtest86 on this server later to see if this is also a RAM issue - but I'm not really super concerned about this anymore since file integrity is more or less ensured, and the servers are showing no signs of hardware problems otherwise.

So - if file integrity is extremely important to you (like it is for us) - then don't trust your hardware 100% and validate reading/writing operations. Anomalies might be an early sign of HW problems.

answered Sep 28 '22 by Njål Arne Gjermundshaug