I have written a UDP-based transfer protocol in C++11 (VS2013). It's blazing fast - and works great 99.9% of the time.
But I have observed a few times that the wrong bytes are written to disk (Samsung 250 GB SSD 850 EVO) - or at least it seems so.
Here's basically what sometimes happens when I transfer a 6 GB test file:
After the entire file is transferred, I compute a full SHA256 hash of the 6 GB file on both the server and the client. To my terror, I have observed twice in the last few days (over roughly 20 test transfers) that the hashes do NOT match.
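The hash itself can be computed in a streaming fashion so the 6 GB file never has to fit in memory. In sketch form, using OpenSSL's EVP API (illustrative only, not my exact code - Sha256OfFile is just a name I'm using here):

#include <openssl/evp.h>
#include <cstdio>
#include <string>
#include <vector>

//Sketch: stream the file through SHA256 in 1 MiB chunks
std::string Sha256OfFile(const char * path){
    std::vector<unsigned char> buffer(1 << 20); //1 MiB read chunks
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digestLen = 0;
    FILE * f = nullptr;
    if (fopen_s(&f, path, "rb") != 0) return "";
    EVP_MD_CTX * ctx = EVP_MD_CTX_create(); //EVP_MD_CTX_new() on OpenSSL >= 1.1
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);
    size_t n;
    while ((n = fread(buffer.data(), 1, buffer.size(), f)) > 0){
        EVP_DigestUpdate(ctx, buffer.data(), n);
    }
    EVP_DigestFinal_ex(ctx, digest, &digestLen);
    EVP_MD_CTX_destroy(ctx);
    fclose(f);
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < digestLen; ++i){
        out += hex[digest[i] >> 4];
        out += hex[digest[i] & 0xF];
    }
    return out;
}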
Comparing the files in Beyond Compare, I usually find that one or two bits (in a 6 GB file) are wrong on the server side.
(Screenshot of the Beyond Compare diff omitted.)
Server code - invoked after the DataPackage hash has been verified:
void WriteToFile(long long position, unsigned char * data, int lengthOfData){
    //std::lock_guard (not boost::lock_guard) to match the std::mutex
    std::lock_guard<std::mutex> guard(filePointerMutex);
    //Open if required - the \\?\ prefix enables long paths on Windows
    if (filePointer == nullptr){
        _wfopen_s(&filePointer, (U("\\\\?\\") + AbsoluteFilePathAndName).c_str(), L"wb");
    }
    //Seek - _fseeki64 takes the 64-bit offset directly (fsetpos expects an fpos_t*)
    _fseeki64(filePointer, position, SEEK_SET);
    //Write - not checking the result of the fwrite operation - should I?
    fwrite(data, sizeof(unsigned char), lengthOfData, filePointer);
    //Flush
    fflush(filePointer);
    //A separate worker thread closes all stale file handles
    //(and sets filePointer to nullptr). It isn't invoked until 10 secs
    //after the file has been transferred anyway - so it shouldn't matter
}
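(Regarding the inline comment: checking the results would cost essentially nothing and would surface short writes and CRT-level I/O errors - though it would not catch a bit flipped in RAM. A sketch of the same write path with the results checked; WriteChecked is just an illustrative name:)

#include <cstdio>

//Sketch: same write path, but every CRT call's result is checked
bool WriteChecked(FILE * fp, long long position, const unsigned char * data, int length){
    if (_fseeki64(fp, position, SEEK_SET) != 0)
        return false;                            //seek failed
    if (fwrite(data, 1, static_cast<size_t>(length), fp)
            != static_cast<size_t>(length))
        return false;                            //short write (disk full, I/O error, ...)
    return fflush(fp) == 0;                      //flush the CRT buffer to the OS
}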
So to sum up:
How can this happen? Is my hardware (bad RAM/disk) to blame for this?
Do I actually need to read back from disk after writing and do e.g. a memcmp in order to be 100% sure that the correct bytes were written to disk? (Oh boy - what a performance hit that will be...)
Update: on my local PC, it turned out to be a RAM issue. I found out by running memtest86.
Nevertheless, I modified the code of our software that runs on our production servers, making it read back from disk to verify that the correct bytes were in fact written. These servers write about 10 TB to disk every day, and after a week of running the new code the error happened once. The software corrects this by writing and verifying again - but it's still interesting to see that it actually happened.
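In sketch form, the verify-after-write idea looks something like this (illustrative only, not our actual production code; WriteAndVerify is just a name, and the file must be opened in a read/write mode such as L"wb+" for the read-back to work):

#include <cstdio>
#include <cstring>
#include <vector>

//Sketch: write the block, read it back, and retry on mismatch
bool WriteAndVerify(FILE * fp, long long position,
                    const unsigned char * data, int length, int maxAttempts = 3){
    std::vector<unsigned char> readBack(length);
    for (int attempt = 0; attempt < maxAttempts; ++attempt){
        _fseeki64(fp, position, SEEK_SET);
        fwrite(data, 1, static_cast<size_t>(length), fp);
        fflush(fp);                              //push the CRT buffer to the OS

        _fseeki64(fp, position, SEEK_SET);
        if (fread(readBack.data(), 1, static_cast<size_t>(length), fp)
                == static_cast<size_t>(length)
            && memcmp(readBack.data(), data, static_cast<size_t>(length)) == 0)
            return true;                         //bytes read back match
    }
    return false;                                //persistent mismatch - escalate
}

One caveat: fflush only moves data from the CRT buffer to the OS, so the read-back may well be served from the OS page cache rather than the physical disk - but it still catches corruption introduced in RAM on the way down, which is exactly the failure mode memtest86 revealed on my local PC.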
1 bit out of 560,000,000,000,000 bits (10 TB/day × 7 days × 8 bits/byte) was written wrong to disk. Amazing.
I will probably run memtest86 on this server later to see whether this is also a RAM issue - but I'm not really concerned about it anymore, since file integrity is now more or less ensured and the servers are showing no other signs of hardware problems.
So - if file integrity is extremely important to you (like it is for us), don't trust your hardware 100%, and validate your read/write operations. Anomalies might be an early sign of HW problems.