Do we need to verify checksums after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS are not corrupted after they are copied. Is checking the checksum necessary?
I read that the client computes a checksum before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
A checksum is an indicator (usually in the form of a short string of letters and numbers) that lets you verify whether the original data has been modified during storage or transmission.
HDFS computes a checksum for each data block and stores the checksums in a separate hidden file in the same HDFS namespace.
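Since you are already going through WebHDFS, you can also ask for a file's checksum over the same REST interface with the GETFILECHECKSUM operation. A minimal sketch, assuming a Hadoop 3.x cluster where the NameNode HTTP port is 9870 (use 50070 on Hadoop 2.x); the host, path, and user name below are placeholders:

curl -L "http://namenode.example.com:9870/webhdfs/v1/user/hdfs/file1?op=GETFILECHECKSUM&user.name=hdfs"

The request is redirected to a DataNode, and the JSON response contains a FileChecksum object with the algorithm name (for example MD5-of-0MD5-of-512CRC32C) and the checksum bytes, which you can compare against a checksum computed the same way on the source side.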
I wrote a library with which you can calculate the checksum of a local file the same way Hadoop does it for HDFS files.
You can then compare the two checksums to cross-check: https://github.com/srch07/HDFSChecksumForLocalfile
The checksum of a file can be calculated using the hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the utility above.
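If both paths are reachable from the machine running the command, a small shell wrapper can do the comparison for you. A minimal sketch, reusing the two example paths above and assuming that the checksum value is the third whitespace-separated field of the -checksum output and that your Hadoop version can compute a comparable checksum for file:// paths:

SRC=file:///path/in/linux/file1
DST=hdfs://nn1.example.com/file1

# The third field of "hadoop fs -checksum" output is assumed to be the checksum value
src_sum=$(hadoop fs -checksum "$SRC" | awk '{print $3}')
dst_sum=$(hadoop fs -checksum "$DST" | awk '{print $3}')

if [ "$src_sum" = "$dst_sum" ]; then
    echo "OK: checksums match"
else
    echo "WARNING: checksums differ"
fi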