I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS I get:
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C **000002000000000000000000755ca25bd89d1a2d64990a68dedb5514**
On the local file system, I get:
[m@x01tbipapp3a ~]$ cksum file.txt
**3802590149 26276247** file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
**c1aae0db584d72402d5bcf5cbc29134c** file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matches the cksum output, but it does not.
Is there a way to compare the two checksums using any algorithm?
Thanks.
A checksum is a small value derived from a block of digital data for the purpose of detecting errors. HDFS computes checksums for each data block and stores them in a separate hidden metadata file kept alongside the block on the DataNode, not inside the file's own contents.
To get the MD5 sum of a local file, open a terminal window, type md5sum followed by the path to the file (including its extension), and hit the Enter key. NOTE: You can also drag the file to the terminal window instead of typing the full path. The MD5 sum of the file is printed, followed by the file name.
HDFS fsck is used to check the health of the file system and to find missing files and over-replicated, under-replicated, and corrupted blocks.
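This is not a solution, but a workaround that can be used.
Local file checksum:
cksum test.txt
HDFS checksum (copy the file contents out of HDFS to a temporary local file, then run cksum on that):
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can then compare the two cksum results. If the CRC and the byte count are the same, the contents match.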
Hope it helps.
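The same idea can be sketched without a temporary file by piping the HDFS contents straight into cksum. This is only a minimal sketch: `same_as_stdin` is a hypothetical helper name, not part of Hadoop, and it assumes cksum prints just the CRC and byte count when it reads from standard input. Note that it still streams the whole file out of HDFS.

```shell
# same_as_stdin (hypothetical helper): succeed if the data on standard input
# has the same cksum CRC and byte count as the local file given as $1.
same_as_stdin() {
    # Reading via redirection keeps the file name out of cksum's output,
    # so both results have the form "<crc> <bytes>".
    local_sum=$(cksum < "$1") || return 2
    stream_sum=$(cksum) || return 2
    [ "$local_sum" = "$stream_sum" ]
}

# Intended usage, with the example paths from the answer above:
#   hadoop fs -cat /user/test/test.txt | same_as_stdin test.txt && echo "match"
```

The function returns 0 on a match and a non-zero status otherwise, so it can be used directly in an if statement inside a verification script.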
I was also confused because the MD5 was not matching. It turned out the Hadoop checksum is not a simple MD5, it's an MD5 of MD5 of CRC32C, so it cannot be compared directly with md5sum or cksum output :-)
see this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg@mail.gmail.com%3E
and this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%[email protected]%3E
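To see why converting the HDFS value from hex to decimal does not help, it can be split into its parts. The sketch below assumes the 56 hex digits are laid out as a 4-byte bytes-per-CRC value, an 8-byte CRCs-per-block value, and a 16-byte MD5 digest; that assumption is consistent with the algorithm name MD5-of-0MD5-of-512CRC32C printed in the question. `parse_hdfs_checksum` is a hypothetical helper name.

```shell
# parse_hdfs_checksum (hypothetical helper): split the hex string printed by
# `hadoop fs -checksum` into its assumed fields.
parse_hdfs_checksum() {
    hex=$1
    if [ "${#hex}" -ne 56 ]; then
        echo "expected 56 hex digits" >&2
        return 1
    fi
    # Assumed layout: 8 hex digits, then 16 hex digits, then 32 hex digits.
    bytes_per_crc_hex=$(printf '%s' "$hex" | cut -c1-8)
    crc_per_block_hex=$(printf '%s' "$hex" | cut -c9-24)
    md5_hex=$(printf '%s' "$hex" | cut -c25-56)
    echo "bytesPerCRC=$((0x$bytes_per_crc_hex))"
    echo "crcPerBlock=$((0x$crc_per_block_hex))"
    echo "md5=$md5_hex"
}

# Value taken from the question:
parse_hdfs_checksum 000002000000000000000000755ca25bd89d1a2d64990a68dedb5514
```

For the value in the question this gives bytesPerCRC=512 and crcPerBlock=0, matching the numbers in the algorithm name. The remaining digest is an MD5 over CRC data rather than over the file bytes, which is why it differs from the md5sum output.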