
Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?

I would like to make sure the files on HDFS are not corrupted after they are copied. But is checking the checksum necessary?

I read that the client computes a checksum before data is written to HDFS.

Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
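
For reference, I know WebHDFS exposes an op=GETFILECHECKSUM call that returns the checksum HDFS keeps for a file. Something like the rough Java sketch below (host, port and path are just placeholders) is what I had in mind, but I don't know what to compare its output against on the Linux side:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsChecksum {
    public static void main(String[] args) throws Exception {
        // The namenode answers GETFILECHECKSUM with a redirect to a datanode;
        // HttpURLConnection follows the redirect automatically.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/data/file1?op=GETFILECHECKSUM");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Prints JSON like {"FileChecksum":{"algorithm":"MD5-of-0MD5-of-512CRC32C",...}}
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}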

asked Aug 10 '15 by chhaya vishwakarma

People also ask

What is checksum verification?

A checksum is an indicator (usually in the form of a short string of letters and numbers) that enables you to verify whether the original data has been modified during storage or transmission.

Where is checksum stored in HDFS?

HDFS computes a checksum for each data block and stores the checksums in a separate hidden file in the same HDFS namespace.


2 Answers

I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.

So, you can compare the checksums to cross-check: https://github.com/srch07/HDFSChecksumForLocalfile
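
For the HDFS side of that comparison, the value to check against is the one HDFS itself reports. Below is a minimal sketch using the standard FileSystem API (the namenode URI and path are placeholders); the printed value is what you would compare with the library's result for the local file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSideChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nn1.example.com:8020"); // placeholder namenode

        try (FileSystem fs = FileSystem.get(conf)) {
            // HDFS reports an MD5-of-MD5-of-CRC style checksum derived from the per-block checksums
            FileChecksum checksum = fs.getFileChecksum(new Path("/data/file1")); // placeholder path
            System.out.println(checksum.getAlgorithmName());
            System.out.println(checksum);
        }
    }
}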

answered Sep 22 '22 by Abhishek Anand

The checksum for a file can be calculated using the hadoop fs command.

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1

Refer to the Hadoop documentation for more details.

So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
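
If your Hadoop version does not report a usable checksum for the file:// path (some releases return nothing for local files), a brute-force alternative is to read both copies back and compare plain MD5 digests. A rough sketch, with placeholder URIs:

import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareByMd5 {
    // Streams a file through the given FileSystem and returns its MD5 digest as hex
    static String md5(FileSystem fs, Path path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        try (InputStream in = fs.open(path)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder URIs: the local source file and the ingested HDFS copy
        Path local = new Path("file:///path/in/linux/file1");
        Path remote = new Path("hdfs://nn1.example.com:8020/path/in/hdfs/file1");

        String localMd5 = md5(local.getFileSystem(conf), local);
        String remoteMd5 = md5(remote.getFileSystem(conf), remote);

        System.out.println(localMd5.equals(remoteMd5) ? "MATCH" : "MISMATCH");
    }
}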

answered Sep 26 '22 by Karthik