I am trying to check the consistency of a file after copying it to HDFS using the Hadoop API - DFSClient.getFileChecksum().
I am getting the following output for the code below:
Null
HDFS : null
Local : null
Can anyone point out the error or mistake? Here is the code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");
        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");
        // System.out.println("HDFS PATH : " + hdfsPath.getName());
        // System.out.println("Local PATH : " + localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

        if (null != hdfsChecksum || null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors. HDFS computes checksums for each data block and stores them in a separate hidden file in the same HDFS namespace.
A checksum (also sometimes referred to as a hash) is an alphanumeric value, typically rendered as a hexadecimal string, that for practical purposes uniquely identifies the contents of a file. Checksums are often used to verify the integrity of files downloaded from an external source, such as an installation file.
To produce a checksum, you run a program that puts the file through an algorithm. Typical algorithms used for this include MD5, SHA-1, SHA-256, and SHA-512. These algorithms use a cryptographic hash function that takes an input and generates a fixed-length alphanumeric string regardless of the size of the file.
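For illustration, here is a minimal standalone Java sketch of that process (it has nothing to do with the Hadoop API; the local path is just borrowed from the question), streaming a file through java.security.MessageDigest to produce a SHA-256 checksum:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LocalChecksumSketch {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        // SHA-256 is one of the algorithms mentioned above; "MD5" or "SHA-1" work the same way.
        MessageDigest digest = MessageDigest.getInstance("SHA-256");

        // Stream the file through the digest in chunks so large files fit in memory.
        try (InputStream in = Files.newInputStream(Paths.get("/home/ubuntu/derby.log"))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }

        // Render the fixed-length digest as a hex string, regardless of the file's size.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("SHA-256 : " + hex);
    }
}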
Since you aren't setting a remote address on the conf and are essentially using the same default configuration for both, hadoopFS and localFS both point to an instance of LocalFileSystem.

getFileChecksum isn't implemented for LocalFileSystem and returns null. It does work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, which returns an MD5 of MD5s of CRC32 checksums of chunks of size bytes.per.checksum. That value depends on the block size and on the cluster-wide bytes.per.checksum setting, which is why these two parameters are also encoded in the returned checksum's algorithm name: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
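As a minimal sketch of that setup (the NameNode host and port are placeholders for your cluster, and fs.defaultFS is the modern name of the address setting; older releases use fs.default.name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point the client at your NameNode so that
        // FileSystem.get(conf) returns a DistributedFileSystem, not a LocalFileSystem.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));

        if (checksum != null) {
            // toString() includes the algorithm name described above,
            // e.g. MD5-of-xxxMD5-of-yyyCRC32, followed by the digest.
            System.out.println("HDFS Checksum : " + checksum + "\t" + checksum.getLength());
        }
    }
}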
getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft MapReduce jobs that calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file is written to or read from Hadoop.
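A minimal sketch of relying on those built-in checks (again with a placeholder cluster address): the client ships CRC checksums alongside the data while writing and verifies them while reading, so copying the file in and reading it back exercises the integrity check, with corruption surfacing as an org.apache.hadoop.fs.ChecksumException rather than as bad data.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifiedCopySketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path local = new Path("file:///home/ubuntu/derby.log");
        Path remote = new Path("/derby.log");

        // Checksums are computed and sent along with the data during the write.
        fs.copyFromLocalFile(local, remote);

        // Reading the file back verifies each chunk's CRC on the client side;
        // a mismatch throws ChecksumException instead of returning corrupt bytes.
        byte[] buffer = new byte[8192];
        try (FSDataInputStream in = fs.open(remote)) {
            while (in.read(buffer) != -1) {
                // Discard the bytes; the read itself performs the verification.
            }
        }
        System.out.println("Copy and read-back completed without checksum errors.");
    }
}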