
HDFS File Checksum

I am trying to check the consistency of a file after copying it to HDFS, using the Hadoop API DFSClient.getFileChecksum().

I am getting the following output from the code below:

Null
HDFS : null
Local : null

Can anyone point out the error or mistake? Here is the code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;


public class fileCheckSum {

    /**
     * @param args
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub

        Configuration conf = new Configuration();

        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");

        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");

        // System.out.println("HDFS PATH : " + hdfsPath.getName());
        // System.out.println("Local PATH : " + localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));


        if (null != hdfsChecksum && null != localChecksum) {
            System.out.println("HDFS Checksum : "+hdfsChecksum.toString()+"\t"+hdfsChecksum.getLength());
            System.out.println("Local Checksum : "+localChecksum.toString()+"\t"+localChecksum.getLength());

            if(hdfsChecksum.toString().equals(localChecksum.toString())){
                System.out.println("Equal");
            }else{
                System.out.println("UnEqual");

            }
        }else{
            System.out.println("Null");
            System.out.println("HDFS : "+hdfsChecksum);
            System.out.println("Local : "+localChecksum);

        }

    }

}
asked Jan 28 '13 by pradeep

People also ask

What is checksum in HDFS?

A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors. HDFS computes checksums for each data block and stores them in a separate hidden file in the same HDFS namespace.

What is the checksum of a file?

A checksum (also sometimes referred to as a hash) is an alphanumeric value that uniquely represents the contents of a file. Checksums are often used to verify the integrity of files downloaded from an external source, such as an installation file.

How do I add a checksum to a file?

To produce a checksum, you run a program that puts that file through an algorithm. Typical algorithms used for this include MD5, SHA-1, SHA-256, and SHA-512. These algorithms use a cryptographic hash function that takes an input and generates a fixed-length alphanumeric string regardless of the size of the file.
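
As a minimal Java sketch of that idea (the path is only an example, and the whole file is read into memory, so it only suits small files):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Sha256OfFile {
    public static void main(String[] args) throws Exception {
        // Read the file and run it through SHA-256.
        byte[] data = Files.readAllBytes(Paths.get("/home/ubuntu/derby.log"));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);

        // Render the digest as a fixed-length hex string, regardless of file size.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}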


1 Answer

Since you aren't setting a remote address in the conf (no fs.default.name / fs.defaultFS pointing at a NameNode) and are essentially using the same configuration for both, hadoopFS and localFS both end up as instances of LocalFileSystem.
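
For hadoopFS to actually be a DistributedFileSystem, the conf has to name the NameNode, either via a core-site.xml on the classpath or explicitly. A rough sketch, where hdfs://namenode:8020 is only a placeholder for your cluster's URI:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:8020");  // "fs.default.name" on older Hadoop releases
FileSystem hadoopFS = FileSystem.get(conf);
System.out.println(hadoopFS.getClass().getName()); // should print ...DistributedFileSystem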

getFileChecksum isn't implemented for LocalFileSystem, so it returns null. It is implemented for DistributedFileSystem: if your conf points to a distributed cluster, FileSystem.get(conf) returns a DistributedFileSystem instance whose checksum is an MD5 of MD5s of CRC32 checksums, computed over chunks of bytes.per.checksum bytes. The resulting value depends on the block size and on the cluster-wide bytes.per.checksum setting, which is why both parameters are also encoded in the algorithm name of the returned checksum: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum value.
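
As a sketch of what that looks like (assuming the conf above actually points at a cluster, and reusing the question's /derby.log path and imports):

FileSystem dfs = FileSystem.get(conf);                   // DistributedFileSystem now
FileChecksum hdfsChecksum = dfs.getFileChecksum(new Path("/derby.log"));
if (hdfsChecksum != null) {
    System.out.println(hdfsChecksum.getAlgorithmName()); // e.g. MD5-of-xxxMD5-of-yyyCRC32
    System.out.println(hdfsChecksum.getLength());        // length in bytes of the checksum value
    System.out.println(hdfsChecksum);                    // algorithm name plus the digest
}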

getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft map-reduce jobs that calculate equivalent local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file is written to or read from Hadoop.
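
If all you need is to confirm that the copied bytes match, a plain digest over both streams is a simpler check than getFileChecksum. This is just an illustrative sketch (not the distributed checksum, and it reads every byte of both copies), reusing the question's paths:

import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawDigestCompare {

    // Stream a file from any Hadoop FileSystem through an MD5 digest.
    static byte[] md5(FileSystem fs, Path path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = fs.open(path)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // must point at the cluster for the HDFS path
        byte[] hdfsDigest  = md5(FileSystem.get(conf), new Path("/derby.log"));
        byte[] localDigest = md5(FileSystem.getLocal(conf), new Path("file:///home/ubuntu/derby.log"));
        System.out.println(MessageDigest.isEqual(hdfsDigest, localDigest) ? "Equal" : "UnEqual");
    }
}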

answered by omid