HDFS File Comparison

Tags: hadoop, hive, hdfs

How can I compare two HDFS files, given that Hadoop provides no diff command?

I was thinking of loading the data from HDFS into two Hive tables and then comparing them with a join. Is there a better approach?

ftw asked Jan 23 '13 20:01


2 Answers

There is no diff command provided with Hadoop, but you can use process substitution in your shell (bash, for example) to feed both files to the regular diff command:

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)

If you only want to know whether the two files are identical, without seeing the actual differences, I would suggest a checksum-based approach instead: get the checksum of each file and compare the two. I think Hadoop doesn't need to compute the checksums on demand because they are already stored, so it should be fast, but I may be wrong. I didn't know of a command line option for that (newer Hadoop releases do provide hadoop fs -checksum), but you could easily do this with the Java API in a small app:

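// Needs org.apache.hadoop.fs.FileChecksum, FileSystem and Path; conf is a Hadoop Configuration.
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum can return null; equals() compares values, whereas == only compares references.
return chksum1 != null && chksum1.equals(chksum2);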
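One caveat with the getFileChecksum comparison above: as far as I know, HDFS derives file checksums from per-block CRCs, so two files with identical bytes but different block size or bytes-per-checksum settings may report different checksums. If that matters, a rough alternative is to hash the file contents yourself. The sketch below is only an illustration: StreamDigestCompare, sha256 and sameContent are hypothetical names, it uses plain JDK classes, and in-memory streams stand in for what fs.open(path) would return. Unlike the stored checksums, it has to read both files in full.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares two streams by the SHA-256 digest of their bytes. */
final class StreamDigestCompare {

    /** Reads the stream to the end and returns the SHA-256 digest of its bytes. */
    static byte[] sha256(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when both streams yield the same digest, i.e. the same bytes. */
    static boolean sameContent(InputStream a, InputStream b) {
        return MessageDigest.isEqual(sha256(a), sha256(b));
    }

    public static void main(String[] args) {
        // Assumption: in a real job these would be fs.open(new Path(...)) streams.
        InputStream a = new ByteArrayInputStream("row1\nrow2\n".getBytes(StandardCharsets.UTF_8));
        InputStream b = new ByteArrayInputStream("row1\nrow2\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(sameContent(a, b));
    }
}
```

Because the helper only depends on InputStream, the same code could be pointed at local files or at HDFS streams without changes.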
Charles Menguy answered Sep 30 '22 14:09


Well, the simplest answer is probably:

diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)

This runs entirely on your local machine: both files are streamed out of HDFS and compared by the local diff. If that's too slow, then yes, you'd have to do something with Hive and MapReduce, but that's a little trickier, and it won't exactly match the in-order, line-by-line comparison that diff does.
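To make the difference between the two comparisons concrete, here is a small sketch of what a join-style comparison effectively computes: per-line counts compared across the two inputs, with line order ignored. It is not Hive or MapReduce code. UnorderedLineCompare and its methods are hypothetical names, and small in-memory lists stand in for the files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating an order-insensitive, join-style comparison. */
final class UnorderedLineCompare {

    /** Counts how often each distinct line occurs. */
    static Map<String, Integer> countLines(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns count(left) - count(right) for every line whose counts differ.
     * Positive values mean the line occurs more often on the left.
     */
    static Map<String, Integer> countDifferences(List<String> left, List<String> right) {
        Map<String, Integer> diff = countLines(left);
        for (String line : right) {
            diff.merge(line, -1, Integer::sum);
        }
        // Lines with equal counts on both sides are not differences.
        diff.values().removeIf(v -> v == 0);
        return diff;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "c");
        List<String> reordered = Arrays.asList("c", "a", "b");
        // Same lines in a different order: no differences are reported,
        // although an in-order diff of the two inputs would flag changes.
        System.out.println(countDifferences(left, reordered).isEmpty());
    }
}
```

A distributed version would compute the same per-line counts with a group-by on each side and a full outer join between them, which is why reordered but otherwise identical files look equal to it.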

Joe K answered Sep 30 '22 14:09