HDFS File Comparison

Question

How can I compare two HDFS files, given that Hadoop has no diff command?

I was thinking of loading the data from HDFS into two Hive tables and then comparing them with join statements. Is there a better approach?

Charles Menguy · Accepted Answer

Hadoop does not ship a diff command, but you can use process substitution in your shell (bash or zsh) to feed both files to the local diff command:

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
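If you run this comparison often, the one-liner above can be wrapped in a small bash function. This is only a sketch: the function name hdfs_diff and the HDFS_CAT override variable are made up for illustration and are not part of Hadoop. The override exists so the function can be tried on local files with cat.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): hdfs_diff FILE1 FILE2
# HDFS_CAT is an assumed override hook for the reader command,
# e.g. HDFS_CAT=cat to try the function on local files.
hdfs_diff() {
  local reader="${HDFS_CAT:-hadoop fs -cat}"
  # $reader is deliberately unquoted so "hadoop fs -cat" splits into words.
  # Process substitution needs bash (or zsh), not plain sh.
  diff <($reader "$1") <($reader "$2")
}

# Example call (placeholder paths):
#   hdfs_diff /path/to/file /path/to/file2 && echo "identical"
```

diff exits with 0 when the inputs match and 1 when they differ, so the function can be used directly in an if statement. One caveat: a failed read inside a process substitution just looks like empty input to diff, so check that both paths exist if the result seems suspicious.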

If you only need to know whether the 2 files are identical, without seeing the differences, I would suggest a checksum-based approach instead: get the checksum of each file and compare them. HDFS already stores block-level checksums, so I think the file checksum can be derived from those without reading the data back, which should make it fast, but I may be wrong. Keep in mind that the default HDFS file checksum depends on the block size and the checksum chunk size, so this is only reliable for files written with the same settings. Older releases had no command line option for this (newer ones offer hadoop fs -checksum), but it is easy to do with the Java API in a small app:

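// Needs org.apache.hadoop.conf.Configuration and, from org.apache.hadoop.fs, FileChecksum, FileSystem and Path.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum can return null on file systems without checksum support.
// Compare with equals(); == would only compare object references.
return chksum1 != null && chksum1.equals(chksum2);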
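For what it's worth, Hadoop 2.x and later do have a hadoop fs -checksum command, so the same check can be sketched in the shell. The function name hdfs_same_checksum and the HDFS_CHECKSUM override variable are hypothetical, and the sketch assumes each output line has three whitespace-separated fields (path, algorithm, hex digest) and that the paths contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): hdfs_same_checksum FILE1 FILE2
# HDFS_CHECKSUM is an assumed override hook for the checksum command.
hdfs_same_checksum() {
  local cmd="${HDFS_CHECKSUM:-hadoop fs -checksum}"
  local sum1 sum2
  # Assumed output format per file: <path> <algorithm> <hex digest>.
  # Keep only algorithm and digest so the differing paths do not matter.
  sum1=$($cmd "$1" | awk '{print $2, $3}')
  sum2=$($cmd "$2" | awk '{print $2, $3}')
  # An empty result means the command failed or printed something unexpected,
  # so treat that case as "not identical".
  [ -n "$sum1" ] && [ "$sum1" = "$sum2" ]
}

# Example call (placeholder paths):
#   hdfs_same_checksum /path/to/file /path/to/file2 && echo "same checksum"
```

As with the Java version, matching checksums are only meaningful when both files were written with the same block size and checksum settings.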

Joe K · Answer

Well, the simplest answer is probably:

diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)

This streams both files to your local machine and compares them there. If that's too slow, then yes, you'd have to do something with Hive or MapReduce, but that's a little trickier, and it won't exactly match the line-by-line, in-order comparison that diff does.
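To see what an order-insensitive comparison looks like (closer to what a join would give you than to what diff gives you), here is a rough local sketch using sort and comm. It still pulls both files to your machine, so it does not solve the speed problem. The function name hdfs_unordered_diff and the HDFS_CAT override variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): hdfs_unordered_diff FILE1 FILE2
# HDFS_CAT is an assumed override hook, e.g. HDFS_CAT=cat for local files.
hdfs_unordered_diff() {
  local reader="${HDFS_CAT:-hadoop fs -cat}"
  # comm -3 hides lines common to both inputs. Lines only in FILE1 are
  # printed as-is; lines only in FILE2 are printed with a leading tab.
  # LC_ALL=C keeps sort and comm on the same byte-wise ordering.
  LC_ALL=C comm -3 \
    <($reader "$1" | LC_ALL=C sort) \
    <($reader "$2" | LC_ALL=C sort)
}

# Example call (placeholder paths); no output means the same set of lines:
#   hdfs_unordered_diff /path/to/file /path/to/file2
```

Two files holding the same lines in a different order produce no output here, whereas diff would report them as different.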

Tags: hadoop, hive, hdfs, ftw
