How can I compare two HDFS files, since there is no diff command?
I was thinking of loading the data from HDFS into Hive tables and then joining the two tables. Is there a better approach?
There is no diff command provided with Hadoop, but you can use process substitution in your shell with the regular diff command:
diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
If you just want to know whether the two files are identical, without caring what the differences are, I would suggest a checksum-based approach: get the checksum of each file and compare them. I think Hadoop doesn't need to generate the checksums because they are already stored, so it should be fast, but I may be wrong. Newer Hadoop releases expose this on the command line as hadoop fs -checksum; you can also do it with the Java API in a small app:
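// FileSystem, FileChecksum and Path come from org.apache.hadoop.fs;
// conf is an existing org.apache.hadoop.conf.Configuration.
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum may return null on file systems without checksum support.
// equals() compares the checksum values; == would only compare references.
return chksum1 != null && chksum1.equals(chksum2);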
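On Hadoop versions that ship hadoop fs -checksum, the same check can be sketched from the shell. This is a rough sketch, not a definitive implementation. It assumes each output line is tab-separated with the algorithm name and hex checksum as the last two fields, so check that against your version. The function names and the HDFS_CHECKSUM variable are made up for illustration (the variable only exists so the function can be tried without a cluster). One caveat: the default HDFS file checksum depends on block size and bytes-per-checksum settings, so two files with identical content written under different settings can still report different checksums.

```shell
# Print "algorithm<TAB>hex checksum" for one path, dropping the path column.
# HDFS_CHECKSUM is a hypothetical override, not a Hadoop setting.
hdfs_checksum() {
  checksum_cmd=${HDFS_CHECKSUM:-hadoop fs -checksum}
  # Assumed output layout: path<TAB>algorithm<TAB>hex
  $checksum_cmd "$1" | awk -F'\t' 'NF >= 3 && $NF != "" { print $(NF-1) "\t" $NF }'
}

# Exit status: 0 = checksums match, 1 = they differ, 2 = no usable checksum.
hdfs_same_checksum() {
  sum1=$(hdfs_checksum "$1")
  sum2=$(hdfs_checksum "$2")
  # Empty output means the command failed or reported no checksum.
  [ -n "$sum1" ] && [ -n "$sum2" ] || return 2
  [ "$sum1" = "$sum2" ]
}
```

Usage would look like hdfs_same_checksum /path/to/file /path/to/file2 && echo identical. A status of 1 is only a hint that the contents differ, for the reason given above; fall back to diff if you need certainty.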
Well, the simplest answer is probably:
diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)
This streams both files to your local machine and runs diff there. If that's too slow, then yes, you'd have to do something with Hive or MapReduce, but that's a little trickier, and it won't exactly match the in-order, line-by-line comparison that diff does.
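If line order doesn't matter to you (which is what a join-based comparison would give you anyway), a middle ground is to sort both files locally before comparing. This is a minimal sketch, assuming both files fit on local disk. The function name and the HDFS_CAT variable are made up for illustration; the variable only exists so the function can be tried on local files with HDFS_CAT=cat.

```shell
# Exit status: 0 = same lines regardless of order, 1 = different, 2 = read error.
# HDFS_CAT is a hypothetical override, not a Hadoop setting.
hdfs_same_lines() {
  cat_cmd=${HDFS_CAT:-hadoop fs -cat}
  tmp=$(mktemp -d) || return 2
  # Fetch each file fully first so a failed read is not mistaken for an empty file.
  if $cat_cmd "$1" > "$tmp/a" && $cat_cmd "$2" > "$tmp/b"; then
    sort "$tmp/a" > "$tmp/a.sorted"
    sort "$tmp/b" > "$tmp/b.sorted"
    cmp -s "$tmp/a.sorted" "$tmp/b.sorted"
    status=$?
  else
    status=2
  fi
  rm -rf "$tmp"
  return "$status"
}
```

Like the plain diff above, this still pulls everything through one machine, so it doesn't help with files that are too large for that.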