I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?
My use case is below:
What I am trying to do is create one table that will be my expected output. This table will contain all of the data that I expect to be created by executing the team's code against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.
Using Hive or Impala is costly when the data is too large, and we run into issues such as HBase region servers going down. So that approach is convenient when the data is small, but not for large data. In MapReduce, you can access one table through an HBase table object and read the second table by extending TableMapper. That way you can join the two tables.
Hey, list is the command that is used to list all the tables in HBase.
I don't know of anything out of the box but you can write a multi-table map/reduce.
The mappers will just emit the row keys from each table (with the value being all the HBase key-values for that row plus the table name). The reducer can check that it has two records for each key and compare their key-values. When there is only one record for a key, it can see which table is out of sync.
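To make the reducer logic concrete, here is a minimal in-memory sketch in plain Python. It does not use the HBase or Hadoop APIs: the dicts stand in for table scans, and the table names, function names, and status strings are all made up for illustration. It only shows the shape of the comparison, assuming each row's cells fit in a single record.

```python
from collections import defaultdict

# Hypothetical table names used to tag each record.
EXPECTED, ACTUAL = "expected", "actual"

def map_table(table_name, rows):
    """Mapper: emit (row_key, (table_name, cells)) for every row of one table."""
    for row_key, cells in rows.items():
        yield row_key, (table_name, cells)

def shuffle(pairs):
    """Group mapper output by row key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for row_key, tagged in pairs:
        grouped[row_key].append(tagged)
    return grouped

def reduce_row(tagged_records):
    """Reducer: classify one row key, or return None when both tables agree."""
    by_table = dict(tagged_records)
    if ACTUAL not in by_table:
        return "only_in_expected"
    if EXPECTED not in by_table:
        return "only_in_actual"
    if by_table[EXPECTED] != by_table[ACTUAL]:
        return "mismatch"
    return None

def compare_tables(expected_rows, actual_rows):
    """Run both mappers, shuffle, and collect every row key that differs."""
    pairs = list(map_table(EXPECTED, expected_rows))
    pairs += list(map_table(ACTUAL, actual_rows))
    report = []
    for row_key, tagged_records in sorted(shuffle(pairs).items()):
        status = reduce_row(tagged_records)
        if status is not None:
            report.append((row_key, status))
    return report
```

An empty report means the two tables hold the same rows. In a real job the mapper would receive each row from a table scan and would need to know which table the row came from, so that it can tag the record.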
I know this question is a little old, but how large are the tables? If they will both fit into memory, you could load them into Pig using HBaseStorage, then use Pig's built-in DIFF function to compare the resulting bags.
This will work even with large tables that don't fit into memory, according to the docs, but it will be extremely slow.
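Conceptually, a diff of two bags returns the tuples that appear in one bag but not the other. The plain-Python sketch below illustrates that idea only. It is not Pig code, it treats each bag as a set (an assumption of this sketch, so duplicates are ignored), and the function name and the (row key, column, value) tuple layout are made up for illustration.

```python
def bag_diff(bag_a, bag_b):
    """Return the tuples present in exactly one of the two bags."""
    # Symmetric difference: in bag_a or bag_b, but not in both.
    return set(bag_a) ^ set(bag_b)

# Hypothetical rows flattened to (row_key, column, value) tuples.
expected = [("r1", "cf:a", "1"), ("r2", "cf:a", "2")]
actual = [("r1", "cf:a", "1"), ("r2", "cf:a", "X")]

differences = bag_diff(expected, actual)
```

An empty result means the two tables match. Any tuple in the result points at a row or cell that exists in only one table or whose value differs.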