
How do I diff two tables in HBase

Tags: hadoop, hbase

I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?

My use case is below:

I want to create one table that holds my expected output. It will contain all of the data I expect to be created by running the team's code against an input file. I will then diff the actual output table against the expected output table to verify the integrity of the component under test.

RHicke asked Sep 18 '13 03:09




2 Answers

I don't know of anything out of the box, but you can write a multi-table map/reduce job.

The mappers just emit the row key from each table, with a value made up of all the HBase key-values for that row plus the table name. The reducer can check that it has two records for each key and compare their key-values. When there is only one record for a key, it can tell which table is out of sync.
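To make the comparison logic concrete, here is a minimal in-memory sketch in plain Python. It does not use the HBase or Hadoop APIs: each table is assumed to be a dict mapping a row key to a dict of cell values, and the names `map_table`, `reduce_row`, and `diff_tables` are made up for illustration. A real job would read both tables through HBase's MapReduce integration instead.

```python
from collections import defaultdict

def map_table(table_name, rows):
    """Mapper step: emit (row_key, (table_name, cells)) for every row of one table."""
    for row_key, cells in rows.items():
        yield row_key, (table_name, cells)

def reduce_row(row_key, records, expected_name, actual_name):
    """Reducer step: compare the records collected for one row key.

    Returns None when both tables agree, otherwise (row_key, reason).
    """
    by_table = dict(records)  # table_name -> cells
    if expected_name not in by_table:
        return (row_key, "unexpected row in " + actual_name)
    if actual_name not in by_table:
        return (row_key, "missing from " + actual_name)
    if by_table[expected_name] != by_table[actual_name]:
        return (row_key, "cell values differ")
    return None

def diff_tables(expected, actual):
    """Run the map, shuffle, and reduce steps locally and return the mismatches."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for name, rows in (("expected", expected), ("actual", actual)):
        for row_key, record in map_table(name, rows):
            grouped[row_key].append(record)
    results = (
        reduce_row(key, grouped[key], "expected", "actual")
        for key in sorted(grouped)
    )
    return [r for r in results if r is not None]
```

An empty list from `diff_tables` means the two tables match, which makes it easy to use as a pass/fail check in an automated ETL validation.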

Arnon Rotem-Gal-Oz answered Sep 20 '22 21:09


I know this question is a little old, but how large are the tables? If they both fit into memory, you could load them into Pig using HBaseStorage, then use Pig's built-in DIFF function to compare the resulting bags.

According to the docs, this also works with large tables that don't fit into memory, but it will be extremely slow.

Brian Schrameck answered Sep 22 '22 21:09