HBase

Question

NOTE : Its a few hours ago that I have begun HBase and I come from an RDBMS background :P

I have a RDBMS-like table CUSTOMERS having the following columns:

CUSTOMER_ID STRING
CUSTOMER_NAME STRING
CUSTOMER_EMAIL STRING
CUSTOMER_ADDRESS STRING
CUSTOMER_MOBILE STRING

I have thought of the following HBase equivalent :

table : CUSTOMERS rowkey : CUSTOMER_ID

column family : CUSTOMER_INFO

columns : NAME EMAIL ADDRESS MOBILE

From whatever I have read, a primary key in an RDBMS table is roughly similar to a HBase table's rowkey. Accordingly, I want to keep CUSTOMER_ID as the rowkey.

My questions are dumb and straightforward :

Irrespective of whether I use a shell command or the HBaseAdmin java class, how do I define the rowkey? I didn't find anything to do it either in the shell or in the HBaseAdmin class(some thing like HBaseAdmin.createSuperKey(...))
Given a HBase table, how to determine the rowkey details i.e which are the values used as rowkey?
I understand that rowkey design is a critical thing. Suppose a customer id is receives values like CUST_12345, CUST_34434 and so on, how will HBase use the rowkey to decide in which region do particular rows reside(assuming that region concept is similar to DB horizontal partitioning)?

***Edited to add sample code snippet

I'm simply trying to create one row for the customer table using 'put' in the shell. I did this :

hbase(main):011:0> put  'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:NAME','Omkar Joshi'
0 row(s) in 0.1030 seconds

hbase(main):012:0> scan 'CUSTOMERS'
ROW                              COLUMN+CELL
 CUSTID12345                     column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0500 seconds

hbase(main):013:0> put  'CUSTOMERS', 'CUSTID614', 'CUSTOMER_INFO:NAME','Prachi Shah', 'CUSTOMER_INFO:EMAIL','Prachi.Shah@lntinfotech.com'

ERROR: wrong number of arguments (6 for 5)

Here is some help for this command:
Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.  To put a cell value into table 't1' at
row 'r1' under column 'c1' marked with the time 'ts1', do:

  hbase> put 't1', 'r1', 'c1', 'value', ts1


hbase(main):014:0> put  'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:EMAIL','Omkar.Joshi@lntinfotech.com'
0 row(s) in 0.0160 seconds

hbase(main):015:0>
hbase(main):016:0* scan 'CUSTOMERS'
ROW                              COLUMN+CELL
 CUSTID12345                     column=CUSTOMER_INFO:EMAIL, timestamp=1365600369284, value=Omkar.Joshi@lntinfotech.com
 CUSTID12345                     column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0230 seconds

As put takes max. 5 arguments, I was not able to figure out how to insert the entire row in one put command. This is resulting in incremental versions of the same row which isn't required and I'm not sure if CUSTOMER_ID is being used as a rowkey ! Thanks and regards !

Arnon Rotem-Gal-Oz · Accepted Answer

You don't, the key (and any other column for that matter) is a bytearray you can put whatever you want there- even encapsulate sub-entities
Not sure I understand that - each value is stored as key+column family + column qualifier + datetime + value - so the key is there.
HBase figures out which region a record will go to as it goes. When regions gets too big it repartitions. Also from time to time when there's too much junk HBase performs compactions to rearrage the files. You can control that when you pre-partition yourself, which is somehting you should definitely think about in the future. However, since it seems you are just starting out with HBase you can start with HBase taking care of that. Once you understand your usage patterns and data better you will probably want to go over that again.

You can read/hear a little about HBase schema design here and here

HBase - rowkey basics

Tags:

nosql

Kaliyug Antagonist

1 Answers

Arnon Rotem-Gal-Oz

Recent Activity

Donate For Us