
How to scan a numeric range in hbase

Tags:

java

hadoop

hbase

My row keys in HBase are numbers of varying length, like 1, 2, 3, ..., 32423480, 32423481, ...

When I use

scan 'table', {STARTROW => '1', ENDROW => '3'}

to scan the table, I only want the results with row keys 1, 2 and 3, but it returns all the rows whose keys start with 1, 2 or 3, like 1003423, 200034, ...

Is it possible to filter the row key range numerically using the HBase shell or the Java API?

Thanks

asked Jan 06 '23 02:01 by very fat


2 Answers

I am more familiar with Apache Accumulo (another BigTable implementation) but I believe that HBase operates similarly.

Keys are sorted lexicographically, so as you've observed '11' sorts before '2'. Typically you format the keys so that the lexicographic order matches the order that makes sense in your domain. For instance, if your keys' maximum value is 99999, you could zero-pad every key to 5 characters.

1  becomes 00001
2  becomes 00002
11 becomes 00011
etc

This way HBase will sort your keys according to the expected numeric ordering and you can scan for ranges like (00001, 00003).
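
The padding idea can be sketched in plain Java, without an HBase dependency. The class name `PaddedKeys`, the helper `pad`, and the width of 8 (chosen because the largest key in the question has 8 digits) are assumptions made for illustration. Java's `String` ordering stands in for HBase's byte ordering here, which is equivalent for ASCII digit strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PaddedKeys {
    // Hypothetical helper: left-pad a non-negative number with zeros to a fixed width.
    static String pad(long n, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", n);
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 11, 200034, 1003423};

        List<String> raw = new ArrayList<>();
        List<String> padded = new ArrayList<>();
        for (long id : ids) {
            raw.add(Long.toString(id));
            padded.add(pad(id, 8)); // width 8 is an assumption; use your own maximum
        }

        // Natural String order is lexicographic, as in the row key comparison.
        Collections.sort(raw);
        Collections.sort(padded);

        // Unpadded: [1, 1003423, 11, 2, 200034, 3]
        System.out.println(raw);
        // Padded: [00000001, 00000002, 00000003, 00000011, 00200034, 01003423]
        System.out.println(padded);
    }
}
```

With padded keys, the start and stop rows of a scan would be padded the same way, for example `pad(1, 8)` and `pad(3, 8)`.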

answered Jan 07 '23 15:01 by jeff


It looks like the keys in your HBase table are stored as strings. That means numbers like 1, 2, 3, etc. are located in different parts of the table, with many other keys between them. So the answer to your question is: it is not possible to scan the numeric range you want with a single scan operation.

But you have two possible ways to solve your problem:

1) Change the schema of your keys. Convert the keys to integers and store them in HBase in binary form. They will then be stored as 4-byte arrays (or 8-byte arrays if you use long integers) and, as long as the values are non-negative and encoded big-endian, sorted in HBase in exactly numeric order. This schema is memory efficient but isn't shell-friendly, because by default you can only type string keys in the HBase shell. If you want a shell-friendly but less memory-efficient approach, you can use the solution from jeff's answer.

2) If you don't want to move all your data to the new key schema, you can use Get operations instead of a Scan. Just issue one Get per element in your range. Obviously this method is much less efficient than a single scan, but it lets you fetch all the data you want without transforming it.
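
Both options can be sketched in plain Java. The class `NumericRowKeys` and its methods are hypothetical names used only for illustration, and the sketch deliberately avoids the HBase client so that it runs on its own. As far as I know, HBase's `Bytes.toBytes(int)` produces the same big-endian layout as `encode` below, and the keys produced for option 2 would each be wrapped in a `Get` and could be sent together with `Table.get(List<Get>)`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class NumericRowKeys {
    // Option 1: encode a non-negative int as 4 big-endian bytes
    // (ByteBuffer is big-endian by default).
    static byte[] encode(int n) {
        return ByteBuffer.allocate(4).putInt(n).array();
    }

    // Unsigned byte-by-byte comparison, mimicking how row keys are ordered.
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Option 2: list the existing string keys for a numeric range,
    // one entry per Get that would be issued.
    static List<String> stringKeysInRange(int startInclusive, int endInclusive) {
        List<String> keys = new ArrayList<>();
        for (long i = startInclusive; i <= endInclusive; i++) {
            keys.add(Long.toString(i));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Binary keys: 3 sorts before 1003423, unlike the string keys "3" and "1003423".
        System.out.println(compareUnsigned(encode(3), encode(1003423)) < 0); // true
        System.out.println("3".compareTo("1003423") < 0);                    // false

        // Keys to fetch individually for the range 1..3.
        System.out.println(stringKeysInRange(1, 3)); // [1, 2, 3]
    }
}
```

Note that the binary ordering only matches numeric ordering for non-negative values: a negative int has its high bit set and would sort after all positive keys under unsigned comparison.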

answered Jan 07 '23 14:01 by maxteneff