
Why is the HBase count operation so slow?

The command is:

count 'tableName'

It's very slow to get the total row number of the whole table.

My situation is:

  • I have one master and two slaves, each node with 16 CPUs and 16 GB of memory.

  • My table only has one column family with two columns: title and content.

  • The title column is at most 100 bytes; the content column may be up to 5 MB.

  • Right now the table has 1550 rows; every time I count the rows, it takes about 2 minutes.

I'm very curious why HBase is so slow on this operation; I'd guess it's even slower than MySQL. Is Cassandra faster than HBase at these operations?

Jack asked Apr 27 '15


2 Answers

First of all, you have a very small amount of data. At that volume, IMO, using NoSQL provides you no advantage, and your test is not an appropriate way to judge the performance of HBase or Cassandra. Both have their own use cases and sweet spots.

The count command in the HBase shell runs a single-threaded Java program to count rows. Still, I am surprised that it takes 2 minutes to count 1550 rows. If you want faster counts (for a bigger dataset), you should run HBase's RowCounter MapReduce job.
Run the MapReduce job like this (substituting your own table name):

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'tableName'

Anil Gupta answered Sep 28 '22


First of all, keep in mind that to benefit from data locality, your "slaves" (better known as RegionServers) must also hold the DataNode role; not doing so is a performance killer.

For performance reasons, HBase does not maintain a live counter of rows. To perform a count, the HBase shell client needs to retrieve all the data, which means that if your average row holds 5 MB of data, the client has to pull roughly 5 MB × 1550 rows from the RegionServers just to count them, which is a lot.
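As a quick mitigation inside the shell, you can raise the scanner caching so count fetches more rows per RPC round trip (the value 1000 here is an arbitrary example; all the data is still transferred, so this only helps so much):

```
count 'tableName', CACHE => 1000
```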

To speed it up you have 2 options:

  • If you need realtime responses, you can maintain your own live row counter using HBase atomic counters: each time you insert, increment the counter; each time you delete, decrement it. It can even live in the same table; just use another column family to store it.

  • If you don't need realtime results, run a distributed row-counting map-reduce job (source code), forcing the scan to use only the smallest column family & column available so it avoids reading the big rows. Each RegionServer will read its locally stored data and no network I/O will be required. In this case you may need to add a new column with a small value to your rows if you don't have one yet (a boolean is your best option).
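The first option can be sketched directly in the HBase shell. The row key stats and the column family counters below are hypothetical; any key/family you reserve for bookkeeping works (from Java you would use the Table.incrementColumnValue API instead):

```
# hypothetical bookkeeping cell; the 'counters' family must exist on the table
incr 'tableName', 'stats', 'counters:rows', 1      # run on each insert
incr 'tableName', 'stats', 'counters:rows', -1     # run on each delete
get_counter 'tableName', 'stats', 'counters:rows'  # read the live count
```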
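For the second option, RowCounter accepts an optional list of columns to scan, so you can restrict it to a small column and skip the 5 MB content cells entirely. The column name f:flag here is hypothetical, standing in for the small boolean column suggested above:

```
bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'tableName' f:flag
```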

Rubén Moraleda answered Sep 28 '22