
Is HBase batch put put(List<Put>) faster than put(Put)? What is the capacity of a Put object?

Tags:

hbase

I am working on a batch job to process a batch of Put objects into HBase through HTableInterface. There are two API methods, HTableInterface.put(List) and HTableInterface.put(Put).

I am wondering, for the same number of Put objects, is the batch put faster than putting them one by one?

Another question: I am putting a very large Put object, which caused the job to fail. There seems to be a limit on the size of a Put object. How large can it be?

Derek Zhang asked Feb 26 '15 22:02


4 Answers

put(List<Put> puts) and put(Put aPut) are the same under the hood: they both call doPut(List<Put> puts).

What matters is the client write buffer size, as mentioned by @ozhang. The default value is 2 MB:

<property>   
     <name>hbase.client.write.buffer</name>
     <value>2097152</value> 
</property>

There will be one RPC every time the write buffer fills up and a flushCommits() is triggered. So if your application is flushing too often because your objects are relatively big, experimenting with a larger write buffer may solve the problem.
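To make the effect of the buffer size concrete, here is a minimal sketch of the buffering behaviour described above. It is a toy model, not HBase code: the class `WriteBufferModel` and its methods are hypothetical, it counts flushes rather than real RPCs, and it assumes a flush fires as soon as the buffered bytes exceed the configured capacity.

```java
/** Toy model of a client-side write buffer; hypothetical, NOT the HBase implementation. */
final class WriteBufferModel {
    private final long capacityBytes;
    private long bufferedBytes = 0;
    private int flushes = 0;

    WriteBufferModel(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Buffer one mutation; flush once the buffered bytes exceed the capacity (assumed rule). */
    void put(long sizeBytes) {
        bufferedBytes += sizeBytes;
        if (bufferedBytes > capacityBytes) {
            flush();
        }
    }

    /** Send whatever is buffered; an empty buffer does not count as a flush. */
    void flush() {
        if (bufferedBytes > 0) {
            flushes++;
            bufferedBytes = 0;
        }
    }

    /** Number of flushes needed to write `count` mutations of `sizeBytes` each. */
    static int flushesFor(long capacityBytes, int count, long sizeBytes) {
        WriteBufferModel model = new WriteBufferModel(capacityBytes);
        for (int i = 0; i < count; i++) {
            model.put(sizeBytes);
        }
        model.flush(); // final explicit flush for the remainder
        return model.flushes;
    }

    public static void main(String[] args) {
        // 100 mutations of 100 KB each
        System.out.println(flushesFor(2097152L, 100, 102400L));  // 2 MB buffer: prints 5
        System.out.println(flushesFor(20971520L, 100, 102400L)); // 20 MB buffer: prints 1
    }
}
```

In this model the same 100 mutations need five flushes with the 2 MB default but only one with a 20 MB buffer, which is the trade-off to weigh against the extra client memory a larger buffer holds.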

Thai Bui answered Oct 28 '22 13:10


If your KeyValue size is large, then using a list of puts may run into the client-side write buffer size limit.

<property>   
    <name>hbase.client.write.buffer</name>
    <value>20971520</value> 
</property>

By default the client collects up to 2 MB of data and then flushes it, so you may also have to increase this value, as in the setting above (20 MB).

ozhang answered Oct 28 '22 13:10


For batch puts it's better to construct a list of puts and then call HTableInterface.put(List<Put> puts), because it uses a single RPC call to commit the batch. Depending on the size of the list, however, the write buffer may or may not flush all of it at once.

Sleiman Jneidi answered Oct 28 '22 15:10


By using the put(List<Put> puts) method you will definitely save the overhead of multiple RPC requests versus one.

About the very large Put object: by default the maximum KeyValue size is limited to 10 MB. I think you have to increase that limit to store bigger KeyValue objects.

hbase.client.keyvalue.maxsize

Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.

Default: 10485760
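The rule in that description can be sketched as a small size check. This is a toy model, not HBase code: the class `KeyValueSizeModel` and its methods are hypothetical, and it assumes the limit is applied to each KeyValue (cell) separately rather than to the Put as a whole, which is how the quoted description reads.

```java
import java.util.List;

/** Toy model of the KeyValue size check; hypothetical, NOT the HBase implementation. */
final class KeyValueSizeModel {
    /** Default of hbase.client.keyvalue.maxsize quoted above (10 MB). */
    static final long DEFAULT_MAX_BYTES = 10485760L;

    /** A limit of zero or less disables the check, as the description states. */
    static boolean fits(long keyValueBytes, long maxBytes) {
        return maxBytes <= 0 || keyValueBytes <= maxBytes;
    }

    /** Assumption: every KeyValue in a Put is checked on its own. */
    static boolean allFit(List<Long> keyValueSizes, long maxBytes) {
        for (long size : keyValueSizes) {
            if (!fits(size, maxBytes)) {
                return false;
            }
        }
        return true;
    }
}
```

Under that assumption a single 12 MB KeyValue is rejected with the default limit, while the same data split into two 6 MB KeyValues passes; raising the limit or setting it to zero would also let the 12 MB value through.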

Alexander answered Oct 28 '22 14:10