I am working on a batch job that writes a large number of Put objects into HBase through HTableInterface. There are two API methods, HTableInterface.put(List<Put>) and HTableInterface.put(Put).
I am wondering: for the same number of Put objects, is the batch put faster than putting them one by one?
Another question: one very large Put object caused my job to fail. There seems to be a limit on the size of a Put object. How large can it be?
put(List<Put> puts) and put(Put aPut) are the same under the hood: they both call doPut(List<Put> puts).
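A minimal sketch of the batched form against the pre-1.0 client API (the table name, column family, qualifier, and row keys below are all placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTableInterface table = new HTable(conf, "my_table"); // placeholder table name
        try {
            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < 10000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                // "cf" and "q" are placeholder column family and qualifier
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                puts.add(put);
            }
            // Ends up in the same doPut() path as 10,000 individual put(Put)
            // calls, but hands the client the whole batch in one call.
            table.put(puts);
        } finally {
            table.close();
        }
    }
}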
What matters is the client write buffer size, as mentioned by @ozhang. The default value is 2 MB:
<property>
<name>hbase.client.write.buffer</name>
<value>2097152</value>
</property>
There will be one RPC every time the write buffer fills up and a flushCommits() is triggered. So if your application is flushing too often because your objects are relatively big, increasing the write buffer size may solve the problem.
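As a rough sketch of tuning this from the client side (assuming the pre-1.0 HTable client, where these setters live; the table name and 8 MB figure are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class WriteBufferTuning {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table"); // placeholder table name
        try {
            table.setAutoFlush(false);                  // stop flushing after every single put
            table.setWriteBufferSize(8L * 1024 * 1024); // raise the 2 MB default to 8 MB
            // ... issue your Puts here ...
            table.flushCommits();                       // push out whatever is still buffered
        } finally {
            table.close();
        }
    }
}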
If your KeyValue size is large, using a list of Puts may run into the client-side write buffer limit. In that case, raise hbase.client.write.buffer, for example to 20 MB:
<property>
<name>hbase.client.write.buffer</name>
<value>20971520</value>
</property>
The client collects up to 2 MB of data by default and then flushes it, so you also have to increase this value for large batches.
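The same value can also be set programmatically on the client Configuration instead of in hbase-site.xml; a small sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientBufferConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Same effect as the hbase-site.xml snippet above: a 20 MB client write buffer.
        conf.setLong("hbase.client.write.buffer", 20971520L);
        return conf;
    }
}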
For batch puts it's better to construct a list of Puts and then call HTableInterface.put(List<Put> puts), because it commits the batch through a single client call; whether the write buffer flushes it all in one RPC or in several depends on the size of the list.
You will definitely save the overhead of many individual RPC requests by using the put(List<Put> puts) method.
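For contrast, a sketch of the two styles side by side (assuming an existing table handle and list of Puts):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;

public class PutStyles {
    // One-by-one: every Put crosses the client API in its own call.
    static void putOneByOne(HTableInterface table, List<Put> puts) throws IOException {
        for (Put p : puts) {
            table.put(p);
        }
    }

    // Batched: one call hands the whole list to the client layer.
    static void putBatched(HTableInterface table, List<Put> puts) throws IOException {
        table.put(puts);
    }
}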
About the very large Put object: by default there is a limit of 10 MB on the maximum KeyValue size. I think you have to increase that to store bigger KeyValue objects. From the HBase documentation:
hbase.client.keyvalue.maxsize
Specifies the combined maximum allowed size of a KeyValue instance. This sets an upper boundary for a single entry saved in a storage file. Since KeyValues cannot be split, it helps avoid the situation where a region can no longer be split because its data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.
Default: 10485760
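If you do need larger cells, this limit can be raised on the client Configuration; a sketch (the 50 MB figure is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class KeyValueLimit {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Raise the 10 MB default; zero or less disables the check entirely.
        conf.setInt("hbase.client.keyvalue.maxsize", 50 * 1024 * 1024); // e.g. 50 MB
        return conf;
    }
}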