The HBase documentation advises against creating more than 2-3 column families because HBase does not handle more than that very well. The reason is compaction and flushing, and hence the I/O. However, if all my columns are always populated (for every row), I think this reasoning matters less. So, given that my access to columns is completely random (I want to read any combination of columns), can I use a one-column-per-column-family configuration, effectively making the table purely columnar?
There are many blogs and wikis explaining this, but they all seem to contradict each other and add more confusion. I can't digest the fact that HBase prefers one column family. If that is so, what's the point of calling it a column store?
An HBase table contains column families, which are the logical and physical grouping of columns. Inside a column family are column qualifiers, which are the columns. Each column holds timestamped versions of its value. Columns only exist when they are inserted, which makes HBase a sparse database.
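To make the layering concrete, here is a minimal sketch of the logical model as nested dictionaries. It is plain Python, not the HBase client API; the names `table`, `put`, and `get_latest` are made up for illustration, and real HBase stores cells very differently on disk.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the logical model (NOT the HBase API):
# row key -> column family -> column qualifier -> {timestamp: value}
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(row, family, qualifier, value, ts):
    """Store one versioned cell; columns never written take up no space."""
    table[row][family][qualifier][ts] = value

def get_latest(row, family, qualifier):
    """Return the newest version of a cell, or None if it was never written."""
    # .get() is used so that a lookup does not create empty entries.
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]

# Two versions of the same cell, and a second row with a different column.
put("page1", "content", "html", "<p>v1</p>", ts=1)
put("page1", "content", "html", "<p>v2</p>", ts=2)
put("page2", "meta", "words", 120, ts=1)
```

Here `get_latest("page1", "content", "html")` returns the `ts=2` value, while `get_latest("page2", "content", "html")` returns `None` because that column was never inserted for `page2`. That is the sparseness described above.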
Technically, HBase can manage more than three or four column families. However, you need to understand how column families work to make the best use of them.
HBase is a column-oriented database, and its tables are sorted by row key. The table schema defines only column families, which hold key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk.
Simply press the "+" button in the "Alter table" page and add your new column family with all settings you need.
Currently (though this is expected to change), all of the column families for a region are flushed together. This is the primary reason why people say "HBase doesn't do well with more than 2 or 3 column families". Consider two CFs, each with one column. Column A:A stores whole web page texts. Column B:B stores the number of words in the page. Every time we flush A:A (which will happen more often because A:A's data is far bigger), we also need to go through a whole separate file I/O juggling routine for column B:B, even though there is no need to: with B:B only holding numbers, I could go for months without flushing it.
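A rough way to see the effect is a toy simulation, assuming the flush-together behavior described above. This is plain Python, not HBase code; `simulate_flushes`, the byte sizes, and the threshold are all invented for illustration, and the trigger (total buffered bytes in the region reaching a threshold) is a simplification.

```python
def simulate_flushes(writes, flush_threshold):
    """Count the files written per column family when every flush of the
    region flushes all non-empty families together (simplified model).

    writes: iterable of (family, size_in_bytes) tuples in arrival order.
    Returns a dict: family -> list of flushed file sizes in bytes.
    """
    memstore = {}  # family -> bytes currently buffered in memory
    files = {}     # family -> sizes of the files flushed so far
    for family, size in writes:
        memstore[family] = memstore.get(family, 0) + size
        # Assumption: a flush fires once the region's total buffered
        # bytes reach the threshold, and it flushes every family.
        if sum(memstore.values()) >= flush_threshold:
            for fam, buffered in memstore.items():
                if buffered > 0:
                    files.setdefault(fam, []).append(buffered)
                memstore[fam] = 0
    return files

# 1000 rows: a 10,000-byte page text in A:A and an 8-byte word count in B:B.
writes = []
for _ in range(1000):
    writes.append(("A", 10_000))
    writes.append(("B", 8))

files = simulate_flushes(writes, flush_threshold=1_000_000)
```

With these made-up numbers, family B ends up with exactly as many files as family A, but each B file is under a kilobyte. Those tiny files still have to be written, tracked, and later compacted, which is the wasted I/O the paragraph above is describing.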
If you store A and B in the same column family (A:A and A:B), you will probably see vastly better flush I/O performance, and because most HBase reads are purely from the memstore, you will probably find that read speeds are equivalent.
Also, and perhaps more importantly, if the cardinality of the columns is wildly different, then your regionservers will need to maintain useless mostly-empty files for your less-dense column families. This will never change.
All of this is available in the HBase Book.
So, as in all such performance situations, measure before deciding what the "correct" path is.