
HBase column family

Tags:

hbase

The HBase documentation advises against creating more than 2-3 column families, because HBase does not handle more than that very well. The reason given is compaction and flushing, and hence the I/O they generate. However, if all my columns are always populated (for every row), then I think this reasoning matters less. So, considering that my access to columns is completely random (I want to access any combination of columns), can I use a one-column-per-column-family configuration (effectively trying to make it purely columnar)?

There are many blogs and wikis explaining this, but they seem to contradict each other and add more confusion. I just can't digest the fact that HBase prefers one column family; if so, what's the point of calling it a column store?

PrakashT asked Mar 05 '12 14:03



1 Answer

Currently (though this is expected to change), all of the column families for a region are flushed together. This is the primary reason people say "HBase doesn't do well with more than 2 or 3 column families". Consider two CFs, each with one column. Column A:A stores the full text of web pages. Column B:B stores the number of words in each page. Every time we flush A:A (which happens more often, because A:A's data is far bigger), we also have to go through a whole separate file I/O juggling routine for column B:B, even though there is no need to: with B:B holding only numbers, it could go for months without needing a flush.
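To make the A:A / B:B example concrete, here is a toy simulation of a region-wide flush. It is only a sketch, not HBase code: the flush threshold, the cell sizes, and the helper names `simulate_flushes` and `page_writes` are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical region-wide memstore limit in bytes (an assumption, not a real setting).
FLUSH_THRESHOLD = 128 * 1024 * 1024


def simulate_flushes(writes, threshold=FLUSH_THRESHOLD):
    """Toy model of a region-wide flush.

    `writes` is an iterable of (column_family, size_in_bytes) tuples.
    When the total buffered size across all families reaches `threshold`,
    every family with buffered data writes its own file.
    Returns {column_family: [flushed_file_size, ...]}.
    """
    memstore = defaultdict(int)   # bytes buffered per column family
    files = defaultdict(list)     # sizes of flushed files per column family
    for family, size in writes:
        memstore[family] += size
        if sum(memstore.values()) >= threshold:
            # All families are flushed together, however little they hold.
            for fam, buffered in memstore.items():
                if buffered:
                    files[fam].append(buffered)
            memstore.clear()
    return dict(files)


def page_writes(n_pages, text_family, count_family,
                page_bytes=100 * 1024, count_bytes=8):
    """Each page yields one large text cell and one small word-count cell."""
    for _ in range(n_pages):
        yield (text_family, page_bytes)
        yield (count_family, count_bytes)


# Layout 1: text in family A, word count in family B.
two_families = simulate_flushes(page_writes(10_000, "A", "B"))
# Layout 2: both columns in family A (A:A and A:B).
one_family = simulate_flushes(page_writes(10_000, "A", "A"))

print({fam: len(sizes) for fam, sizes in two_families.items()})
print({fam: len(sizes) for fam, sizes in one_family.items()})
```

In this model the two-family layout writes twice as many files for the same data, and every file for B is tiny compared with the files for A. Real regionservers are far more involved, so treat this as intuition only.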

If you store A and B in the same column family (A:A and A:B), you will probably see vastly better flush I/O performance, and because reads of recently written data are served from the memstore, you will probably find that read speeds are equivalent.

Also, and perhaps more importantly, if the cardinality of the columns is wildly different, then your regionservers will need to maintain useless mostly-empty files for your less-dense column families. This will never change.
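One rough way to check whether your column families differ this much is to sample some rows and compute how often each family is populated. The sketch below assumes the sample has already been fetched into a plain dictionary; `family_density` and the row shape are hypothetical and not part of any HBase client API.

```python
from collections import Counter


def family_density(rows):
    """Return {column_family: fraction of sampled rows with a cell in it}.

    `rows` maps row key -> {"family:qualifier": value}. This shape is an
    assumption for illustration, not an HBase client data structure.
    """
    counts = Counter()
    for cells in rows.values():
        # Count each family at most once per row.
        families = {column.split(":", 1)[0] for column in cells}
        counts.update(families)
    total = len(rows)
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


# Hypothetical sample: family A is populated in every row, B in one row only.
sample = {
    "page1": {"A:text": "lorem ipsum", "B:lang": "en"},
    "page2": {"A:text": "dolor sit"},
    "page3": {"A:text": "amet"},
    "page4": {"A:text": "consectetur"},
}
density = family_density(sample)
print(density)
```

A family with a very low density next to one near 1.0 is the kind of mismatch described above, and a candidate for merging into the denser family or moving to its own table.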

All of this is available in the HBase Book.

So, as in all such performance situations, measure before deciding what the "correct" path is.

Chris Shain answered Nov 13 '22 06:11