 

HBase: Create multiple tables or single table with many columns?

When does it make sense to create multiple tables as opposed to a single table with a large number of columns? I understand that tables typically have only a few column families (1-2) and that each column family can support 1000+ columns.

When does it make sense to create separate tables when HBase seems to perform well with a potentially large number of columns within a single table?

spdcbr asked Jul 13 '18




1 Answer

Before answering the question itself, let me first state some of the major factors that come into play. I am going to assume that the file system in use is HDFS.

  1. A table is divided into non-overlapping partitions of the keyspace called regions.

  2. The key-range -> region mapping is stored in a special, single-region table called meta (hbase:meta).

  3. The data for one HBase column family in a region is stored in a single HDFS directory. It's usually several files, but for all intents and purposes we can assume that a region's data for a column family is stored in a single file on HDFS, called a StoreFile / HFile.

  4. A StoreFile is essentially a sorted file containing KeyValues. A KeyValue logically represents the following, in order: (RowLength, RowKey, FamilyLength, FamilyName, Qualifier, Timestamp, Type). For example, if you have only two KVs in your region for a CF, where the row key is the same but the values are in two different columns, this is what the StoreFile will look like (except that it's actually byte-encoded, and metadata like lengths is also stored, as mentioned above):

    Key1:Family1:Qualifier1:Timestamp1:Value1:Put
    
    Key1:Family1:Qualifier2:Timestamp2:Value2:Put
    
  5. The StoreFile is divided into blocks (default 64KB), and the key range contained in each data block is indexed by multi-level indexes. A random lookup inside a single block can be done using index + binary search. However, scans have to proceed serially through a block after locating the starting position in the first block needed for the scan.

  6. HBase is an LSM-tree based database, which means it has an in-memory log (called the Memstore) that is periodically flushed to the filesystem, creating StoreFiles. There is one Memstore per column family per region, shared by all columns in that family.
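To make points 1-2 concrete, here is a minimal Python sketch (not real HBase client code; the region boundaries and server names are made up) of how a row key is routed to a region via the meta table:

```python
import bisect

# Each region serves a half-open key range [start_key, next_start_key).
# meta conceptually holds the sorted region start keys plus the server
# hosting each region.
region_start_keys = ["", "g", "p"]        # three regions: [,g) [g,p) [p,)
region_servers = ["rs1", "rs2", "rs3"]    # hypothetical region servers

def locate_region(row_key):
    """Binary-search the start keys for the region whose range contains
    row_key, much like a client does when routing a request via meta."""
    return bisect.bisect_right(region_start_keys, row_key) - 1
```

With N*M regions for M tables instead of N regions for one table, this lookup structure is M times larger, which is where the extra load on meta comes from.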
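Points 4-5 can likewise be sketched in Python. This is a conceptual model only (the class and field names are mine, and blocks hold two KVs instead of 64KB of data): a StoreFile is a sorted run of KeyValues with a sparse block index; a point lookup binary-searches the index, then scans serially inside one block:

```python
import bisect

class StoreFile:
    """Toy model of a StoreFile: sorted KeyValues plus a sparse block index."""
    BLOCK_SIZE = 2  # KVs per block (stands in for the 64KB default)

    def __init__(self, keyvalues):
        # A KeyValue sorts by (row, family, qualifier, timestamp).
        self.kvs = sorted(keyvalues)
        # Sparse index: first (row, family, qualifier) of each block -> offset.
        self.index = [(self.kvs[i][0:3], i)
                      for i in range(0, len(self.kvs), self.BLOCK_SIZE)]

    def get(self, row, family, qualifier):
        # Locate the candidate block via the index (binary search)...
        keys = [k for k, _ in self.index]
        block = bisect.bisect_right(keys, (row, family, qualifier)) - 1
        start = self.index[max(block, 0)][1]
        # ...then scan serially inside that block, as HBase does.
        for kv in self.kvs[start:start + self.BLOCK_SIZE]:
            if kv[0:3] == (row, family, qualifier):
                return kv[4]  # the value
        return None

sf = StoreFile([
    ("key1", "f1", "q1", 1, "v1"),
    ("key1", "f1", "q2", 2, "v2"),
    ("key2", "f1", "q1", 3, "v3"),
])
```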

There are several optimizations involved when reading/writing data from/to HBase, but the information above holds true conceptually. Given these statements, the following are the pros of each approach:

Single Table with multiple columns

  1. Better on-disk compression due to prefix encoding, since all data for a key is stored together rather than in multiple files across tables. This also results in reduced disk activity due to the smaller data size.
  2. Lower load on the meta table because the total number of regions is smaller: you'll have N regions for just one table rather than N*M regions for M tables. This means faster region lookup and lower contention on the meta table, which is a concern for large clusters.
  3. Faster reads and lower IO amplification (hence less disk activity) when you need to read several columns for a single row key.
  4. You get the advantage of row-level transactions, batching and other performance optimizations when writing to multiple columns for a single row key.
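The compression effect in point 1 can be illustrated with a simplified form of prefix encoding (a sketch only, not HBase's actual PREFIX data block encoding; the key layout and byte costs are made up):

```python
def prefix_encode(sorted_keys):
    """Store each key as (shared-prefix length, remaining suffix) relative
    to the previous key -- the idea behind prefix data block encoding."""
    encoded, prev = [], ""
    for key in sorted_keys:
        common = 0
        while common < min(len(key), len(prev)) and key[common] == prev[common]:
            common += 1
        encoded.append((common, key[common:]))
        prev = key
    return encoded

def encoded_size(encoded):
    # Assume 2 bytes for the prefix length plus the suffix bytes.
    return sum(2 + len(suffix) for _, suffix in encoded)

# 100 columns of one row kept together in a single table's StoreFile:
together = sorted(f"user123/f1/col{i:03d}" for i in range(100))
size_single = encoded_size(prefix_encode(together))

# The same 100 columns spread over 100 tables: each file encodes its one
# key with no neighbour to share a prefix with.
size_multi = sum(encoded_size(prefix_encode([k])) for k in together)
```

Because adjacent keys share the long `user123/f1/col...` prefix, the single-table layout encodes to a fraction of the size.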

When to use this:

  1. If you want to perform row-level transactions across multiple columns, you have to put them in a single table.
  2. Even when you don't need row-level transactions, if you often write to or query multiple columns for the same row key. A good rule of thumb is that if, on average, more than 20% of your columns have values for a single row, you should try to put them together in a single table.
  3. When you have too many columns.
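The 20% rule of thumb in point 2 is easy to check against a sample of your data. A quick sketch (the helper function and sample data are mine, not from HBase):

```python
def avg_fill_ratio(rows, all_columns):
    """Fraction of the column set that a row populates, averaged over rows.
    `rows` is a list of dicts mapping column name -> value."""
    if not rows:
        return 0.0
    return sum(len(r) / len(all_columns) for r in rows) / len(rows)

columns = [f"col{i}" for i in range(10)]
dense_rows = [{c: 1 for c in columns[:4]} for _ in range(5)]   # 40% filled
sparse_rows = [{columns[0]: 1} for _ in range(5)]              # 10% filled

# Per the rule of thumb: the dense data belongs in a single table, while
# the sparse data is a candidate for splitting (or for separate CFs, below).
```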

Multiple Tables

  1. Faster scans per table and lower IO amplification if the scans are mostly concerned with only one column (remember that scans proceed serially through blocks and will unnecessarily read columns they don't need).
  2. Good logical separation of data, especially when you don't need to share row keys across columns. Have one table per type of row key.

When to use:

  1. When there is a clear logical separation of data. For example, if your row key schema differs across different sets of columns, put those sets of columns in separate tables.
  2. When only a small percentage of columns have values for a row key (see below for a better approach).
  3. When you want different storage configs for different sets of columns, e.g. TTL, compaction rate, blocking file counts, memstore size, etc. (see below for a better approach for this use case).
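The scan amplification in point 1 can be demonstrated with a toy scan (a sketch; the data and the counter are made up, and real HBase pays the cost in block reads rather than individual KVs):

```python
def scan_column(kvs, qualifier):
    """Serial scan over a store file's KeyValues. Returns (matching values,
    KVs touched); the second number is the IO the scan pays."""
    touched, hits = 0, []
    for row, fam, qual, value in kvs:
        touched += 1
        if qual == qualifier:
            hits.append(value)
    return hits, touched

# One wide table: 3 columns per row, interleaved in a single sorted file.
wide = [(f"r{i}", "f1", q, f"{q}-{i}") for i in range(10) for q in ("a", "b", "c")]
# A separate narrow table holding only column "a".
narrow = [(f"r{i}", "f1", "a", f"a-{i}") for i in range(10)]

hits_wide, touched_wide = scan_column(sorted(wide), "a")
hits_narrow, touched_narrow = scan_column(sorted(narrow), "a")
```

Both scans return the same values, but the wide table touches three times as many KVs to find them.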

An alternative of sorts: Multiple CFs in single table

As you can see from the above, both approaches have pros. The choice becomes really difficult when you have the same row key structure for several columns (so you want to share the row key for storage efficiency, or you need transactions across columns) but the data is very sparse (meaning you write/read only a small percentage of columns per row key). It seems like you need the best of both worlds in this case, and that's where column families come in. If you can partition your column set into logical subsets, where you mostly access/read/write only a single subset, or you need storage-level configs per subset (like TTL, storage class, a write-heavy compaction schedule, etc.), then you can make each subset a column family. Since the data for a particular column family is stored in a single file (or set of files), you get better locality when reading a subset of columns, without slowing down scans.
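A minimal sketch of that idea (conceptual only; real HBase stores each family in its own directory of HFiles, and the class and sample data here are made up):

```python
class Region:
    """Toy region: one store ("file") per column family, so a read that
    only needs one family touches only that family's store."""

    def __init__(self, families):
        self.stores = {f: {} for f in families}  # family -> {(row, qual): value}
        self.files_opened = 0

    def put(self, row, family, qualifier, value):
        self.stores[family][(row, qualifier)] = value

    def get(self, row, family, qualifier):
        self.files_opened += 1  # touch only this family's store
        return self.stores[family].get((row, qualifier))

r = Region(["hot", "cold"])
r.put("u1", "hot", "last_login", "2018-07-13")
r.put("u1", "cold", "signup_form_blob", "...")

# Reading the frequently accessed column never opens the "cold" store.
value = r.get("u1", "hot", "last_login")
```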

However, there is a catch:

Do not use column families unnecessarily. There is a cost associated with them, and HBase does not do well with 10+ CFs due to how region-level write locks, monitoring, etc. work in HBase. Use CFs only if the columns across CFs are logically related but you don't generally perform operations across CFs, or you need different storage configs for different CFs. It's perfectly fine to use a single CF containing all your columns if you share a row key schema across them, unless you have a very sparse data set, in which case you might need different CFs or different tables based on the points mentioned above.

Ashu Pachauri answered Oct 26 '22