Because HBase tables are sparse, HBase stores for every cell not only the value but also all the information required to identify that cell (often described as the Key, not to be confused with the RowKey). The Key looks as follows:
RowKey-ColumnFamily-ColumnQualifier-Timestamp
All of this information is stored with every entry. That is why short names are recommended for Column Families and Column Qualifiers: they reduce the per-entry overhead.
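To make the overhead concrete, here is a rough sketch of the per-cell size. It assumes the classic KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and ignores tags, block encoding and compression, so treat the numbers as an approximation. The function name and the sample row, family and qualifier names are made up for illustration.

```python
# Approximate on-disk size of one HBase cell, assuming the classic
# KeyValue layout and no tags, data block encoding, or compression.
FIXED_OVERHEAD = (
    4    # key length
    + 4  # value length
    + 2  # row length
    + 1  # column family length
    + 8  # timestamp
    + 1  # key type (Put, Delete, ...)
)


def approx_cell_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Return the approximate number of bytes stored for a single cell."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


# Hypothetical names: the same value stored under short and long names.
short_size = approx_cell_size(b"user000123", b"d", b"n", b"Ada")
long_size = approx_cell_size(b"user000123", b"user_details", b"first_name", b"Ada")

print(short_size, long_size, long_size - short_size)
```

The difference between the two sizes is paid once per stored cell, which is where the short-name recommendation comes from.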
My question: why does the ColumnFamily need to be stored for every entry? From my understanding, every Store File belongs to exactly one Column Family. Wouldn't it be enough to store the Column Family name once per Store File? This would reduce overhead, allow arbitrary Column Family names, and still let us identify the Column Family of every entry. What am I missing here?
As in a relational database, tables in HBase consist of rows and columns. In HBase, the columns are grouped together in column families. This grouping is expressed logically as a layer in the map of maps. Column families are also expressed physically: each column family gets its own set of HFiles on disk. This physical isolation allows the underlying HFiles of one column family to be managed in isolation from the others. As far as compactions are concerned, the HFiles for each column family are managed independently.
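The "map of maps" and the per-family grouping can be sketched with plain nested dictionaries. This is only a toy model of the logical view, not how HBase is implemented; the table contents and the helper name `cells_by_family` are invented for the example.

```python
from collections import defaultdict

# Logical view: row key -> column family -> qualifier -> timestamp -> value
table = {
    "row1": {
        "d": {"name": {1700000000000: b"Ada"}},
        "m": {"visits": {1700000000500: b"3", 1700000000000: b"2"}},
    },
    "row2": {
        "d": {"name": {1700000000100: b"Grace"}},
    },
}


def cells_by_family(table):
    """Flatten the nested maps into one sorted cell list per column family.

    Each list stands in for the separate set of files that a column
    family gets on disk; every cell still carries its full key.
    """
    stores = defaultdict(list)
    for row, families in table.items():
        for family, qualifiers in families.items():
            for qualifier, versions in qualifiers.items():
                for timestamp, value in versions.items():
                    stores[family].append((row, family, qualifier, timestamp, value))
    for cells in stores.values():
        # Order by row, then qualifier, then newest timestamp first.
        cells.sort(key=lambda c: (c[0], c[2], -c[3]))
    return dict(stores)


stores = cells_by_family(table)
for family, cells in sorted(stores.items()):
    print(family, cells)
```

Note that in this toy model, as in the Key described in the question, the family name appears inside every cell even though each list only ever holds cells of a single family.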