I have been reading up on Hadoop and HBase lately, and came across this description:
HBase is an open-source, distributed, sparse, column-oriented store...
What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more about it.
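To check whether a matrix is sparse, count the elements that are equal to zero and compare that count with the total number of elements. A common rule of thumb treats a matrix as sparse when more than half of its elements are zero.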
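As a rough illustration of that check, here is a minimal Python sketch. The function name `is_sparse` and the default 0.5 threshold are assumptions made for this example (the "more than half zeros" rule of thumb), not a formal definition.

```python
def is_sparse(matrix, threshold=0.5):
    """Return True when the fraction of zero entries exceeds `threshold`.

    `matrix` is a list of rows; `threshold` is an assumed cut-off,
    not a formal definition of sparsity.
    """
    total = sum(len(row) for row in matrix)
    if total == 0:
        return False  # an empty matrix has nothing to classify
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total > threshold


mostly_zero = [[0, 0, 1],
               [0, 0, 0],
               [2, 0, 0]]
mostly_full = [[1, 2],
               [3, 0]]
print(is_sparse(mostly_zero), is_sparse(mostly_full))
```

Raising or lowering `threshold` changes how strict the check is; the right value depends on what the storage format gains from skipping zeros.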
HBase is built to be a fault-tolerant application hosting a few large tables of sparse data (billions/trillions of rows by millions of columns), while allowing for very low latency and near real-time random reads and random writes.
Sparsity and density are terms used to describe the percentage of cells in a database table that are not populated and populated, respectively. The sum of the sparsity and density should equal 100%. A table that is 10% dense has 10% of its cells populated with non-zero values.
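To make the arithmetic concrete, here is a small Python sketch that computes both figures for a table held as a list of rows. Treating `None` as "not populated" is an assumption of this example; the helper names `density` and `sparsity` are hypothetical.

```python
def density(table):
    """Fraction of cells that are populated (assumed: anything except None)."""
    cells = [cell for row in table for cell in row]
    populated = sum(1 for cell in cells if cell is not None)
    return populated / len(cells)


def sparsity(table):
    """Fraction of cells that are not populated; density + sparsity == 1."""
    return 1.0 - density(table)


# Two rows by five columns with a single populated cell: 10% dense.
table = [
    [None, None, 42, None, None],
    [None, None, None, None, None],
]
print(f"density={density(table):.0%} sparsity={sparsity(table):.0%}")
```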
In data analysis, sparse data is a variable in which most cells do not contain actual data: the cells are empty or hold a zero value. Sparse data is different from missing data because a sparse cell shows up as empty or zero, while missing data gives no indication of what some or all of the values are.
In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).
This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
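The benefit of fixed-length rows can be sketched in Python with the standard `struct` module. The three-column layout below (an integer id, a 16-byte name, a float score) is a made-up example, and real database engines use far more elaborate page formats; the point is only that every row takes the same number of bytes, so row `n` starts at offset `n * row_size`, even when some fields hold no meaningful value.

```python
import struct

# Hypothetical row layout: 4-byte int id, 16-byte name, 8-byte float score.
# "<" selects little-endian with no padding, so every row is 28 bytes.
ROW = struct.Struct("<i16sd")


def pack_row(row_id, name=b"", score=0.0):
    """Serialize one row; unset fields still occupy their full width."""
    return ROW.pack(row_id, name, score)


def read_row(buffer, index):
    """Jump straight to row `index` without scanning the earlier rows."""
    return ROW.unpack_from(buffer, index * ROW.size)


# The second row has no name or score, yet it takes the same space.
data = pack_row(1, b"alice", 9.5) + pack_row(2)
print(ROW.size, len(data), read_row(data, 1)[0])
```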
Sparse columns will incur a performance penalty and are unlikely to save you much disk space because the space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained pointer architecture typically used to implement very large non-contiguous storage.
Storage is cheap. Performance isn't.
At the storage level, all data is stored as key-value pairs. Each storage file contains an index so that it knows where each key-value starts and how long it is. Every key-value carries its full key (row key, column family, column qualifier and timestamp), so a cell that has no value is simply not stored.
As a consequence, if you have very long keys (e.g. a full URL) and a lot of columns associated with each key, you could be wasting some space, because the key is repeated for every stored cell. Turning compression on ameliorates this somewhat.
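The trade-off can be sketched with a toy in-memory model in Python. This is not the HBase API or its file format: the class `SparseStore`, its methods, and the example URL are all hypothetical, and the model ignores column families, timestamps and compression. It only shows that absent cells cost nothing, while the row key is paid for once per stored cell.

```python
class SparseStore:
    """Toy key-value model: one entry per populated (row, column) cell."""

    def __init__(self):
        self._cells = {}

    def put(self, row_key, column, value):
        self._cells[(row_key, column)] = value

    def get(self, row_key, column):
        # A cell that was never written is simply absent; no space is reserved.
        return self._cells.get((row_key, column))

    def cell_count(self):
        return len(self._cells)

    def key_bytes(self):
        """Bytes spent on keys alone: the row key repeats in every cell."""
        return sum(len(row.encode()) + len(col.encode())
                   for row, col in self._cells)


store = SparseStore()
url = "http://example.com/a/very/long/path"  # hypothetical long row key
store.put(url, "title", "Example")
store.put(url, "lang", "en")
print(store.cell_count(), store.get(url, "author"), store.key_bytes())
```

With thousands of columns per row, `key_bytes()` grows by the full length of the row key for each one, which is the overhead that short keys or compression would reduce.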
See http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html for more information on HBase storage.