
What is meant by sparse data/ datastore/ database?

I have been reading up on Hadoop and HBase lately and came across this description:

HBase is an open-source, distributed, sparse, column-oriented store...

What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the kind of data the store can hold efficiently, and I would like to know more about it.

Jai asked Jul 05 '11 18:07




2 Answers

In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).

This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
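To illustrate why fixed-length rows help, here is a minimal Python sketch. The three-column layout, `ROW`, `pack_rows`, and `read_row` are hypothetical names made up for this example, and storing a missing value as 0 is a simplification (a real engine would track NULLs separately). The point is only that every row occupies the same number of bytes, so row `i` can be found by arithmetic rather than by scanning.

```python
import struct

# Hypothetical fixed-width layout: id (int32), age (int32), score (float64).
# "<" means little-endian with no padding, so each row is 4 + 4 + 8 = 16 bytes.
ROW = struct.Struct("<iid")

def pack_rows(rows):
    """Pack rows into one contiguous buffer.

    A missing value (None) is written as 0 and still takes up its full slot.
    """
    buf = bytearray()
    for row in rows:
        buf += ROW.pack(*(0 if v is None else v for v in row))
    return bytes(buf)

def read_row(buf, i):
    """Read row i directly: its offset is i * ROW.size, no scan required."""
    return ROW.unpack_from(buf, i * ROW.size)

rows = [(1, 30, 9.5), (2, None, None), (3, 41, 7.25)]
buf = pack_rows(rows)
print(len(buf), read_row(buf, 2))
```

Note that the second row, which has only an id, costs exactly as much space as the fully populated rows.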

Sparse columns will incur a performance penalty and are unlikely to save you much disk space. The space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained-pointer architecture typically used to implement very large non-contiguous storage.

Storage is cheap. Performance isn't.

Peter Wone answered Oct 19 '22 08:10



At the storage level, all data is stored as key-value pairs. Each storage file contains an index so that it knows where each key-value pair starts and how long it is.

As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
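As a rough sketch of that trade-off in Python: the `cells` list, the column names, and the helpers `stored_bytes` and `key_bytes` are hypothetical, and the model leaves out the other per-cell metadata that a real storage file carries. It only shows that absent columns cost nothing, while every stored cell repeats its row key.

```python
# Hypothetical cells: (row_key, column, value).
# Columns with no value are simply not stored, which is the "sparse" part.
cells = [
    ("http://example.com/a/very/long/url", "meta:title", "Home"),
    ("http://example.com/a/very/long/url", "meta:lang", "en"),
    ("http://example.com/a/very/long/url", "stats:hits", "42"),
]

def stored_bytes(cells):
    """Rough total size when every cell carries its own copy of the row key."""
    return sum(len(key) + len(col) + len(val) for key, col, val in cells)

def key_bytes(cells):
    """Bytes spent on the repeated row keys alone."""
    return sum(len(key) for key, _, _ in cells)

total = stored_bytes(cells)
keys = key_bytes(cells)
print(f"{keys} of {total} bytes are row-key copies")
```

With a long key and short values, most of the stored bytes are copies of the key, which is the kind of repetition compression handles well.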

See http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html for more information on HBase storage.

David answered Oct 19 '22 09:10
