What is meant by sparse data/ datastore/ database?

Tags: database, database-schema, hadoop, hbase, sparse-matrix

I have been reading up on Hadoop and HBase lately and came across this description:

HBase is an open-source, distributed, sparse, column-oriented store...

What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and I would like to know more about it.

asked Jul 05 '11 18:07

Jai

2 Answers

In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).

This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
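To illustrate why fixed-length rows are fast, here is a minimal Python sketch. The three-column layout and the one-byte NULL flag field are assumptions made for the example, not how any particular database lays out its pages. The point is only that row n always starts at n times the row size, and that a mostly empty row occupies the same space as a full one.

```python
import struct

# Hypothetical fixed-width layout: id (int32), age (int32), score (float64),
# plus one byte of NULL flags. "<" disables padding, so the size is exact.
ROW = struct.Struct("<iidB")

def pack_row(row_id, age, score):
    """Pack one row; None becomes a zero placeholder plus a NULL flag bit."""
    flags = (1 if age is None else 0) | (2 if score is None else 0)
    return ROW.pack(row_id, age or 0, score or 0.0, flags)

def read_row(buf, n):
    """Jump straight to row n: its offset is simply n * ROW.size."""
    row_id, age, score, flags = ROW.unpack_from(buf, n * ROW.size)
    return (row_id,
            None if flags & 1 else age,
            None if flags & 2 else score)

table = b"".join([
    pack_row(1, 30, 4.5),
    pack_row(2, None, None),   # mostly empty, yet the same size as any other row
    pack_row(3, 41, None),
])

print(ROW.size, len(table))
print(read_row(table, 1))
```

No scan is needed to find a row, which is the read and write advantage described above; the cost is that empty fields still take up their full width.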

Sparse columns incur a performance penalty and are unlikely to save you much disk space, because the space required to indicate NULL is smaller than the 64-bit pointer required by the linked-list style of chained pointers typically used to implement very large non-contiguous storage.

Storage is cheap. Performance isn't.

answered Oct 19 '22 08:10

Peter Wone

At the storage level, all data is stored as key-value pairs. Each storage file contains an index that records where each key-value starts and how long it is.
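A toy Python model of this idea is sketched below. It is not HBase's actual file format: the names write_store and get, the in-memory dict used as the index, and the sample rows are all made up for illustration. It shows the part that matters for "sparse": only cells that exist are written, and the index records the offset and length of each one.

```python
def write_store(rows):
    """Serialize only the cells that exist; return (data, index)."""
    data = bytearray()
    index = {}  # (row_key, column) -> (offset, length) of the value bytes
    for row_key in sorted(rows):
        for column in sorted(rows[row_key]):
            value = rows[row_key][column].encode("utf-8")
            index[(row_key, column)] = (len(data), len(value))
            data += value
    return bytes(data), index

def get(data, index, row_key, column):
    """Look up one cell; a missing cell simply has no index entry."""
    entry = index.get((row_key, column))
    if entry is None:
        return None
    offset, length = entry
    return data[offset:offset + length].decode("utf-8")

rows = {
    "row1": {"info:name": "Ada", "info:email": "ada@example.com"},
    "row2": {"info:name": "Bob"},   # no email cell is stored at all
}
data, index = write_store(rows)

print(get(data, index, "row2", "info:name"))
print(get(data, index, "row2", "info:email"))
```

In this model a row with a thousand possible columns but only two values costs two entries, not a thousand, which is the sense in which such a store handles sparse data cheaply.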

As a consequence, if you have very long keys (e.g. a full URL) and many columns associated with each key, you could be wasting some space, because the full key is stored again with every key-value. Turning compression on ameliorates this somewhat.
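The effect can be estimated with a rough Python sketch. The URL, the column names, and the cell encoding here are invented for the example, and zlib merely stands in for whichever compression codec is actually configured.

```python
import zlib

# Hypothetical long row key and 50 small columns attached to it.
row_key = b"http://www.example.com/some/very/long/path/to/a/page.html"
columns = [b"c%d" % i for i in range(50)]
value = b"x"

# Each cell repeats the full row key next to its column name and value.
cells = b"".join(row_key + b"/" + col + b"=" + value + b"\n" for col in columns)

key_bytes = len(row_key) * len(columns)   # bytes spent on repeated keys
compressed = zlib.compress(cells)

print(f"raw size:        {len(cells)} bytes")
print(f"row-key bytes:   {key_bytes} bytes")
print(f"compressed size: {len(compressed)} bytes")
```

Most of the raw bytes are copies of the same key, and repeated byte runs are exactly what a general-purpose compressor removes well, so the compressed size should come out much smaller than the raw size.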

See http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html for more information on HBase storage.

answered Oct 19 '22 09:10

David
