I would really appreciate it if somebody could shed some light on the choice of HBase as the data storage engine for OpenTSDB. Which other choices, such as Whisper (Graphite front-end + Carbon persistence), were considered? How is a column-oriented database such as HBase a better choice for time series data?

Why did OpenTSDB choose HBase for time series data storage?

1 Answer

I chose HBase because it scales. Whisper is much like RRD: it's a fixed-size database, so it must destroy data in order to work within its space constraints. HBase offers the following properties that make it very well suited for large-scale time series databases:

1. Linear scaling. Want to store more data? Add more nodes. At StumbleUpon, where I wrote OpenTSDB, our time series data was co-located on a 20-node cluster that was primarily used for analytics and batch processing. The cluster grew to 120 nodes fairly quickly, and meanwhile OpenTSDB, which made up only a tiny fraction of the cluster's workload, grew to half a trillion data points.
2. Automatic replication. Your data is stored in HDFS, which by default means 3 replicas on 3 different machines. If a machine or a drive dies, it's no big deal. Drives and machines die all the time when you build on commodity servers, but the point is that you don't really care.
3. Efficient scans. Most time series data is used to answer questions like "what are the data points between time X and Y?". If you structure your keys properly, you can implement this very efficiently in HBase with a simple scan operation.
4. High write throughput. The Bigtable design, which HBase follows, uses LSM trees instead of, say, B-trees, to make writes cheaper (at the expense of potentially more expensive reads).
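
To make the LSM trade-off in the last point concrete, here is a minimal toy sketch in Python. It is not HBase's implementation: the `TinyLSM` class, its `memtable_limit` parameter, and the in-memory "runs" standing in for on-disk files are all assumptions made for illustration. Writes only touch a small in-memory table, while reads may have to check several sorted runs.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store (illustrative only, not how HBase is written)."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # recent writes: cheap to update in memory
        self.runs = []      # immutable sorted runs, oldest first

    def put(self, key, value):
        # A write never rewrites existing runs; it only updates the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as one new sorted run (a stand-in for a file).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # A read may need to consult the memtable and every run,
        # newest first, which is where the extra read cost comes from.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed as a sorted run
db.put("a", 3)   # the newer value shadows the one in the older run
print(db.get("a"), db.get("b"), len(db.runs))
```

Real systems also merge runs in the background (compaction) to keep reads from degrading; that step is left out here to keep the sketch short.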

The fact that HBase is column-oriented wasn't nearly as important a consideration as the fact that it's a big sorted key-value system that really scales.
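
As a rough illustration of why a sorted key-value layout suits "points between time X and Y" queries, the sketch below keeps keys in sorted order and answers a time-range query with one contiguous scan. Everything here is hypothetical: the `row_key` layout (metric name, a zero byte, then a big-endian timestamp) is a simplification and not OpenTSDB's actual schema, and `SortedStore` is an in-memory stand-in for a sorted table, not the HBase client API.

```python
import bisect
import struct


def row_key(metric: str, ts: int) -> bytes:
    # Big-endian timestamp: byte-wise key order matches time order per metric.
    return metric.encode() + b"\x00" + struct.pack(">I", ts)


class SortedStore:
    """In-memory stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key: bytes, value) -> None:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def scan(self, start: bytes, stop: bytes):
        # Rows with start <= key < stop are adjacent, so this is one slice.
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return list(zip(self.keys[lo:hi], self.values[lo:hi]))


def query(store: SortedStore, metric: str, t_start: int, t_end: int):
    """Return (timestamp, value) pairs with t_start <= timestamp < t_end."""
    rows = store.scan(row_key(metric, t_start), row_key(metric, t_end))
    return [(struct.unpack(">I", key[-4:])[0], value) for key, value in rows]


store = SortedStore()
for ts in (100, 160, 220, 280):
    store.put(row_key("sys.cpu.user", ts), ts / 10)
store.put(row_key("sys.cpu.sys", 160), 1.0)
print(query(store, "sys.cpu.user", 150, 250))
```

The design point is that the query never touches rows of other metrics or rows outside the time window: the key order does the filtering.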

None of the RRD-based or RRD-derived tools could satisfy the scale requirement: accurately storing billions and billions of data points forever, very cheaply (just a few bytes of actual disk space per data point).
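
The difference between a fixed-size store and one that keeps everything can be sketched in a few lines of Python. This is a deliberately simplified model, not Whisper's or RRD's actual file format: `RoundRobinArchive` and `AppendOnlySeries` are hypothetical names, and a bounded `deque` stands in for a preallocated file with a fixed number of slots.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed number of slots: a new point evicts the oldest one when full."""

    def __init__(self, slots: int):
        self.points = deque(maxlen=slots)

    def add(self, ts: int, value: float) -> None:
        self.points.append((ts, value))


class AppendOnlySeries:
    """Keeps every point; space grows with the data instead of dropping it."""

    def __init__(self):
        self.points = []

    def add(self, ts: int, value: float) -> None:
        self.points.append((ts, value))


rrd_like = RoundRobinArchive(slots=4)
keep_all = AppendOnlySeries()
for ts in range(10):
    rrd_like.add(ts, float(ts))
    keep_all.add(ts, float(ts))

print(len(rrd_like.points), len(keep_all.points))
```

After ten writes the fixed-size archive holds only the four most recent points, while the append-only series still has all ten, which is only sustainable if the storage layer underneath can keep growing.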

answered Sep 26 '22 08:09

tsuna


Tags:

time-series

hbase

opentsdb

Rajan
