Using HBase to store time series data

We are trying to use HBase to store time-series data. The model we have currently stores the time-series as versions within a cell. This implies that the cell could end up storing millions of versions, and the queries on this time-series would retrieve a range of versions using the setTimeRange method available on the Get class in HBase.

e.g.

{
    "row1" : {
        "columnFamily1" : {
            "column1" : {
                1 : "1",
                2 : "2"
            },
            "column2" : {
                1 : "1"
            }
        }
    }
}

Is this a reasonable model to store time-series data in HBase?

Would an alternative model be more suitable, such as storing the data in multiple columns (is it possible to query across columns?) or in multiple rows?

asked Nov 08 '10 17:11

gurrie

1 Answer

I don't think you should use versioning to store the time series here. Not because it won't work, but because it's not designed for that particular use case and there are other ways.

I suggest using the time step as the column qualifier and the data itself as the cell value. Something like:

{
    "row1" : {
        "columnFamily1" : {
            "col1-000001" : "1",
            "col1-000002" : "2",
            "col1-000003" : "91",
            "col2-000001" : "31"
        }
    }
}

One nice thing here is that HBase stores column qualifiers in sorted (lexicographic byte) order, so as long as the time steps are zero-padded to a fixed width, as above, you should see the items in time order when reading the series back.
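To illustrate the ordering, here is a minimal Python sketch. It is not HBase client code: it only mimics lexicographic byte ordering by calling `sorted()` on byte strings. The helper name `qualifier` is hypothetical, and the six-digit width is an assumption taken from the example above. It compares zero-padded qualifiers with unpadded ones.

```python
def qualifier(column: str, step: int, width: int = 6) -> bytes:
    """Build a qualifier such as b'col1-000002' (hypothetical helper)."""
    return f"{column}-{step:0{width}d}".encode("ascii")


# Time steps deliberately listed out of order.
steps = [10, 2, 1, 100]

# Byte strings compare lexicographically, which is the ordering being mimicked.
padded = sorted(qualifier("col1", s) for s in steps)
unpadded = sorted(f"col1-{s}".encode("ascii") for s in steps)

print(padded)    # zero-padded: byte order matches numeric order
print(unpadded)  # unpadded: step 2 sorts after steps 10 and 100
```

The fixed width has to be chosen up front to cover the largest time step you expect, since widening it later changes how existing qualifiers sort relative to new ones.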

Another realistic option would be to have the identifier for the record as the first part of the rowkey, but then have the time step in the rowkey as well. Something like:

{
    "fooseries-00001" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "fooseries-00002" : {
        "columnFamily1" : {
            "val" : "2"
        }
    }
}

This has the nice feature that range scans within a particular series are easy. For example, pulling out fooseries's steps 104 to 199 is trivial to implement and efficient, because those rows sit next to each other in key order.
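As a rough sketch of that range scan, the Python below simulates a sorted table with a plain list and `bisect`. It does not use an HBase client. The names `row_key` and `scan` are hypothetical, the five-digit width follows the example row keys above, and the half-open range (start row inclusive, stop row exclusive) is modelled on how HBase scans treat their bounds.

```python
from bisect import bisect_left


def row_key(series: str, step: int, width: int = 5) -> bytes:
    """Build a row key such as b'fooseries-00104' (hypothetical helper)."""
    return f"{series}-{step:0{width}d}".encode("ascii")


def scan(sorted_keys, start_row, stop_row):
    """Return the keys in [start_row, stop_row), i.e. stop row excluded."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Toy table: two series with 300 steps each, kept in sorted key order.
table = sorted(
    row_key(name, step)
    for name in ("barseries", "fooseries")
    for step in range(1, 301)
)

# Steps 104 to 199 inclusive: the stop row is the key for step 200.
result = scan(table, row_key("fooseries", 104), row_key("fooseries", 200))
print(len(result), result[0], result[-1])
```

Because the stop row is excluded, the upper bound is the key for the step after the last one you want.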

The downside to this one is that deleting an entire series requires more management and synchronization, since the series is spread across many rows. Another downside is that MapReduce analytics will have a harder time analyzing this data. With the first approach (time steps as column qualifiers), the entire time series is passed to one map() call, while here map() is called once for each frame.
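The MapReduce difference can be sketched with plain dictionaries standing in for the two table layouts. This is an illustration only, assuming a job that hands each row to map() once. The names `run_map` and `regroup` are hypothetical, and `regroup` stands in for the extra shuffle/reduce step the row-per-step layout needs before a whole series can be analyzed together.

```python
from collections import defaultdict

# Layout A: one row per series, time steps as column qualifiers.
wide = {
    b"fooseries": {b"val-00001": b"1", b"val-00002": b"2", b"val-00003": b"91"},
}

# Layout B: one row per (series, time step).
tall = {
    b"fooseries-00001": {b"val": b"1"},
    b"fooseries-00002": {b"val": b"2"},
    b"fooseries-00003": {b"val": b"91"},
}


def run_map(table):
    """Record one stand-in map() call per row: (row key, number of cells seen)."""
    return [(key, len(columns)) for key, columns in sorted(table.items())]


def regroup(table):
    """Reassemble whole series from layout B by splitting each row key."""
    series = defaultdict(list)
    for key, columns in sorted(table.items()):
        name, _, step = key.rpartition(b"-")
        series[name].append((int(step.decode("ascii")), columns[b"val"]))
    return dict(series)


wide_calls = run_map(wide)   # one call that sees the whole series
tall_calls = run_map(tall)   # one call per time step
regrouped = regroup(tall)    # the series put back together

print(wide_calls)
print(tall_calls)
print(regrouped)
```

The same key-prefix logic applies to deletes in the row-per-step layout: removing a series means finding and deleting every row that shares its prefix.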

answered Oct 16 '22 11:10

Donald Miner

Tags: hadoop, hbase, opentsdb

gurrie

People also ask

1 Answers

Donald Miner

Recent Activity

Donate For Us