We are trying to use HBase to store time-series data. The model we have currently stores the time-series as versions within a cell. This implies that the cell could end up storing millions of versions, and the queries on this time-series would retrieve a range of versions using the setTimeRange method available on the Get class in HBase.
e.g.
{
  "row1" : {
    "columnFamily1" : {
      "column1" : {
        1 : "1",
        2 : "2"
      },
      "column2" : {
        1 : "1"
      }
    }
  }
}
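For concreteness, a read in this model would look roughly like the following sketch against the standard HBase Java client (the table name, family/column names, and the time range are placeholders):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("timeseries"))) {

            Get get = new Get(Bytes.toBytes("row1"));
            get.addColumn(Bytes.toBytes("columnFamily1"), Bytes.toBytes("column1"));
            // Pull every version whose timestamp falls in [1, 100).
            get.setTimeRange(1L, 100L);
            get.setMaxVersions(Integer.MAX_VALUE); // readVersions(...) on HBase 2.x clients

            Result result = table.get(get);
            List<Cell> cells = result.getColumnCells(
                    Bytes.toBytes("columnFamily1"), Bytes.toBytes("column1"));
            for (Cell cell : cells) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}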
Is this a reasonable model to store time-series data in HBase?
Is the alternative model of storing the data in multiple columns (and is it even possible to query across columns?) or in multiple rows more suitable?
I don't think you should use versioning to store the time series here. Not because it won't work, but because it's not designed for that particular use case and there are other ways.
I suggest you store the time series with the time step as the column qualifier and the data itself as the cell value. Something like:
{
  "row1" : {
    "columnFamily1" : {
      "col1-000001" : "1",
      "col1-000002" : "2",
      "col1-000003" : "91",
      "col2-000001" : "31"
    }
  }
}
One nice thing here is that HBase stores the column qualifiers in sorted order, so when reading the time series back you should see the items in order.
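As a rough sketch of what this looks like with the standard Java client (the zero-padding scheme, table handle, and helper method names here are my own assumptions, not part of the original suggestion):

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifierPerStep {
    static void writeStep(Table table, long step, byte[] value) throws IOException {
        // Zero-pad the step so qualifiers sort lexicographically in time order.
        String qualifier = String.format("col1-%06d", step);
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("columnFamily1"), Bytes.toBytes(qualifier), value);
        table.put(put);
    }

    static void readRange(Table table, long fromStep, long toStep) throws IOException {
        // ColumnRangeFilter restricts the read to qualifiers in [from, to], inclusive here.
        Get get = new Get(Bytes.toBytes("row1"));
        get.addFamily(Bytes.toBytes("columnFamily1"));
        get.setFilter(new ColumnRangeFilter(
                Bytes.toBytes(String.format("col1-%06d", fromStep)), true,
                Bytes.toBytes(String.format("col1-%06d", toStep)), true));
        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell)) + " = "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}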
Another realistic option would be to have the identifier for the record as the first part of the rowkey, but then have the time step in the rowkey as well. Something like:
{
  "fooseries-00001" : {
    "columnFamily1" : {
      "val" : "1"
    }
  },
  "fooseries-00002" : {
    "columnFamily1" : {
      "val" : "2"
    }
  }
}
This has the nice feature that range scans within a particular series are pretty easy. For example, pulling out fooseries's steps 104 to 199 is going to be trivial to implement and efficient.
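A sketch of that range scan with a recent HBase Java client (withStartRow/withStopRow; older clients use the deprecated setStartRow/setStopRow). The 5-digit padding and the exclusive stop row are assumptions made to match the example keys above:

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowPerStep {
    // Pull steps 104..199 of "fooseries" with a plain row-range scan.
    static void scanRange(Table table) throws IOException {
        Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("fooseries-00104"))   // inclusive
                .withStopRow(Bytes.toBytes("fooseries-00200"));   // exclusive
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                Cell cell = row.getColumnLatestCell(
                        Bytes.toBytes("columnFamily1"), Bytes.toBytes("val"));
                System.out.println(Bytes.toString(row.getRow()) + " = "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}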
The downside to this one is that deleting an entire series is going to require a bit more management and synchronization. Another downside is that MapReduce analytics are going to have a hard time doing any sort of analysis on this data. With the above approach, the entire time series will be passed to one map() call, while here, map() will be called for each frame.
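To make that granularity point concrete, here is a minimal TableMapper sketch (my illustration, not part of the original answer): with the row-per-step layout, each map() call receives a single Result, i.e. one frame, whereas the column-per-step layout hands the whole series to one call.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class FrameMapper extends TableMapper<Text, LongWritable> {
    // With row-per-step, each invocation sees one frame; with column-per-step,
    // a single invocation would see the whole series inside one Result.
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // Emit the row key and the number of cells in this row.
        context.write(new Text(rowKey.copyBytes()), new LongWritable(row.size()));
    }
}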