Is it possible to have Druid datasource with 2 (or multiple) timestmaps in it? I know that Druid is time-based DB and I have no problem with the concept but I'd like to add another dimension with which I can work as with timestamp
e.g. User retention: Metric surely is specified to a certain date, but I also need to create cohorts based on users date of registration and rollup those dates maybe to a weeks, months or filter to only a certain time periods....
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit float) "string" (UTF-8 encoded strings), and "complex" (catch-all for more exotic data types like hyperUnique and approxHistogram columns).
Adding new data to existing datasources Druid can insert new data to an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set. Druid does not support single-record updates by primary key.
Druid is not a relational database, but some concepts are transferable. Rather than tables, we have datasources. As with relational databases, these are logical groupings of data that are represented as columns. Unlike relational databases, there is no concept of joins.
Although I'd rather wait for official implementation for timestamp dimensions full support in druid to be made, I've found a 'dirty' hack which I've been looking for.
DataSource Schema
First things first, I wanted to know, how much users logged in for each day, with being able to aggregate by date/month/year cohorts
here's the data schema I used:
"dataSchema": {
"dataSource": "ds1",
"parser": {
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"user_id",
"platform",
"register_time"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "DAY",
"intervals": ["2015-01-01/2017-01-01"]
}
},
so the sample data should look something like (each record is login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
as you can see, my "main" timestamp to which I calculate these metrics is timestamp field, where register_time is only the dimension in stringy - ISO 8601 UTC format .
Aggregating
And now, for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to Time Format Extraction Function
Query looking like that:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "timeFormat",
"format": "YYYY-MM-dd",
"timeZone": "Europe/Bratislava" ,
"locale": "sk-SK"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
}
Filtering
Solution for filtering is based on JavaScript Extraction Function with which I can transform date to UNIX time and use it inside (for example) bound filter:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
"platform",
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
"filter": {
"type": "bound",
"dimension": "register_time",
"outputName": "reg_date",
"alphaNumeric": "true"
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
}
I've tried to filter it 'directly' with javascript filter but I haven't been able to convince druid to return the correct records although I've doublecheck it with various JavaScript REPLs, but hey, I'm no JavaScript expert.
Unfortunately Druid has only one time-stamp column that can be used to do rollup plus currently druid treat all the other columns as a strings (except metrics of course) so you can add another string columns with time-stamp values, but the only thing you can do with it is filtering. I guess you might be able to hack it that way. Hopefully in the future druid will allow different type of columns and maybe time-stamp will be one of those.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With