 

Is it possible to have another timestamp as a dimension in Druid?

Tags: druid

Is it possible to have a Druid datasource with 2 (or multiple) timestamps in it? I know that Druid is a time-based DB and I have no problem with the concept, but I'd like to add another dimension that I can work with as a timestamp.

e.g. user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on users' registration dates, roll those dates up to weeks or months, or filter to certain time periods only...

If the functionality is not supported, are there any plug-ins? Any dirty solutions?

Štefan Šoóš asked Jun 24 '16 08:06


People also ask

What are dimensions in Druid?

Druid natively supports five basic column types: "long" (64-bit signed int), "float" (32-bit float), "double" (64-bit float), "string" (UTF-8 encoded strings), and "complex" (a catch-all for more exotic data types like hyperUnique and approxHistogram columns).
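For instance, on Druid versions with typed-dimension support, a secondary timestamp could in principle be stored as a "long" dimension holding epoch millis rather than a string (a sketch only, not from the original answers; register_time_millis is a hypothetical field name):

"dimensionsSpec": {
  "dimensions": [
    "user_id",
    "platform",
    { "type": "long", "name": "register_time_millis" }
  ]
}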

How do I update my Druid data?

Druid can insert new data into an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set. Druid does not support single-record updates by primary key.
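As a concrete illustration, appending is toggled in the ioConfig of a native batch index task (a minimal sketch; the local firehose paths are made-up and exact fields vary by Druid version):

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/data/events",
    "filter": "*.json"
  },
  "appendToExisting": true
}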

Is Druid relational?

Druid is not a relational database, but some concepts are transferable. Rather than tables, we have datasources. As with relational databases, these are logical groupings of data that are represented as columns. Unlike relational databases, there is no concept of joins.


2 Answers

Although I'd rather wait for an official implementation of full timestamp-dimension support in Druid, I've found the 'dirty' hack I've been looking for.

DataSource Schema

First things first: I wanted to know how many users logged in each day, while being able to aggregate by date/month/year cohorts.

Here's the data schema I used:

"dataSchema": {
  "dataSource": "ds1",
  "parser": {
    "parseSpec": {
      "format": "json",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
            "user_id",
            "platform",
            "register_time"
        ],
        "dimensionExclusions": [],
        "spatialDimensions": []
      }
    }
  },
  "metricsSpec": [
    { "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "DAY",
          "intervals": ["2015-01-01/2017-01-01"]
  }
},

So the sample data should look something like this (each record is a login event):

{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}

As you can see, the "main" timestamp against which I calculate these metrics is the timestamp field, while register_time is just a dimension stored as a string in ISO 8601 UTC format.

Aggregating

And now for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the Time Format Extraction Function.

The query looks like this:

{
    "intervals": "2016-01-20/2016-07-01",
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "register_time",
            "outputName": "reg_date",
            "extractionFn": {
                "type": "timeFormat",
                "format": "YYYY-MM-dd",
                "timeZone": "Europe/Bratislava" ,
                "locale": "sk-SK"
            }
        }
    ],
    "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
    "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
    "dataSource": "ds1",
    "queryType": "groupBy"
}
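To roll the registration cohort up to months instead of days (as the question asks), it should be enough to give the same timeFormat extraction a coarser Joda pattern; a sketch of the modified dimension entry (reg_month is a name I've chosen here):

{
    "type": "extraction",
    "dimension": "register_time",
    "outputName": "reg_month",
    "extractionFn": {
        "type": "timeFormat",
        "format": "yyyy-MM",
        "timeZone": "Europe/Bratislava",
        "locale": "sk-SK"
    }
}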

Filtering

The solution for filtering is based on the JavaScript Extraction Function, with which I can transform the date to UNIX time and use it inside (for example) a bound filter (the lower bound below is an illustrative epoch-seconds value for 2016-01-01):

{
    "intervals": "2016-01-20/2016-07-01",
    "dimensions": [
        "platform",
        {
            "type": "extraction",
            "dimension": "register_time",
            "outputName": "reg_date",
            "extractionFn": {
                "type": "javascript",
                "function": "function(x) {return Date.parse(x)/1000}"
            }
        }
    ],
    "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
    "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
    "dataSource": "ds1",
    "queryType": "groupBy",
    "filter": {
        "type": "bound",
        "dimension": "register_time",
        "lower": "1451606400",
        "alphaNumeric": true,
        "extractionFn": {
            "type": "javascript",
            "function": "function(x) {return Date.parse(x)/1000}"
        }
    }
}
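Note that on newer Druid versions (0.9.2+), the bound filter's "alphaNumeric" flag has been superseded by "ordering"; the same filter would presumably be written like this (same illustrative lower bound):

"filter": {
    "type": "bound",
    "dimension": "register_time",
    "lower": "1451606400",
    "ordering": "numeric",
    "extractionFn": {
        "type": "javascript",
        "function": "function(x) {return Date.parse(x)/1000}"
    }
}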

I've tried to filter 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I've double-checked the function in various JavaScript REPLs. But hey, I'm no JavaScript expert.

Štefan Šoóš answered Oct 09 '22 03:10


Unfortunately, Druid has only one timestamp column that can be used for rollup. Currently, Druid treats all the other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filtering. I guess you might be able to hack it that way. Hopefully, in the future, Druid will allow different column types, and maybe timestamp will be one of them.

Slim Bouguerra answered Oct 09 '22 01:10