Streaming event data with dynamically changing schema into ClickHouse

Tags: clickhouse

I am evaluating ClickHouse for storing streaming events with a variable event schema. The event schema is nested JSON, and new properties can be of different data types, so we cannot flatten everything into a single normalized key-value pair.

Based on the documentation, it looks like whenever the schema changes we have to explicitly alter the table schema and then insert the new record. Since the inserts happen from multiple source workers, adding an IF NOT EXISTS clause can help minimize race conditions.
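For example, something like this (the table and column names are hypothetical, just to illustrate the guard):

-- Concurrent-safe schema change: IF NOT EXISTS makes ADD COLUMN a no-op
-- when another worker has already added the column.
ALTER TABLE events ADD COLUMN IF NOT EXISTS new_prop String;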

Can anyone advise whether there is a better way to keep inserting events with dynamically changing schemas into ClickHouse? Is there any way to have ClickHouse infer the updated schema from the event being inserted?

asked Nov 21 '25 by KVK

2 Answers

Each column requires about 2 MB of RAM on insert, so a 1000-column table will eat about 2 GB on every insert.

Each column is also roughly 3 files on disk, which means about 3 extra random reads/writes per insert.

ALTER TABLE can be blocked by merges/selects and may take hours to execute.

I would pick out the MAIN columns (the ones used most often in WHERE) from the JSON and put them into dedicated columns. Put everything else into a key/value pair of two columns: key Array(String), value Array(String). Otherwise, use MongoDB.
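A minimal sketch of that layout, with a few made-up main columns purely for illustration:

CREATE TABLE events_kv
(
    -- dedicated columns for the fields that appear in WHERE most often
    event_id UInt64,
    event_time DateTime,
    user_id UInt64,
    -- everything else goes into parallel key/value arrays
    key Array(String),
    value Array(String)
)
ENGINE = MergeTree()
ORDER BY (user_id, event_time);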

answered Nov 22 '25 by Denny Crane

What about using a fixed schema in ClickHouse that can support any type in your events, for example using Nested columns or JSON columns? I tried this for a PoC using Nested columns, with good results. First I identified four different data types: strings, numbers, booleans and dates.

The ClickHouse schema would look like this:

CREATE TABLE custom_events
(
    -- ... normal columns needed depending on your use case
    attribute_string Nested
    (
        name String,
        value String
    ),
    attribute_number Nested
    (
        name String,
        value Float32
    ),
    attribute_boolean Nested
    (
        name String,
        value UInt8
    ),
    attribute_date Nested
    (
        name String,
        value DateTime('UTC')
    )
)
ENGINE = MergeTree()
ORDER BY (
   ...
);
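To see the physical layout, DESCRIBE TABLE is handy; with the default flatten_nested = 1 setting, the output looks roughly like this (a sketch, abbreviated to the attribute columns):

DESCRIBE TABLE custom_events;

-- attribute_string.name     Array(String)
-- attribute_string.value    Array(String)
-- attribute_number.name     Array(String)
-- attribute_number.value    Array(Float32)
-- ... and likewise for attribute_boolean and attribute_date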

As the DESCRIBE sketch shows, ClickHouse converts each Nested column into two columns of Array type. To use this schema, you need to group the event's fields by type and flatten them before inserting into ClickHouse. For example, take the following JSON event:

{
    "event_id": 1,
    "event_date": "2018-08-02 10:06:23",
    "data": {
        "rides": [
            {
                "km": 93,
                "vehicle": "plane",
                "people": 2.15,
                "finished": true,
                "date": "2019-06-24T18:52:34"
            }
        ]
    }
}

It could be flattened like this:

{
    "event_id": 1,
    "event_date": "2018-08-02 10:06:23",
    "data.rides.0.km": 93,
    "data.rides.0.vehicle": "plane",
    "data.rides.0.people": 2.15,
    "data.rides.0.finished": true,
    "data.rides.0.date": "2019-06-24T18:52:34"
}

And then you can insert into ClickHouse like so:

INSERT INTO custom_events VALUES (1, '2018-08-02 10:06:23',
    ['data.rides.0.vehicle'], ['plane'],                     -- name/value pair for strings
    ['data.rides.0.km', 'data.rides.0.people'], [93, 2.15], -- name/value pair for numbers
    ...                                                      -- do the same for each data type
)
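Spelled out for all four types, a full insert for the example event could look like this (assuming event_id and event_date are the only normal columns in the table):

INSERT INTO custom_events VALUES (
    1, '2018-08-02 10:06:23',
    ['data.rides.0.vehicle'], ['plane'],                     -- strings
    ['data.rides.0.km', 'data.rides.0.people'], [93, 2.15],  -- numbers
    ['data.rides.0.finished'], [1],                          -- booleans as UInt8
    ['data.rides.0.date'], ['2019-06-24 18:52:34']           -- dates, 'T' separator normalized
);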

Then you can query the data, leveraging all the great higher-order functions provided by ClickHouse:

SELECT
    arrayFilter(x -> ((x.1) = 1), arrayMap((x, y) -> (x LIKE 'data.rides.%.km', y), attribute_number.name, attribute_number.value)).2 AS element,
    event_date
FROM custom_events
WHERE (event_id = 1) AND notEmpty(element)

The query doesn't make sense as-is because I redacted the fields a bit, but you get the idea. This way you don't need to modify your schema, and you can store any arbitrary JSON with the same fixed schema in ClickHouse.
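For the concrete example above, a lookup can also be done without higher-order functions; here is a sketch using indexOf and has over the parallel arrays (the has guard avoids reading a default value when the key is absent):

SELECT
    attribute_number.value[indexOf(attribute_number.name, 'data.rides.0.km')] AS km,
    event_date
FROM custom_events
WHERE (event_id = 1) AND has(attribute_number.name, 'data.rides.0.km');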

answered Nov 22 '25 by lloiacono