I am evaluating ClickHouse for storing streaming events with a variable event schema. The event schema is nested JSON. New properties can arrive with different data types, so we cannot store them as normalized key-value pairs.
Based on the documentation, it looks like whenever the schema changes we have to explicitly alter the table schema and then insert the new record. Since inserts happen from multiple source workers, adding an IF NOT EXISTS clause can help minimize race conditions, as in the example below.
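Concretely, the pattern I mean looks like this (table and column names are only for illustration):

-- Each worker would run this before inserting a record that carries a
-- previously unseen property; IF NOT EXISTS makes it a no-op when another
-- worker has already added the column.
ALTER TABLE events ADD COLUMN IF NOT EXISTS new_property String;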
Can anyone advise whether there is a better way to keep inserting events with dynamically changing schemas into ClickHouse? Is there any way to have ClickHouse infer the updated schema from the event being inserted?
Each column requires about 2 MB of RAM on insert, so a table with 1000 columns will eat ~2 GB per insert.
Each column is roughly 3 files on disk, which adds about 3 random reads/writes per insert.
ALTER TABLE can be blocked by merges/selects and may take hours to execute.
I would pick out the MAIN columns (those used most often in WHERE) from the JSON and put them into dedicated columns. Put the remaining fields into a K/V pair of two columns (Key Array(String), Value Array(String)), like the sketch below. Otherwise use MongoDB.
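A minimal sketch of that layout (the dedicated "main" columns here, event_time and event_type, are hypothetical examples):

CREATE TABLE events_kv
(
event_time DateTime, -- frequently filtered: dedicated column
event_type String,   -- frequently filtered: dedicated column
Key Array(String),   -- remaining JSON paths
Value Array(String)  -- remaining JSON values, stringified
)
ENGINE = MergeTree()
ORDER BY (event_type, event_time);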
What about using a fixed schema in ClickHouse that can hold any field your events contain, for example with Nested columns or JSON columns? I tried this for a PoC using Nested columns with good results. First I identified four different data types: strings, numbers, booleans, and dates.
The ClickHouse schema would look like this
CREATE TABLE custom_events
(
event_id UInt32,
event_date DateTime,
-- ... other columns depending on your use case
attribute_string Nested
(
name String,
value String
),
attribute_number Nested
(
name String,
value Float32
),
attribute_boolean Nested
(
name String,
value UInt8
),
attribute_date Nested
(
name String,
value DateTime('UTC')
)
)
ENGINE = MergeTree()
ORDER BY (event_id, event_date); -- example key; choose one that matches your query patterns
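With the default flatten_nested = 1 setting, ClickHouse stores each Nested column as a pair of parallel arrays. A quick way to see the effective layout (output abridged; the types follow from the schema above):

DESCRIBE TABLE custom_events;

-- event_id               UInt32
-- event_date             DateTime
-- attribute_string.name  Array(String)
-- attribute_string.value Array(String)
-- attribute_number.name  Array(String)
-- attribute_number.value Array(Float32)
-- ...and likewise for attribute_boolean and attribute_date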
As the output above shows, ClickHouse converts each Nested column into two parallel columns of Array type. To use this schema you need to group the event's fields by type and flatten them before inserting into ClickHouse. For example, the following event in JSON
{
"event_id": 1,
"event_date": "2018-08-02 10:06:23",
"data": {
"rides": [
{
"km": 93,
"vehicle": "plane",
"people": 2.15,
"finished": true,
"date": "2019-06-24T18:52:34"
}
]
}
}
Could be flattened like this
{
"event_id": 1,
"event_date": "2018-08-02 10:06:23",
"data.rides.0.km": 93,
"data.rides.0.vehicle": "plane",
"data.rides.0.people": 2.15,
"data.rides.0.finished": true,
"data.rides.0.date": "2019-06-24T18:52:34"
}
And then you can insert into ClickHouse like so
INSERT INTO custom_events VALUES (1, '2018-08-02 10:06:23',
['data.rides.0.vehicle'], ['plane'], -- Key and Value arrays for strings
['data.rides.0.km', 'data.rides.0.people'], [93, 2.15], -- Key and Value arrays for numbers
['data.rides.0.finished'], [1], -- Key and Value arrays for booleans
['data.rides.0.date'], ['2019-06-24 18:52:34'] -- Key and Value arrays for dates
);
Then you can query the data, leveraging all the great higher-order array functions provided by ClickHouse
SELECT
arrayFilter(x -> ((x.1) = 1), arrayMap((x, y) -> (x LIKE 'data.rides.%.km', y), attribute_number.name, attribute_number.value)).2 AS element,
event_date
FROM custom_events
WHERE (event_id = 1) AND notEmpty(element)
The query doesn't quite make sense as written because I redacted the fields a bit, but you get the idea. This way you never need to modify your schema, and you can store any arbitrary JSON with the same fixed schema in ClickHouse.
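If it helps, here is a simpler query in the same spirit (an untested sketch against the schema above, not from the original PoC): it pulls one flattened numeric attribute out of the parallel name/value arrays with arrayFirst.

SELECT
event_date,
-- return the value whose name matches the key produced by the flattening step
arrayFirst((v, k) -> k = 'data.rides.0.km',
attribute_number.value,
attribute_number.name) AS km
FROM custom_events
WHERE event_id = 1;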