I'm going to migrate data from a PostgreSQL database to Yandex's ClickHouse. One of the fields in a source table is of type JSON and is called additional_data. PostgreSQL lets me access JSON attributes in e.g. SELECT ... queries with ->>, -> and so on.
I need the same behavior to persist in my resulting table in ClickHouse storage (i.e. the ability to parse JSON during SELECT queries and/or when using filtering and aggregation clauses).
Here is what I've done during CREATE TABLE ... in the ClickHouse client:
create table if not exists analytics.events
(
uuid UUID,
...,
created_at DateTime,
updated_at DateTime,
additional_data Nested (
message Nullable(String),
eventValue Nullable(String),
rating Nullable(String),
focalLength Nullable(Float64)
)
)
engine = MergeTree
ORDER BY (uuid, created_at)
PRIMARY KEY uuid;
Is that a good way to store JSON-serializable data? Any ideas?
Maybe it's better to store the JSON data as a plain String instead of Nested and work with it using the special JSON functions?
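For reference, the String approach would presumably mean querying with ClickHouse's JSON functions, something like this sketch (the 'click' filter value is only illustrative):

-- assuming additional_data were declared as a plain String column
SELECT
    JSONExtractString(additional_data, 'message') AS message,
    JSONExtractFloat(additional_data, 'focalLength') AS focal_length
FROM analytics.events
WHERE JSONExtractString(additional_data, 'eventValue') = 'click';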
Although ClickHouse uses fast JSON libraries (such as simdjson and rapidjson) for parsing, I think dedicated nested fields should be faster than parsing JSON strings at query time.
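Keep in mind that a Nested column is stored as a set of parallel arrays, so its sub-fields are usually unfolded with ARRAY JOIN; a sketch against the table from the question (the 'click' filter value is only illustrative):

SELECT
    uuid,
    additional_data.message,
    additional_data.focalLength
FROM analytics.events
ARRAY JOIN additional_data
WHERE additional_data.eventValue = 'click';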
If the JSON structure is fixed or changes predictably, consider denormalizing the data:
..
created_at DateTime,
updated_at DateTime,
additional_data_message Nullable(String),
additional_data_eventValue Nullable(String),
additional_data_rating Nullable(String),
additional_data_focalLength Nullable(Float64)
..
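With that layout, the PostgreSQL-style ->> lookups become ordinary column references (again, the filter value is only illustrative):

SELECT
    additional_data_rating,
    count() AS events
FROM analytics.events
WHERE additional_data_eventValue = 'click'
GROUP BY additional_data_rating;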
On the one hand, this can significantly increase the number of columns and the disk space used; on the other hand, it should give a significant increase in performance (especially with the right indexing). Moreover, the disk footprint can be reduced using the LowCardinality type and column codecs (see the revised table sketch below).
Avoid using Nullable types; prefer replacements such as '' or 0 instead (see the explanation in Clickhouse string field disk usage: null vs empty).
The UUID type doesn't give index monotonicity; this ordering should be much better (see More secrets of ClickHouse Query Performance):
..
ORDER BY (created_at, uuid);
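Putting those points together, a revised table definition could look roughly like this (a sketch only; the codec choices and default values are assumptions to illustrate the idea):

create table if not exists analytics.events
(
    uuid UUID,
    ...,
    created_at DateTime CODEC(DoubleDelta, LZ4),
    updated_at DateTime CODEC(DoubleDelta, LZ4),
    additional_data_message String DEFAULT '',
    additional_data_eventValue LowCardinality(String),
    additional_data_rating LowCardinality(String),
    additional_data_focalLength Float64 DEFAULT 0
)
engine = MergeTree
ORDER BY (created_at, uuid);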