Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ClickHouse: How to store JSON data the right way?

I'm going to migrate data from PostgreSQL database to Yandex's ClickHouse. One of the fields in a source table is of type JSON - called additional_data. So, PostgreSQL allows me to access json attributes during e.g. SELECT ... queries with ->> and -> and so on.

I need the same behavior to persist in my resulting table in ClickHouse storage. (i.e. the ability to parse JSON during select queries and/or when using filtering and aggregation clauses)

Here is what I've done during CREATE TABLE ... in ClickHouse client:

create table if not exists analytics.events
(
    uuid UUID,
    ...,
    created_at DateTime,
    updated_at DateTime,
    additional_data Nested (
        message Nullable(String),
        eventValue Nullable(String),
        rating Nullable(String),
        focalLength Nullable(Float64)
        )
)
engine = MergeTree

ORDER BY (uuid, created_at)
PRIMARY KEY uuid;

Is that a good choice how to store JSON-serializable data? Any Ideas?

Maybe It's better to store a JSON data as a plain String instead of Nested and playing with It using special functions?

like image 420
arturkuchynski Avatar asked Sep 30 '20 06:09

arturkuchynski


People also ask

How should a data value stored in JSON?

Data is stored in a set of key-value pairs. This data is human readable, which makes JSON perfect for manual editing. From this little snippet you can see that keys are wrapped in double quotes, a colon separates the key and the value, and the value can be of different types. Key-value sets are separated by a comma.

What is the correct way to write a JSON array?

A JSON array contains zero, one, or more ordered elements, separated by a comma. The JSON array is surrounded by square brackets [ ] . A JSON array is zero terminated, the first index of the array is zero (0). Therefore, the last index of the array is length - 1.

Is JSON the best way to store data?

JSON is perfect for storing temporary data that's consumed by the entity that creates the data. A good example is user-generated data such as filling out a form or information exchange between an API and an app.

What is the best database to store JSON?

The best database for JSON A JSON database like MongoDB stores the data in a JSON-like format (binary JSON), which is the binary encoded version of JSON, and is optimized for performance and space. This makes the MongoDB database the best natural fit for storing JSON data.


1 Answers

  1. Although ClickHouse uses the fast JSON libraries (such as simdjson and rapidjson) to parsing I think the Nesting-fields should be faster.

  2. If the JSON structure is fixed or be changed predictably try to consider the way of denormalizing data:

..
    created_at DateTime,
    updated_at DateTime,
    additional_data_message Nullable(String),
    additional_data_eventValue Nullable(String),
    additional_data_rating Nullable(String),
    additional_data_focalLength Nullable(Float64)
..

On one hand, it can significantly increase the count of rows and disk space, on another side, it should give a significant increase in performance (especially in the right indexing). Moreover, the disk size can be reduced using LowCardinality-type and Codecs.

  1. Some others remarks:
  • avoid to use Nullable types, prefer to use some replacement such as '', 0, etc (see explanation Clickhouse string field disk usage: null vs empty)

  • UUID type doesn't give index monotonicity, this one should be much better (More secrets of ClickHouse Query Performance):

..
ORDER BY (created_at, uuid);
  • consider using Aggregating-engines to significantly increase the speed of calculation aggregated values
  1. In any case before making a final decision need to do manual testing on a data subset (this applies as to choose the schema (json as string/Nested type/denormalized way), as choosing the column codec).
like image 57
vladimir Avatar answered Sep 30 '22 17:09

vladimir