 

SELECT query performance impact when a ClickHouse table is continuously populated with INSERT INTO

A ClickHouse table (MergeTree engine) is continuously populated with "INSERT INTO … FORMAT CSV" queries, starting from empty. The average input rate is 7000 rows per second, inserted in batches of a few thousand rows. This has a severe performance impact when SELECT queries are executed concurrently. According to the ClickHouse documentation, the system needs at most 10 minutes to merge (re-index) the data of a given table. But this is not happening while the table is continuously populated.
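For concreteness, a minimal sketch of this kind of setup (the table name, columns, and sample rows are hypothetical, as the question does not show the schema):

CREATE TABLE db.events
(
    ts    DateTime,
    key   UInt64,
    value Float64
)
ENGINE = MergeTree
ORDER BY (key, ts);

-- executed continuously, a few thousand rows per batch, ~7000 rows/s overall;
-- each such INSERT creates a new data part (sub-folder) on disk
INSERT INTO db.events FORMAT CSV
2025-12-10 12:00:00,1,0.5
2025-12-10 12:00:00,2,1.25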

This is also evident in the file system: the table folder contains thousands of sub-folders, and the index is over-segmented. If data ingestion stops, the table is fully merged within a few minutes, and the number of sub-folders drops to about a dozen.

To work around the above weakness, the Buffer engine was used to buffer the table's data ingestion for 10 minutes. Consequently, the buffer's maximum number of rows is on average 4,200,000 (7000 rows/s × 600 s).
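A Buffer table along these lines would implement that (names and exact thresholds are illustrative, not taken from the question):

CREATE TABLE db.events_buffer AS db.events
ENGINE = Buffer('db', 'events', 1,
    600, 600,             -- min_time, max_time: flush after 10 minutes
    4200000, 4200000,     -- min_rows, max_rows: or after ~4.2M rows
    10000000, 1000000000  -- min_bytes, max_bytes
);

-- writes go to the buffer; note that a SELECT against db.events_buffer
-- scans both the in-memory buffer and the underlying db.events table
INSERT INTO db.events_buffer FORMAT CSV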

The underlying table stays at most 10 minutes behind, since the buffer holds the most recently ingested rows. The table finally merges, and its behaviour is the same as when ingestion has been stopped for a few minutes. But the Buffer table, which corresponds to the combination of the buffer and the underlying table, becomes severely slower to query.

From the above, it appears that if the table is continuously populated, it does not merge, and indexing suffers. Is there a way to avoid this weakness?

asked Dec 10 '25 by manolis.alivizatos


1 Answer

The number of sub-folders in the table's data directory is not a very representative metric.

Indeed, each sub-folder contains a data part consisting of sorted (indexed) rows. When several data parts are merged into a new, bigger one, a new sub-folder appears.
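If you want to confirm that merges are actually running in the background, they can be observed directly (database name hypothetical):

SELECT table, elapsed, progress, num_parts, result_part_name
FROM system.merges
WHERE database = 'db';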

However, the source data parts are not removed instantly after the merge. There is a <merge_tree> setting, old_parts_lifetime, defining a delay after which merged-away parts are removed; by default it is set to 8 minutes. There is also a cleanup_delay_period setting defining how often a background cleaner checks for and removes outdated parts; it is 30 seconds by default.

So, it is normal to see such a number of sub-folders for roughly 8 minutes and 30 seconds after ingestion starts. If that is unacceptable to you, you can change these settings.
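For example, on recent ClickHouse versions both settings can be lowered per table (table name hypothetical, values illustrative):

ALTER TABLE db.events
    MODIFY SETTING old_parts_lifetime = 120,   -- remove merged-away parts after 2 minutes
                   cleanup_delay_period = 10;  -- check for outdated parts every 10 seconds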

It makes sense to check only the number of active parts in a table (i.e. parts that have not been merged into a bigger one). To do so, you can run the following query:

SELECT count() FROM system.parts WHERE database = 'db' AND table = 'table' AND active

Moreover, ClickHouse performs such checks internally: if the number of active parts in a partition is greater than parts_to_delay_insert = 150, it will slow down INSERTs, and if it is greater than parts_to_throw_insert = 300, it will reject insertions with an exception.
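The current values of these thresholds (and of the cleanup settings above) can be inspected in system.merge_tree_settings, for example:

SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert',
               'old_parts_lifetime', 'cleanup_delay_period');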

answered Dec 14 '25 by Vitaliy L.


