Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How understand the granularity and block in ClickHouse?

Tags:

clickhouse

I am not clear about these two words. Whether does one block have a fixed number of rows? Whether is one block the minimum unit to read from disk? Whether are different blocks stored in different files? Whether is the range of one block bigger than granule? That means, one block can have several granules skip indices.

like image 377
GOGO Avatar asked Feb 17 '20 04:02

GOGO


People also ask

What is granularity in ClickHouse?

A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ClickHouse does not split rows or values, so each granule always contains an integer number of rows. The first row of a granule is marked with the value of the primary key for the row.

What is ClickHouse MergeTree?

The MergeTree table engine is mainly used to analyze large amounts of data. It supports features, such as data partitioning, storage ordering, primary key indexing, sparse indexing, data TTL. MergeTree supports all ClickHouse SQL syntaxes with some features different from MySQL.

How do I create an index in ClickHouse?

Adding an index can be easily done with the ALTER TABLE ADD INDEX statement. After the index is added, only new incoming data will get indexed. Clickhouse provides ALTER TABLE [db.] table MATERIALIZE INDEX name IN PARTITION partition_name statement to rebuild the index in an existing partition.


1 Answers

https://clickhouse.tech/docs/en/operations/table_engines/mergetree/#primary-keys-and-indexes-in-queries

Primary key is sparsed. By default it contains 1 value of each 8192 rows (= 1 granule).

Let's disable adaptive granularity (for the test) -- index_granularity_bytes=0

create table X (A Int64) 
Engine=MergeTree order by A 
settings index_granularity=16,index_granularity_bytes=0;

insert into X select * from numbers(32);

index_granularity=16 -- 32 rows = 2 granule , primary index have 2 values 0 and 16

select marks, primary_key_bytes_in_memory from system.parts where table = 'X';
┌─marks─┬─primary_key_bytes_in_memory─┐
│     2 │                          16 │
└───────┴─────────────────────────────┘

16 bytes === 2 values of INT64.

Adaptive index granularity means that granules size various. Because wide rows (many bytes) needs (for performance) fewer (<8192) rows in granule.

index_granularity_bytes = 10MB ~ 1k row * 8129. So each granule have 10MB. If rows size 100k (long Strings), granule will have 100 rows (not 8192).


Skip index granules GRANULARITY 3 -- means that an index will store one value for each 3 table granules.

create table X (A Int64, B Int64, INDEX IX1 (B) TYPE minmax GRANULARITY 4) 
Engine=MergeTree order by A 
settings index_granularity=16,index_granularity_bytes=0;

insert into X select number, number from numbers(128);

128/16 = 8, table have 8 granules, INDEX IX1 stores 2 values of minmax (8/4)

So minmax index stores 2 values -- (0..63) and (64..128)

0..63 -- points to the first 4 table's granules.

64..128 -- points to the second 4 table' granules.

set send_logs_level='debug'
select * from X where B=77
[ 84 ] <Debug> dw.X (SelectExecutor): **Index `IX1` has dropped 1 granules**
[ 84 ] <Debug> dw.X (SelectExecutor): Selected 1 parts by date, 1 parts by key, **4 marks** to read from 1 ranges

SelectExecutor checked skip index - 4 table granules can be skipped because 77 is not in 0..63 . And another 4 granules must be read ( 4 marks ) because 77 in (64..128) -- some of that 4 granules have B=77.

like image 174
Denny Crane Avatar answered Sep 24 '22 06:09

Denny Crane