Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is 'distinct' an ordinary operation for ClickHouse?

Tags:

clickhouse

I would like to use ClickHouse for marketing. Most of the time they not just want to know HOW much people use some feature but the exact emails to send spam to. Is that a good choice to use ClickHouse for such purpose (select DISTINCT email from table where ...)? What is the difference in performance between 'select COUNT' and 'select DISTINCT'?

like image 503
Eugene To Avatar asked Oct 29 '18 18:10

Eugene To


1 Answers

Is that a good choice to use ClickHouse for such purpose

Yes, ClickHouse has decent HashTable and Aggregator implementations. It heavily uses templated code for static type dispatching and applies a lot of memory tricks. And it stores data in a compact form.

I assume you'd like to compare select count and select count(distinct) as select distinct is a different beast. ClickHouse transforms count(distinct) into aggregator uniqExact which is about 8 times slower than count(*), but is still much faster than tranditional databases like Postgres. There are also approximate aggregators uniq, uniqCombined and uniqHLL12 for faster estimations, which is around 1.5 times slower than count(*). See https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/ if you need more info.

If your goal is select distinct, ClickHouse can still do it well, which uses Set data structure to uniquify the data streams (Set is also used for building its SQL in (...) construct). Without measuring the data output process, it's only 1.3x slower than plain count(*).

like image 63
Amos Avatar answered Oct 15 '22 14:10

Amos