I would like to use ClickHouse for marketing. Most of the time they not just want to know HOW much people use some feature but the exact emails to send spam to. Is that a good choice to use ClickHouse for such purpose (select DISTINCT email from table where ...)? What is the difference in performance between 'select COUNT' and 'select DISTINCT'?
Is that a good choice to use ClickHouse for such purpose
Yes, ClickHouse has decent HashTable and Aggregator implementations. It heavily uses templated code for static type dispatching and applies a lot of memory tricks. And it stores data in a compact form.
I assume you'd like to compare select count
and select count(distinct)
as select distinct
is a different beast. ClickHouse transforms count(distinct)
into aggregator uniqExact
which is about 8 times slower than count(*)
, but is still much faster than tranditional databases like Postgres. There are also approximate aggregators uniq
, uniqCombined
and uniqHLL12
for faster estimations, which is around 1.5 times slower than count(*)
. See https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/ if you need more info.
If your goal is select distinct
, ClickHouse can still do it well, which uses Set
data structure to uniquify the data streams (Set
is also used for building its SQL in (...)
construct). Without measuring the data output process, it's only 1.3x slower than plain count(*)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With