Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How make JOIN table in ClickHouse DB faster?

Tags:

clickhouse

I have two tables

event

  • id
  • os

params

  • id
  • sx
  • sy

This table have relation 1-1 by id. If execute query

select count(*)
from
(select id from event where os like 'Android%')
inner join
(select id from params where sx >= 1024)
using id

they very slow

But if all data contains in one table

select count(*) from event where sx >= 1024 and os like 'Android%'

Query executed very fast.

Please, tell me how use join in ClickHouse DB effective? Keep all data in one table is not convenient.

like image 828
Oleg Khamov Avatar asked Nov 02 '16 12:11

Oleg Khamov


People also ask

Does ClickHouse support joins?

Supported Types of JOIN​All standard SQL JOIN types are supported: INNER JOIN , only matching rows are returned. LEFT OUTER JOIN , non-matching rows from left table are returned in addition to matching rows. RIGHT OUTER JOIN , non-matching rows from right table are returned in addition to matching rows.

How fast is ClickHouse?

Under the same conditions, ClickHouse can handle several hundred queries per second on a single server (up to several thousand in the best case). Since this scenario is not typical for analytical DBMSs, we recommend expecting a maximum of 100 queries per second.

Can we use joins in materialized view?

Fast refresh for a materialized view containing only joins is possible after any type of DML to the base tables (direct-path or conventional INSERT , UPDATE , or DELETE ). A materialized view containing only joins can be defined to be refreshed ON COMMIT or ON DEMAND .

How do I create a temporary table in ClickHouse?

From SELECT query​CREATE TABLE [IF NOT EXISTS] [db.] table_name[(name1 [type1], name2 [type2], ...)] ENGINE = engine AS SELECT ... Creates a table with a structure like the result of the SELECT query, with the engine engine, and fills it with data from SELECT .

Is ClickHouse a time series?

Clickhouse is an analytic column-based RDBMS. It's not a timeseries database. Each class of product is used to solve different problems. Time-series data is just data where every record has a "time" field.


2 Answers

I experience same problems with joining 2 huge distributed tables. There are 2 main problems

  • durion of executing
  • limits by needed memory for a query.

What works for me is sharding calculation query by id%N using subqueries and union all results then.

SELECT count(*)
FROM
(
    SELECT 1
    FROM event
    WHERE id%2=0 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 0 AND sx >= 1024
    )
    UNION ALL
    SELECT 2
    FROM event
    WHERE id % 2 = 1 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 1 AND sx >= 1024
    )
)

You can change id%N(2 in the example) until you get needed performance. Need to replace IN to GLOBAL IN if you use distributed engines for tables.

like image 147
Evgeniy Skomorokhov Avatar answered Sep 24 '22 14:09

Evgeniy Skomorokhov


You may rewrite query like this:

select count(*)
from event 
where os like 'Android%' 
AND id IN (select id from params where sx >= 1024)
like image 42
uYSIZfoz Avatar answered Sep 23 '22 14:09

uYSIZfoz