Improve self-JOIN SQL Query performance

Question

I try to improve performance of a SQL query, using MariaDB 10.1.18 (Linux Debian Jessie).

The server has a large amount of RAM (192GB) and SSD disks.

The real table has hundreds of millions of rows but I can reproduce my performance issue on a subset of the data and a simplified layout.

Here is the (simplified) table definition:

CREATE TABLE `data` (
  `uri` varchar(255) NOT NULL,
  `category` tinyint(4) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`uri`,`category`),
  KEY `cvu` (`category`,`value`,`uri`),
  KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):

#!/bin/bash
for i in `seq 1 100000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done

for i in `seq 99981 200000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done

So, we insert about:

100'000 rows in category 1 with a static string ("foo") as value
100'000 rows in category 2 with a number between 1 and 5 as the value
20 rows have a common "uri" between each dataset (category 1 / 2)

I always run an ANALYZE TABLE before querying.

Here is the explain output of the query I run:

MariaDB [mydb]> EXPLAIN EXTENDED
    -> SELECT d2.uri, d2.value
    -> FROM data as d1
    -> INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2
    -> WHERE d1.category = 1 and d1.value  = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id   | select_type | table | type   | possible_keys  | key     | key_len | ref               | rows  | filtered | Extra       |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
|    1 | SIMPLE      | d1    | ref    | PRIMARY,cvu,cu | cu      | 1       | const             | 92964 |   100.00 | Using where |
|    1 | SIMPLE      | d2    | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768     | mydb.d1.uri,const |     1 |   100.00 |             |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                                                                              |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value  = 'foo';
+-----------+-------+
| uri       | value |
+-----------+-------+
| uri100000 | 0     |
| uri99981  | 1     |
| uri99982  | 2     |
| uri99983  | 3     |
| uri99984  | 4     |
| uri99985  | 0     |
| uri99986  | 1     |
| uri99987  | 2     |
| uri99988  | 3     |
| uri99989  | 4     |
| uri99990  | 0     |
| uri99991  | 1     |
| uri99992  | 2     |
| uri99993  | 3     |
| uri99994  | 4     |
| uri99995  | 0     |
| uri99996  | 1     |
| uri99997  | 2     |
| uri99998  | 3     |
| uri99999  | 4     |
+-----------+-------+
20 rows in set (0.35 sec)

This query returns 20 rows in ~350ms.

It seems quite slow to me.

Is there a way to improve performance of such query? Any advice?

Insac · Accepted Answer

Can you try the following query?

  SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
    FROM data as dd
   GROUP by 1 
  having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;

I cannot at the moment repeat your test, but my hope is that having to scan the table just once could compensate the usage of the aggregate functions.

Edited

Created a test environment and tested some hypothesis. As of today, the best performance (for 1 million rows) has been:

1 - Adding an index on uri column

2 - Using the following query

 select d2.uri, d2.value 
   FROM data as d2 
  where exists (select 1 
                  from data d1 
                 where d1.uri  = d2.uri 
                   AND d1.category = 1 
                   and d1.value='foo') 
    and d2.category=2 
    and d2.uri in (select uri from data group by 1 having count(*) > 1);

The ironic thing is that in the first proposal I tried to minimize the access to the table and now I'm proposing three accesses.

Edited: 30/10

Ok, so I've done some other experiments and I would like to summarize the outcomes. First, I'd like to expand a bit Aruna answer: what I found interesting in the OP question, is that it is an exception to a classic "rule of thumb" in database optimization: if the # of desired results is very small compared to the dimension of the tables involved, it should be possible with the correct indexes to have a very good performance.

Why can't we simply add a "magic index" to have our 20 rows? Because we don't have any clear "attack vector".. I mean, there's no clearly selective criteria we can apply on a record to reduce significatevely the number of the target rows.

Think about it: the fact that the value must be "foo" is just removing 50% of the table form the equation. Also the category is not selective at all: the only interest thing is that, for 20 uri, they appear both in records with category 1 and 2.

But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there's no way an index (not even the Oracle Function Based Indexes) can reduce a condition that is dependant on info on multiple rows.

The conlclusion might be: if these kind of query is what you need, you should revise your data model. For example, if you have a finite and small number of categories (lets' say three=, your table might be written as:

uri, value_category1, value_category2, value_category3

The query would be:

select uri, value_category2 where value_category1='foo' and value_category2 is not null;

By the way, let's go back tp the original question. I've created a slightly more efficient test data generator (http://pastebin.com/DP8Uaj2t).

I've used this table:

 use mydb;
 DROP TABLE IF EXISTS data2;

 CREATE TABLE data2 
 ( 
  uri varchar(255) NOT NULL, 
  category tinyint(4) NOT NULL, 
  value varchar(255) NOT NULL, 
  PRIMARY KEY (uri,category), 
  KEY cvu (category,value,uri), 
  KEY ucv (uri,category,value), 
  KEY u (uri), 
  KEY cu (category,uri)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The outcome is:

 +--------------------------+----------+----------+----------+
 | query_descr              | num_rows | num      | num_test |
 +--------------------------+----------+----------+----------+
 | exists_plus_perimeter    |    10000 |   0.0000 |        5 |
 | exists_plus_perimeter    |    50000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   100000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   500000 |   2.0000 |        5 |
 | exists_plus_perimeter    |  1000000 |   4.8000 |        5 |
 | exists_plus_perimeter    |  5000000 |  26.7500 |        8 |
 | max_based                |    10000 |   0.0000 |        5 |
 | max_based                |    50000 |   0.0000 |        5 |
 | max_based                |   100000 |   0.0000 |        5 |
 | max_based                |   500000 |   3.2000 |        5 |
 | max_based                |  1000000 |   7.0000 |        5 |
 | max_based                |  5000000 |  49.5000 |        8 |
 | max_based_with_ucv       |    10000 |   0.0000 |        5 |
 | max_based_with_ucv       |    50000 |   0.0000 |        5 |
 | max_based_with_ucv       |   100000 |   0.0000 |        5 |
 | max_based_with_ucv       |   500000 |   2.6000 |        5 |
 | max_based_with_ucv       |  1000000 |   7.0000 |        5 |
 | max_based_with_ucv       |  5000000 |  36.3750 |        8 |
 | standard_join            |    10000 |   0.0000 |        5 |
 | standard_join            |    50000 |   0.4000 |        5 |
 | standard_join            |   100000 |   2.4000 |        5 |
 | standard_join            |   500000 |  13.4000 |        5 |
 | standard_join            |  1000000 |  33.2000 |        5 |
 | standard_join            |  5000000 | 205.2500 |        8 |
 | standard_join_plus_perim |  5000000 | 155.0000 |        2 |
 +--------------------------+----------+----------+----------+

The queries used are: - query_max_based_with_ucv.sql
- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join_plus_perim.sql query_standard_join.sql

The best query is still the "query_exists_plus_perimeter"that I've put after the first environment creation.

Improve self-JOIN SQL Query performance

Tags:

performance

mysql

mariadb

Nicolas Payart

1 Answers

Insac

Recent Activity

Donate For Us

Improve self-JOIN SQL Query performance

Tags:

performance

mysql

mariadb

Nicolas Payart

1 Answers

Insac

Related questions

Recent Activity

Donate For Us