I try to improve performance of a SQL query, using MariaDB 10.1.18 (Linux Debian Jessie).
The server has a large amount of RAM (192GB) and SSD disks.
The real table has hundreds of millions of rows but I can reproduce my performance issue on a subset of the data and a simplified layout.
Here is the (simplified) table definition:
CREATE TABLE `data` (
`uri` varchar(255) NOT NULL,
`category` tinyint(4) NOT NULL,
`value` varchar(255) NOT NULL,
PRIMARY KEY (`uri`,`category`),
KEY `cvu` (`category`,`value`,`uri`),
KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):
#!/bin/bash
for i in `seq 1 100000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done
for i in `seq 99981 200000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done
So, we insert about:
I always run an ANALYZE TABLE before querying.
Here is the explain output of the query I run:
MariaDB [mydb]> EXPLAIN EXTENDED
-> SELECT d2.uri, d2.value
-> FROM data as d1
-> INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2
-> WHERE d1.category = 1 and d1.value = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| 1 | SIMPLE | d1 | ref | PRIMARY,cvu,cu | cu | 1 | const | 92964 | 100.00 | Using where |
| 1 | SIMPLE | d2 | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768 | mydb.d1.uri,const | 1 | 100.00 | |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value = 'foo';
+-----------+-------+
| uri | value |
+-----------+-------+
| uri100000 | 0 |
| uri99981 | 1 |
| uri99982 | 2 |
| uri99983 | 3 |
| uri99984 | 4 |
| uri99985 | 0 |
| uri99986 | 1 |
| uri99987 | 2 |
| uri99988 | 3 |
| uri99989 | 4 |
| uri99990 | 0 |
| uri99991 | 1 |
| uri99992 | 2 |
| uri99993 | 3 |
| uri99994 | 4 |
| uri99995 | 0 |
| uri99996 | 1 |
| uri99997 | 2 |
| uri99998 | 3 |
| uri99999 | 4 |
+-----------+-------+
20 rows in set (0.35 sec)
This query returns 20 rows in ~350ms.
It seems quite slow to me.
Is there a way to improve performance of such query? Any advice?
Can you try the following query?
SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
FROM data as dd
GROUP by 1
having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;
I cannot at the moment repeat your test, but my hope is that having to scan the table just once could compensate the usage of the aggregate functions.
Edited
Created a test environment and tested some hypothesis. As of today, the best performance (for 1 million rows) has been:
1 - Adding an index on uri column
2 - Using the following query
select d2.uri, d2.value
FROM data as d2
where exists (select 1
from data d1
where d1.uri = d2.uri
AND d1.category = 1
and d1.value='foo')
and d2.category=2
and d2.uri in (select uri from data group by 1 having count(*) > 1);
The ironic thing is that in the first proposal I tried to minimize the access to the table and now I'm proposing three accesses.
Edited: 30/10
Ok, so I've done some other experiments and I would like to summarize the outcomes. First, I'd like to expand a bit Aruna answer: what I found interesting in the OP question, is that it is an exception to a classic "rule of thumb" in database optimization: if the # of desired results is very small compared to the dimension of the tables involved, it should be possible with the correct indexes to have a very good performance.
Why can't we simply add a "magic index" to have our 20 rows? Because we don't have any clear "attack vector".. I mean, there's no clearly selective criteria we can apply on a record to reduce significatevely the number of the target rows.
Think about it: the fact that the value must be "foo" is just removing 50% of the table form the equation. Also the category is not selective at all: the only interest thing is that, for 20 uri, they appear both in records with category 1 and 2.
But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there's no way an index (not even the Oracle Function Based Indexes) can reduce a condition that is dependant on info on multiple rows.
The conlclusion might be: if these kind of query is what you need, you should revise your data model. For example, if you have a finite and small number of categories (lets' say three=, your table might be written as:
uri, value_category1, value_category2, value_category3
The query would be:
select uri, value_category2 where value_category1='foo' and value_category2 is not null;
By the way, let's go back tp the original question. I've created a slightly more efficient test data generator (http://pastebin.com/DP8Uaj2t).
I've used this table:
use mydb;
DROP TABLE IF EXISTS data2;
CREATE TABLE data2
(
uri varchar(255) NOT NULL,
category tinyint(4) NOT NULL,
value varchar(255) NOT NULL,
PRIMARY KEY (uri,category),
KEY cvu (category,value,uri),
KEY ucv (uri,category,value),
KEY u (uri),
KEY cu (category,uri)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The outcome is:
+--------------------------+----------+----------+----------+
| query_descr | num_rows | num | num_test |
+--------------------------+----------+----------+----------+
| exists_plus_perimeter | 10000 | 0.0000 | 5 |
| exists_plus_perimeter | 50000 | 0.0000 | 5 |
| exists_plus_perimeter | 100000 | 0.0000 | 5 |
| exists_plus_perimeter | 500000 | 2.0000 | 5 |
| exists_plus_perimeter | 1000000 | 4.8000 | 5 |
| exists_plus_perimeter | 5000000 | 26.7500 | 8 |
| max_based | 10000 | 0.0000 | 5 |
| max_based | 50000 | 0.0000 | 5 |
| max_based | 100000 | 0.0000 | 5 |
| max_based | 500000 | 3.2000 | 5 |
| max_based | 1000000 | 7.0000 | 5 |
| max_based | 5000000 | 49.5000 | 8 |
| max_based_with_ucv | 10000 | 0.0000 | 5 |
| max_based_with_ucv | 50000 | 0.0000 | 5 |
| max_based_with_ucv | 100000 | 0.0000 | 5 |
| max_based_with_ucv | 500000 | 2.6000 | 5 |
| max_based_with_ucv | 1000000 | 7.0000 | 5 |
| max_based_with_ucv | 5000000 | 36.3750 | 8 |
| standard_join | 10000 | 0.0000 | 5 |
| standard_join | 50000 | 0.4000 | 5 |
| standard_join | 100000 | 2.4000 | 5 |
| standard_join | 500000 | 13.4000 | 5 |
| standard_join | 1000000 | 33.2000 | 5 |
| standard_join | 5000000 | 205.2500 | 8 |
| standard_join_plus_perim | 5000000 | 155.0000 | 2 |
+--------------------------+----------+----------+----------+
The queries used are:
- query_max_based_with_ucv.sql
- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join_plus_perim.sql query_standard_join.sql
The best query is still the "query_exists_plus_perimeter"that I've put after the first environment creation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With