Background
I have a MySQL test environment with a table which contains over 200 million rows. On this table I have to execute two types of queries:
1. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to know which sgtins are present in the table.
2. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to fetch the full row (store, gtin, ...).
The table can grow to 200+ million records for a single client_id.
Test environment
Xeon E3-1545M / 32GB RAM / SSD. InnoDB buffer pool 24GB. (Production will be a larger server with 192GB RAM)
Table
CREATE TABLE `sgtins` (
`client_id` INT UNSIGNED NOT NULL,
`sgtin` varchar(255) NOT NULL,
`store` varchar(255) NOT NULL,
`gtin` varchar(255) NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
INDEX (`client_id`, `store`, `sgtin`),
INDEX (`client_id`),
PRIMARY KEY (`client_id`,`sgtin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Tests
First I generated random sgtin values spread over 10 'client_id's to fill the table with 200 million rows.
I created a benchmark tool which executes the various queries I tried. I also used the explain plan to find out which performs best. For every test, this tool reads new random data from the data I used to fill the database, to ensure every query is different.
For this post I will use 28 sgtins.
Temp table
CREATE TEMPORARY TABLE sgtins_tmp_table (`sgtin` varchar(255) primary key)
engine=MEMORY;
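The loading step is not shown above; as an assumed sketch, the sgtins would be inserted in batched multi-row statements:

-- Assumed loading step (not shown in the original post): insert the
-- sgtins in multi-row batches, e.g. a few thousand values per statement.
INSERT INTO sgtins_tmp_table (sgtin)
VALUES ('<sgtin 1>'), ('<sgtin 2>'), ('<sgtin 3>');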
Exist query
I use this query to find out if the sgtins exist. It is also the fastest query I found. For 50,000 sgtins this query takes between 3 and 9 seconds.
-- cost = 17 for 28 sgtins loaded in the temp table.
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS
(SELECT sgtin FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin = sgtins_tmp_table.sgtin);
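The cost figures in these comments come from the optimizer's estimates; a minimal sketch of how to read them, assuming they are taken from the JSON explain plan (whose query_cost field is reported in MySQL 5.7+):

-- Sketch: prefix the query with EXPLAIN FORMAT=JSON and read the
-- "query_cost" field from the resulting plan.
EXPLAIN FORMAT=JSON
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS
    (SELECT sgtin FROM sgtins
     WHERE sgtins.client_id = 4
       AND sgtins.sgtin = sgtins_tmp_table.sgtin);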
Select queries
-- cost = 50.60 for 28 sgtins loaded in the temp table. Not usable with 50K.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins_tmp_table, sgtins
WHERE sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;
-- cost = 64 for 28 sgtins loaded in the temp table.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin IN ( SELECT sgtins_tmp_table.sgtin
FROM sgtins_tmp_table);
-- cost = 50.60 for 28 sgtins loaded in the temp table.
SELECT sgtins_tmp_table.sgtin, sgtins.store
FROM sgtins_tmp_table, sgtins
WHERE EXISTS (SELECT client_id, sgtin FROM sgtins WHERE client_id = 4 AND sgtins.sgtin = sgtins_tmp_table.sgtin)
AND sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;
Summary
The exist query is usable, but the selects are too slow. What can I do about it? Any advice is welcome :)
1. Check indexes.
2. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement.
3. Limit the size of your working data set.
4. Only select the fields you need.
5. Remove unnecessary tables and indexes.
6. Remove OUTER JOINs.
MongoDB is also considered a strong database for large amounts of text and for large data sets.
I would write your exists query like this:
SELECT stt.sgtin
FROM sgtins_tmp_table stt
WHERE EXISTS (SELECT 1
              FROM sgtins s
              WHERE s.client_id = 4 AND
                    s.sgtin = stt.sgtin
             );
For this query, you want an index on sgtins(sgtin, client_id).
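For reference, a minimal sketch of adding that index (the index name idx_sgtin_client is an arbitrary choice):

ALTER TABLE sgtins ADD INDEX idx_sgtin_client (sgtin, client_id);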
I would suggest rewriting your EXISTS SQL, as correlated subqueries tend to optimize badly most of the time. The suggestion would be to use an INNER JOIN instead.
SELECT filter.sgtin
FROM (SELECT '<value>' AS sgtin UNION ALL SELECT '<value>' ..) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
As most likely this is faster than using a temporary table. But you are dealing with 50K values, so it would make sense to generate the needed derived table SQL with dynamic SQL directly from the temporary table.
Also, like I suggested in the chat, making an index (sgtin, client_id) would most likely make more sense, depending on the data selectivity, which is not really clear. That index might make your correlated subquery faster.
Query
# Maybe this also needs to be changed for 50K values:
# SET SESSION max_allowed_packet = ??;
# Needed for GROUP_CONCAT, as it defaults to only 1024.
SET SESSION group_concat_max_len = @@max_allowed_packet;
SET @UNION_SQL = NULL;
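# Build one big 'SELECT ... UNION ALL SELECT ...' statement from the
# temp table contents and store it in @UNION_SQL.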
SELECT
CONCAT(
'SELECT '
, GROUP_CONCAT(
CONCAT("'", sgtins_tmp_table.sgtin,"'", ' AS sgtin')
SEPARATOR ' UNION ALL SELECT '
)
)
FROM
sgtins_tmp_table
INTO
@UNION_SQL;
SET @SQL = CONCAT("
SELECT filter.sgtin
FROM (",@UNION_SQL,") AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
");
PREPARE q FROM @SQL;
EXECUTE q;
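Once executed, the prepared statement can be deallocated again:

DEALLOCATE PREPARE q;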
see demo
Edited because of comments
A more ideal approach would be to use a fixed table which you index, and use CONNECTION_ID() to separate the search values.
CREATE TABLE sgtins_filter (
connection_id INT
, sgtin varchar(255) NOT NULL
, INDEX(connection_id, sgtin)
);
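Each connection first loads its own search values, tagged with its connection id (this loading step is an assumed sketch, not shown above):

# Assumed: load this connection's search values, tagged with CONNECTION_ID().
INSERT INTO sgtins_filter (connection_id, sgtin)
VALUES (CONNECTION_ID(), '<value>'), (CONNECTION_ID(), '<value>');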
Then you can simply join between both tables:
SELECT sgtins_filter.sgtin
FROM sgtins_filter
INNER JOIN sgtins
ON
sgtins_filter.sgtin = sgtins.sgtin
AND
sgtins_filter.connection_id = CONNECTION_ID()
AND
sgtins.client_id = 4;
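Afterwards, remove this connection's values again (assumed cleanup step):

DELETE FROM sgtins_filter WHERE connection_id = CONNECTION_ID();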
see demo