 

How can I improve query performance for 200+ million records

Background

I have a MySQL test environment with a table which contains over 200 million rows. On this table I have to execute two types of queries:

  1. Do certain rows exist. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to know which sgtins are present in the table.
  2. Select those rows. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to fetch the full row. (store, gtin...)

The table can grow to 200+ million records for a single 'client_id'.

Test environment

Xeon E3-1545M / 32GB RAM / SSD. InnoDB buffer pool 24GB. (Production will be a larger server with 192GB RAM)
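
For reference, the 24GB buffer pool corresponds to a single setting in the MySQL configuration; a minimal sketch, assuming a standard my.cnf:

# my.cnf sketch: give InnoDB a 24GB buffer pool
[mysqld]
innodb_buffer_pool_size = 24G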

Table

CREATE TABLE `sgtins` (
  `client_id` INT UNSIGNED NOT NULL,
  `sgtin` varchar(255) NOT NULL,
  `store` varchar(255) NOT NULL,
  `gtin` varchar(255) NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX (`client_id`, `store`, `sgtin`),
  INDEX (`client_id`),
  PRIMARY KEY (`client_id`,`sgtin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Tests

First I generated random sgtin values spread over 10 'client_id's to fill the table with 200 million rows.

I created a benchmark tool which executes the various queries I tried, and I used the explain plan to find out which performs best. For every test, the tool reads new random data from the data I used to fill the database, to ensure every query is different.

For this post I will use 28 sgtins.

Temp table

CREATE TEMPORARY TABLE sgtins_tmp_table (`sgtin` varchar(255) primary key)
 engine=MEMORY;
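
For illustration, the benchmark tool's load of this temp table amounts to a bulk insert along these lines (the sgtin values here are placeholders, not real data):

INSERT INTO sgtins_tmp_table (sgtin)
VALUES ('sgtin-0001'), ('sgtin-0002'), ('sgtin-0003');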

Exist query

I use this query to find out if the sgtins exist. It is also the fastest query I found. For 50K sgtins this query takes between 3 and 9 seconds.

-- cost = 17 for 28 sgtins loaded in the temp table.
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS 
  (SELECT sgtin FROM sgtins 
  WHERE sgtins.client_id = 4 
  AND sgtins.sgtin = sgtins_tmp_table.sgtin);

Explain plan

Select queries

-- cost = 50.60 for 28 sgtins loaded in the temp table. 50K not usable.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins_tmp_table, sgtins
WHERE sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan

-- cost = 64 for 28 sgtins loaded in the temp table.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin IN ( SELECT sgtins_tmp_table.sgtin
 FROM sgtins_tmp_table);

Explain plan

-- cost = 50.60 for 28 sgtins loaded in the temp table.
SELECT sgtins_tmp_table.sgtin, sgtins.store
FROM sgtins_tmp_table, sgtins
WHERE EXISTS (SELECT client_id, sgtin FROM sgtins WHERE client_id = 4 AND sgtins.sgtin = sgtins_tmp_table.sgtin)
AND sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan

Summary

The exist query is usable but the selects are too slow. What can I do about it? Any advice is welcome :)

asked Jun 13 '19 by Mark Ebbers



2 Answers

I would write your exists query like this:

SELECT stt.sgtin
FROM sgtins_tmp_table stt
WHERE EXISTS (SELECT 1
              FROM sgtins s
              WHERE s.client_id = 4 AND
                    s.sgtin = stt.sgtin
             );

For this query, you want an index on sgtins(sgtin, client_id).
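
That is, something along these lines (the index name here is arbitrary):

ALTER TABLE sgtins ADD INDEX idx_sgtin_client (sgtin, client_id);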

answered Sep 22 '22 by Gordon Linoff


I would suggest rewriting your EXISTS SQL, as correlated subqueries tend to optimize badly most of the time.
The suggestion would be to use an INNER JOIN instead.

SELECT filter.sgtin
FROM (SELECT '<value>' AS sgtin UNION ALL SELECT '<value>' ..) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4

Most likely this is faster than using a temporary table.
But you are dealing with 50K values, so it would make sense to generate the needed derived table SQL with dynamic SQL directly from the temporary table.

Also, as I suggested in the chat, making an index on (sgtin, client_id) would most likely make more sense, depending on the data selectivity, which is not really clear.
That index might also make your correlated subquery faster.

Query

# Maybe this also needs to be changed for 50K values
# SET SESSION max_allowed_packet = ??;


# needed for GROUP_CONCAT as it defaults to only 1024
SET SESSION group_concat_max_len = @@max_allowed_packet;

SET @UNION_SQL = NULL;

SELECT
  CONCAT(
       'SELECT '
    ,  GROUP_CONCAT(
          CONCAT("'", sgtins_tmp_table.sgtin,"'", ' AS sgtin')
          SEPARATOR ' UNION ALL SELECT '
       )
  )
FROM
 sgtins_tmp_table
INTO
 @UNION_SQL;


SET @SQL = CONCAT("
SELECT filter.sgtin
FROM (",@UNION_SQL,") AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
");


PREPARE q FROM @SQL;
EXECUTE q;
DEALLOCATE PREPARE q;

see demo

Edited because of comments

A more ideal approach would be to use a fixed table, which you index, and use CONNECTION_ID() to separate the search values per session.

CREATE TABLE sgtins_filter (
    connection_id INT
  , sgtin varchar(255) NOT NULL
  , INDEX(connection_id, sgtin)
);

Then you can simply join the two tables:

SELECT sgtins_filter.sgtin
FROM sgtins_filter
INNER JOIN sgtins
ON
    sgtins_filter.sgtin = sgtins.sgtin
  AND
    sgtins_filter.connection_id = CONNECTION_ID()
  AND 
    sgtins.client_id = 4; 

see demo
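
Each session would first stage its own search values under its CONNECTION_ID() and clean them up afterwards; a minimal sketch, with placeholder values:

INSERT INTO sgtins_filter (connection_id, sgtin)
VALUES (CONNECTION_ID(), 'sgtin-0001'),
       (CONNECTION_ID(), 'sgtin-0002');

-- run the join query above, then remove this session's values
DELETE FROM sgtins_filter
WHERE connection_id = CONNECTION_ID();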

answered Sep 22 '22 by Raymond Nijland