 

How can I improve query performance for 200+ million records

Background

I have a MySQL test environment with a table which contains over 200 million rows. On this table I have to execute two types of queries:

  1. Do certain rows exist. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to know which sgtins are present in the table.
  2. Select those rows. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to fetch the full row. (store, gtin...)

The table can grow to 200+ million records for a single 'client_id'.

Test environment

Xeon E3-1545M / 32GB RAM / SSD. InnoDB buffer pool 24GB. (Production will be a larger server with 192GB RAM)
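
For reference, the 24GB buffer pool corresponds to a single setting in the MySQL configuration; a minimal sketch, assuming a standard my.cnf:

# my.cnf sketch: give InnoDB a 24GB buffer pool
[mysqld]
innodb_buffer_pool_size = 24G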

Table

CREATE TABLE `sgtins` (
  `client_id` INT UNSIGNED NOT NULL,
  `sgtin` varchar(255) NOT NULL,
  `store` varchar(255) NOT NULL,
  `gtin` varchar(255) NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX (`client_id`, `store`, `sgtin`),
  INDEX (`client_id`),
  PRIMARY KEY (`client_id`,`sgtin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Tests

First I generated random sgtin values spread over 10 'client_id's to fill the table with 200 million rows.

I created a benchmark tool which executes the various queries I tried, and I used the explain plan to find out which performs best. For every test, the tool reads new random data from the data I used to fill the database, to ensure every query is different.

For this post I will use 28 sgtins.

Temp table

CREATE TEMPORARY TABLE sgtins_tmp_table (`sgtin` varchar(255) primary key)
 engine=MEMORY;
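
For illustration, the benchmark tool's load of this temp table amounts to a bulk insert along these lines (the sgtin values here are placeholders, not real data):

INSERT INTO sgtins_tmp_table (sgtin)
VALUES ('sgtin-0001'), ('sgtin-0002'), ('sgtin-0003');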

Exist query

I use this query to find out if the sgtins exist. It is also the fastest query I found. For 50K sgtins this query takes between 3 and 9 seconds.

-- cost = 17 for 28 sgtins loaded in the temp table.
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS 
  (SELECT sgtin FROM sgtins 
  WHERE sgtins.client_id = 4 
  AND sgtins.sgtin = sgtins_tmp_table.sgtin);

Explain plan

Select queries

-- cost = 50.60 for 28 sgtins loaded in the temp table. 50K not usable.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins_tmp_table, sgtins
WHERE sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan

-- cost = 64 for 28 sgtins loaded in the temp table.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin IN ( SELECT sgtins_tmp_table.sgtin
 FROM sgtins_tmp_table);

Explain plan

-- cost = 50.60 for 28 sgtins loaded in the temp table.
SELECT sgtins_tmp_table.sgtin, sgtins.store
FROM sgtins_tmp_table, sgtins
WHERE EXISTS (SELECT client_id, sgtin FROM sgtins WHERE client_id = 4 AND sgtins.sgtin = sgtins_tmp_table.sgtin)
AND sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan

Summary

The exist query is usable but the selects are too slow. What can I do about it? Any advice is welcome :)

asked Jun 13 '19 by Mark Ebbers



2 Answers

I would write your exists query like this:

SELECT stt.sgtin
FROM sgtins_tmp_table stt
WHERE EXISTS (SELECT 1
              FROM sgtins s
              WHERE s.client_id = 4 AND
                    s.sgtin = stt.sgtin
             );

For this query, you want an index on sgtins(sgtin, client_id).
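
That is, something along these lines (the index name here is arbitrary):

ALTER TABLE sgtins ADD INDEX idx_sgtin_client (sgtin, client_id);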

answered Sep 22 '22 by Gordon Linoff


I would suggest rewriting your EXISTS SQL, as correlated subqueries tend to optimize badly most of the time.
The suggestion would be to use an INNER JOIN instead.

SELECT filter.sgtin
FROM (SELECT '<value>' AS sgtin UNION ALL SELECT '<value>' ..) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4

Most likely this is faster than using a temporary table.
But you are dealing with 50K values, so it would make sense to generate the needed derived table SQL with dynamic SQL directly from the temporary table.

Also, as I suggested in the chat, making an index on (sgtin, client_id) would most likely make more sense, depending on the data selectivity, which is not really clear.
That index might also make your correlated subquery faster.

Query

# Maybe this also needs to be changed for 50K values
# SET SESSION max_allowed_packet = ??;


# needed for GROUP_CONCAT as it defaults to only 1024
SET SESSION group_concat_max_len = @@max_allowed_packet;

SET @UNION_SQL = NULL;

SELECT
  CONCAT(
       'SELECT '
    ,  GROUP_CONCAT(
          CONCAT("'", sgtins_tmp_table.sgtin,"'", ' AS sgtin')
          SEPARATOR ' UNION ALL SELECT '
       )
  )
FROM
 sgtins_tmp_table
INTO
 @UNION_SQL;


SET @SQL = CONCAT("
SELECT filter.sgtin
FROM (",@UNION_SQL,") AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
");


PREPARE q FROM @SQL;
EXECUTE q;
DEALLOCATE PREPARE q;

see demo

Edited because of comments

A more ideal approach would be to use a fixed table, which you index, and use CONNECTION_ID() to separate the search values per session.

CREATE TABLE sgtins_filter (
    connection_id INT
  , sgtin varchar(255) NOT NULL
  , INDEX(connection_id, sgtin)
);

Then you can simply join the two tables:

SELECT sgtins_filter.sgtin
FROM sgtins_filter
INNER JOIN sgtins
ON
    sgtins_filter.sgtin = sgtins.sgtin
  AND
    sgtins_filter.connection_id = CONNECTION_ID()
  AND 
    sgtins.client_id = 4; 

see demo
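
Each session would first stage its own search values under its CONNECTION_ID() and clean them up afterwards; a minimal sketch, with placeholder values:

INSERT INTO sgtins_filter (connection_id, sgtin)
VALUES (CONNECTION_ID(), 'sgtin-0001'),
       (CONNECTION_ID(), 'sgtin-0002');

-- run the join query above, then remove this session's values
DELETE FROM sgtins_filter
WHERE connection_id = CONNECTION_ID();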

answered Sep 22 '22 by Raymond Nijland