When I add DISTINCT to my query, the reported query duration increases from 0.015 s to over 6 seconds.
I want to join several tables that are linked via foreign keys and select a distinct column from the result:
select distinct table3.idtable3 from
table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1
The DISTINCT query takes over 6 seconds, which seems like it should be improvable.
With the plain SELECT:
duration : 0.015s / fetch:5.532s (5.760.434 rows)
Explain:
id | select_type | table  | partitions | type  | possible_keys           | key                     | key_len | ref                  | rows  | filtered | Extra
1  | SIMPLE      | table1 | NULL       | index | asd                     | asd                     | 137     | NULL                 | 10    | 10.00    | Using where; Using index
1  | SIMPLE      | table2 | NULL       | ALL   | idtable2                | NULL                    | NULL    | NULL                 | 200   | 25.00    | Using where; Using join buffer (Block Nested Loop)
1  | SIMPLE      | table3 | NULL       | ref   | fkey_table2_table_3_idx | fkey_table2_table_3_idx | 138     | mydb.table2.idtable2 | 66641 | 100.00   | NULL
With SELECT DISTINCT:
duration : 6.625s / fetch:0.000s (1000 rows)
Explain:
id | select_type | table  | partitions | type  | possible_keys           | key                     | key_len | ref                  | rows  | filtered | Extra
1  | SIMPLE      | table1 | NULL       | index | asd                     | asd                     | 137     | NULL                 | 10    | 10.00    | Using where; Using index; Using temporary
1  | SIMPLE      | table2 | NULL       | ALL   | idtable2                | NULL                    | NULL    | NULL                 | 200   | 25.00    | Using where; Using join buffer (Block Nested Loop)
1  | SIMPLE      | table3 | NULL       | ref   | fkey_table2_table_3_idx | fkey_table2_table_3_idx | 138     | mydb.table2.idtable2 | 66641 | 100.00   | NULL
Database: the schema is included in the MCRE below.
Code for testing / MCRE:
import mysql.connector
import time
import numpy as np
"""
-- MySQL Script generated by MySQL Workbench
-- Fri Jan 17 12:19:26 2020
-- Model: New Model Version: 1.0
-- MySQL Workbench Forward Engineering
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION';
-- -----------------------------------------------------
-- Schema mydb
-- -----------------------------------------------------
-- -----------------------------------------------------
-- Schema mydb
-- -----------------------------------------------------
CREATE SCHEMA IF NOT EXISTS `mydb` DEFAULT CHARACTER SET utf8 ;
USE `mydb` ;
-- -----------------------------------------------------
-- Table `mydb`.`table1`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table1` (
`idtable1` VARCHAR(45) NOT NULL,
INDEX `asd` (`idtable1` ASC) VISIBLE)
ENGINE = InnoDB;
-- -----------------------------------------------------
-- Table `mydb`.`table2`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table2` (
`idtable2` VARCHAR(45) NOT NULL,
`fkey` VARCHAR(45) NULL,
INDEX `link_table1_table2_idx` (`fkey` ASC) INVISIBLE,
INDEX `idtable2` (`idtable2` ASC) VISIBLE,
CONSTRAINT `link_table1_table2`
FOREIGN KEY (`fkey`)
REFERENCES `mydb`.`table1` (`idtable1`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB;
-- -----------------------------------------------------
-- Table `mydb`.`table3`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table3` (
`idtable3` VARCHAR(45) NOT NULL,
`fkey` VARCHAR(45) NULL,
INDEX `fkey_table2_table_3_idx` (`fkey` ASC) VISIBLE,
CONSTRAINT `fkey_table2_table_3`
FOREIGN KEY (`fkey`)
REFERENCES `mydb`.`table2` (`idtable2`)
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB;
SET SQL_MODE=@OLD_SQL_MODE;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;
"""
def insertData():
    # Populate the three tables with random test data (runs twice, so the
    # tables deliberately contain duplicate values).
    for _ in range(2):
        num_distinct_table1_values = 5
        num_distinct_table2_values = 10
        num_distinct_table3_values = 1000
        num_entries_table1 = num_distinct_table1_values
        num_entries_table2 = num_distinct_table2_values * 10
        num_entries_table3 = num_distinct_table3_values * 300
        random_numbers_table1_id = range(num_distinct_table1_values)
        random_numbers_table2_id = np.random.randint(num_distinct_table2_values, size=num_entries_table2)
        random_numbers_table2_fkey = np.random.randint(num_distinct_table1_values, size=num_entries_table2)
        random_numbers_table3_id = np.random.randint(num_distinct_table3_values, size=num_entries_table3)
        random_numbers_table3_fkey = np.random.randint(num_distinct_table2_values, size=num_entries_table3)
        # Build the VALUES lists for the bulk INSERT statements.
        value_string_table1 = ','.join(f"('{i_name}')" for i_name in random_numbers_table1_id)
        value_string_table2 = ','.join(
            f"('{random_numbers_table2_id[i]}','{random_numbers_table2_fkey[i]}')"
            for i in range(num_entries_table2))
        value_string_table3 = ','.join(
            f"('{random_numbers_table3_id[i]}','{random_numbers_table3_fkey[i]}')"
            for i in range(num_entries_table3))
        # fill table 1
        mySql_insert_query = f"INSERT INTO table1 (idtable1) VALUES {value_string_table1}"
        cursor.execute(mySql_insert_query)
        conn.commit()
        print("Done table 1")
        # fill table 2
        mySql_insert_query = f"INSERT INTO table2 (idtable2, fkey) VALUES {value_string_table2}"
        cursor.execute(mySql_insert_query)
        print("Done table 2")
        # fill table 3
        mySql_insert_query = f"INSERT INTO table3 (idtable3, fkey) VALUES {value_string_table3}"
        cursor.execute(mySql_insert_query)
        print("Done table 3")
        conn.commit()

conn = mysql.connector.connect(user='root', password='admin', host='127.0.0.1',
                               database='mydb', raise_on_warnings=True, autocommit=False)
cursor = conn.cursor()
insertData()
conn.close()
Very few queries run faster with SELECT DISTINCT, and a few run slower (though not significantly slower). In the latter case, however, the application would otherwise likely have to examine the duplicate rows itself, which shifts the performance and complexity burden onto the application.
Yes, every record has to be compared against the cache of "distinct" records as the query runs. You can improve performance by using an index, particularly on numeric and date columns.
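For the query in the question, for example (a sketch only; the index name idx_fkey_idtable3 is made up and the right index depends on your data), a covering index on table3 would let the server resolve both the join condition and the DISTINCT column from the index alone:
-- Hypothetical covering index: (fkey, idtable3) serves both the join
-- condition on table3.fkey and the selected column table3.idtable3.
ALTER TABLE mydb.table3
  ADD INDEX idx_fkey_idtable3 (fkey, idtable3);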
DISTINCT filters the result down to unique records; it removes the duplicate rows. A SELECT DISTINCT will be the same speed as, or faster than, the equivalent GROUP BY.
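For reference, here is the query from the question next to its GROUP BY equivalent (same schema, same result set; purely illustrative):
-- DISTINCT form (as in the question)
select distinct table3.idtable3
from table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1;

-- equivalent GROUP BY form
select table3.idtable3
from table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1
group by table3.idtable3;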
Summary. DISTINCT makes a query return unique rows only, and is often used injudiciously to suppress duplicate rows being returned by a bad query. Don't use DISTINCT to cover up errors.
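A generic illustration of that kind of misuse (the customers and orders tables here are made up, not part of the question's schema):
-- Misuse: the join to orders fans out one row per order, and DISTINCT
-- is then used to hide the duplication.
SELECT DISTINCT c.id, c.name
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id;

-- Better: express the actual question (customers with at least one order)
-- so no duplicates are produced in the first place.
SELECT c.id, c.name
FROM customers AS c
WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id);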
Thanks for the CREATE TABLEs; you might never have gotten an Answer without them.

Every table should have a PRIMARY KEY. If you have a column (or combination of columns) that 'naturally' works, use that. Else use an AUTO_INCREMENT.

INDEX(fkey) on table2 is INVISIBLE, hence not used. Don't waste learning time on VISIBLE/INVISIBLE; you may never need them in your career.
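As a sketch of those two points only (the surrogate id columns below are made-up additions, not part of the original schema; the index name comes from the MCRE):
-- Hypothetical sketch: give each table a surrogate PRIMARY KEY.
ALTER TABLE mydb.table1 ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
ALTER TABLE mydb.table2 ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
ALTER TABLE mydb.table3 ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

-- Make the foreign-key index on table2 usable again (it is declared INVISIBLE).
ALTER TABLE mydb.table2 ALTER INDEX link_table1_table2_idx VISIBLE;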
And...

duration : 0.015s / fetch:5.532s (5.760.434 rows)
duration : 6.625s / fetch:0.000s (1000 rows)

Notice how both are about 6 seconds; the time is just split up differently.

Without DISTINCT, the query can pump out the data immediately, but it takes a long time due to network latency while the 5.7M rows are fetched.

With DISTINCT, the first row cannot come out until after performing the "de-duplication", which probably involves a "temporary" (see the EXPLAIN) and a sort. So now all the time is spent computing before any data is sent.

The DISTINCT one is slightly slower (total time) because of the extra step of collecting and sorting 5.7M rows.
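If you want to compare the server-side cost of the two without the network fetch dominating, one rough way (my sketch, not part of the answer above) is to aggregate on the server so only a single row comes back in each case:
-- Sketch: both queries return one row, so fetch time is negligible and
-- the reported duration is essentially the server-side work.
select count(*)
from table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1;

select count(distinct table3.idtable3)
from table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1;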