Needed help in optimizing order by and count query, I have tables having millions (approx 3 millions) rows.
I have to join 4 tables and fetch the records, When i run the simple query it takes only millisecond to complete but as I try to count or order by having left join table it get stuck for unlimited of time.
Please see the cases below.
CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High
tbl_customers - #Rows: 20 million.
tbl_customers_address - #Row 25 million.
tbl_shop_setting - #Rows 50k
aio_customer_tracking - #Rows 5k
CREATE TABLE `tbl_customers` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
`shop_id` BIGINT(20) UNSIGNED NOT NULL,
`email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
`accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
`first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
`last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
`last_order_id` BIGINT(20) NULL DEFAULT NULL,
`total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
`phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
`verified_email` TINYINT(4) NULL DEFAULT NULL,
`updated_at` DATETIME NULL DEFAULT NULL,
`created_at` DATETIME NULL DEFAULT NULL,
`date_updated` DATETIME NULL DEFAULT NULL,
`date_created` DATETIME NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
INDEX `email` (`email`),
INDEX `shopify_customer_id` (`shopify_customer_id`),
INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
CREATE TABLE `tbl_customers_address` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`customer_id` BIGINT(20) NULL DEFAULT NULL,
`shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
`shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
`first_name` VARCHAR(50) NULL DEFAULT NULL,
`last_name` VARCHAR(50) NULL DEFAULT NULL,
`company` VARCHAR(50) NULL DEFAULT NULL,
`address1` VARCHAR(250) NULL DEFAULT NULL,
`address2` VARCHAR(250) NULL DEFAULT NULL,
`city` VARCHAR(50) NULL DEFAULT NULL,
`province` VARCHAR(50) NULL DEFAULT NULL,
`country` VARCHAR(50) NULL DEFAULT NULL,
`zip` VARCHAR(15) NULL DEFAULT NULL,
`phone` VARCHAR(20) NULL DEFAULT NULL,
`name` VARCHAR(50) NULL DEFAULT NULL,
`province_code` VARCHAR(5) NULL DEFAULT NULL,
`country_code` VARCHAR(5) NULL DEFAULT NULL,
`country_name` VARCHAR(50) NULL DEFAULT NULL,
`longitude` VARCHAR(250) NULL DEFAULT NULL,
`latitude` VARCHAR(250) NULL DEFAULT NULL,
`default` TINYINT(1) NULL DEFAULT NULL,
`is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
INDEX `customer_id` (`customer_id`),
INDEX `shopify_address_id` (`shopify_address_id`),
INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;
CREATE TABLE `tbl_shop_setting` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
PRIMARY KEY (`id`),
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
CREATE TABLE `aio_customer_tracking` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
`email` VARCHAR(255) NULL DEFAULT NULL,
`shop_id` BIGINT(20) UNSIGNED NOT NULL,
`domain` VARCHAR(255) NULL DEFAULT NULL,
`web_session_count` INT(11) NOT NULL,
`last_seen_date` DATETIME NULL DEFAULT NULL,
`last_contact_date` DATETIME NULL DEFAULT NULL,
`last_email_open` DATETIME NULL DEFAULT NULL,
`created_date` DATETIME NOT NULL,
`is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
INDEX `shopify_customer_id` (`shopify_customer_id`),
INDEX `email` (`email`),
INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;
1. Running: Below query fetch the records by joining all the 4 tables, It takes only 0.300 ms.
SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20
2. Not running: Simply when try to get the count of these row stuk the query, I waited 10 min but still running.
SELECT
COUNT(DISTINCT c.shopify_customer_id) -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20
3. Not running: In the #1 query we simply put the 1 Order by clause and it get stuck, I waited 10 min but still running. I study query optimization some article and tried by indexing, Right Join etc.. but still not working.
SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
ORDER BY `t`.`last_seen_date` -- what makes #3 different
LIMIT 20
EXPLAIN QUERY #1:
EXPLAIN QUERY #2:
EXPLAIN QUERY #3:
Any suggestion to optimize the query, table structure are welcome.
tbl_customers
table contains the customer info, tbl_customer_address
table contains the addresses of the customers(one customer may have multiple address), And aio_customer_tracking
table contains visiting records of the customer last_seen_date
is the visiting date.
Now, simply I want to fetch and count the customers, with their one of the address, and visiting info. Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order). Hope this explanation helps to understand what i am trying to do.
In query #1, but not the other two, the optimizer can use
UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)
to cut the query short for
GROUP BY c.shopify_customer_id
LIMIT 20
This is because it can stop after 20 items of the index. The query is not ultra-fast because of the derived table (subquery t
) that hits about 51K rows.
Query #2 may be slow simply because the Optimizer failed to notice and remove the redundant DISTINCT
. Instead, it may be thinking it can't stop after 20.
Query #3 must go entirely through table c
to get every shopify_customer_id
group. This is because the ORDER BY
prevents a short curcuit to get to the LIMIT 20
.
The columns in a GROUP BY
must include all the non-aggregate columns in the SELECT
except any that are uniquely defined by the group by columns. Since you have said that there can be multiple addresses for a single shopify_customer_id
, then fetching ca.address1
is not proper in connection with GROUP BY shopify_customer_id
. Similarly, the subquery seems to be improper with respect to last_seen_date, last_contact_date
.
In aio_customer_tracking
, this change (to a "covering" index) may help a little:
INDEX (`shopify_customer_id`)
to
INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)
Dissecting the goal
Now, simply I want to ... count the customers
To count the customers, do this, but don't try to combine it with "fetching":
SELECT COUNT(*) FROM tbl_customers;
Now, simply I want to fetch ... the customers...
tbl_customers - #Rows: 20 million.
Surely you don't want to fetch 20 million rows! I don't want to think about how to try to do that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE
clause?? The WHERE
clause is (usually) the most important part of Optimization!
Now, simply I want to fetch ... the customers, with their one of the address, and visiting info.
Assuming that the WHERE
filters down to a "few" customers, then JOINing
to another table to get "any" address and "any" visiting info, may be problematical and/or inefficient. To require the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.
May I suggest that your UI first find a few customers, then if the user wants, go to another page with all the addresses and all the visits. Or can the visits be in the hundreds or more?
Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order).
Let's focus on optimizing the WHERE
, then tack last_seen_date
on the end of any index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With