
Search by date: MySQL performance

Tags:

mysql

I have a large table with about 100 million records, with DATE-type fields start_date and end_date. I need to count how many records overlap a given date range, say between 2013-08-20 and 2013-08-30, so I use:

SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-08-20' 
AND start_date <= '2013-08-30'

The date columns are indexed. The important point is that the date ranges I search for overlaps with are always in the future, while the bulk of the records in the table (say 97-99 million) are in the past. So, will this query be faster if I add a TINYINT column is_future and check that condition as well, like this:

SELECT COUNT(*) FROM myTable WHERE is_future = 1 
AND end_date >= '2013-08-20' AND start_date <= '2013-08-30'

so that it excludes the 97 million or so past records and checks the date condition only on the remaining 1-3 million?
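What I have in mind is something like this (just a sketch; the column and index names are illustrative):

-- proposed flag column plus an index on it;
-- the flag would have to be kept current, e.g. by a nightly job
ALTER TABLE myTable
  ADD COLUMN is_future TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_is_future (is_future);

-- nightly maintenance: clear the flag on records that have passed
UPDATE myTable SET is_future = 0
WHERE is_future = 1 AND end_date < CURDATE();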

I use MySQL

Thanks

EDIT

The MySQL engine is InnoDB, but would it matter considerably if it were, say, MyISAM?

Here is the create table:

CREATE TABLE `orders` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL, -- type assumed; truncated in the original dump
  `start_date` date DEFAULT NULL,
  `end_date` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

EDIT 2 (after @Robert Co's answer)

Partitioning looks like a good idea for this case, but MySQL does not allow me to partition on the is_future field unless that field is part of the primary key, and I cannot drop my main primary key, id. And if I do make is_future part of the primary key, is there still any point in partitioning - wouldn't the search already be fast because is_future is indexed as part of the primary key?
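If I understand the docs correctly, it would have to look something like this (a sketch only, assuming the is_future column from above; MySQL requires the partitioning column to be part of every unique key, so the primary key becomes composite):

-- widen the PK so it contains the partitioning column;
-- id stays first, so AUTO_INCREMENT still works
ALTER TABLE orders
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, is_future);

-- split past and future rows into separate partitions
ALTER TABLE orders
PARTITION BY LIST (is_future) (
  PARTITION p_past   VALUES IN (0),
  PARTITION p_future VALUES IN (1)
);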

EDIT 3 The actual query where I need this selects the restaurants that still have free tables for the given date range:

SELECT r.id, r.name, r.table_count
FROM restaurants r
LEFT JOIN orders o 
ON r.id = o.restaurant_id 
WHERE o.id IS NULL 
OR (r.table_count > (SELECT COUNT(*) 
                FROM orders o2 
                WHERE o2.restaurant_id = r.id AND
                end_date >= '2013-08-20' AND start_date <= '2013-08-30'
                AND o2.status = 1
            )
) 

SOLUTION After a lot more research and testing, the fastest way to count the rows in my case turned out to be simply adding one more condition - that start_date is greater than the current date (since the date ranges I search for are always in the future):

 SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-09-01' 
         AND start_date >= '2013-08-20' AND start_date <= '2013-09-30'

It is also necessary to have a single composite index covering the start_date and end_date fields (thank you @symcbean). As a result, the execution time on a table with 10 million rows dropped from 7 seconds to 0.050 seconds.
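The index is simply a composite one over both date columns (name illustrative; I list start_date first since my final query bounds it from both sides):

ALTER TABLE myTable ADD INDEX idx_start_end (start_date, end_date);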

SOLUTION 2 (@Robert Co) Partitioning worked well in this case too! It is perhaps an even better solution than indexing, or the two can be applied together.

Thanks

asked Aug 25 '13 by dav

2 Answers

This is a perfect use case for table partitioning. If the Oracle INTERVAL feature ever makes it into MySQL, it will just add to the awesomeness.
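For example, something along these lines (a sketch; the yearly boundaries are illustrative, and as the question's EDIT 2 notes, the partitioning column must be part of every unique key, including the primary key):

-- yearly range partitions on the date column
ALTER TABLE orders
PARTITION BY RANGE COLUMNS (start_date) (
  PARTITION p2012 VALUES LESS THAN ('2013-01-01'),
  PARTITION p2013 VALUES LESS THAN ('2014-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

With Oracle's INTERVAL feature those yearly partitions would be created automatically; in MySQL you have to add new ones by hand (or from a scheduled script) as time moves on.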

answered Oct 03 '22 by Robert Co


date column are indexed

What type of index? A hash-based index is no use for range queries. If it's not a BTREE index, then change it now. And you've not shown us how the columns are indexed: are both columns in the same index? Is there other stuff in there too? In what order (end_date must appear as the first column)?
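You can check with SHOW INDEX (its Index_type column says BTREE or HASH) and add a suitable composite index if one is missing - a sketch, assuming the table names from the question:

SHOW INDEX FROM myTable;

-- composite BTREE index with end_date leading
ALTER TABLE myTable
  ADD INDEX idx_end_start (end_date, start_date) USING BTREE;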

There are implicit type conversions in the query - these should be handled automatically by the optimizer, but it's worth checking:

SELECT COUNT(*) FROM myTable WHERE end_date >= 20130820000000 
AND start_date <= 20130830235959

if I add a column is_future - TINYINT

First, in order to be of any use, this would require that the future dates be a small proportion of the total data stored in the table (less than 10%). And that's just to make it more efficient than a full table scan.

Secondly, it's going to require very frequent updates to the index to maintain it, which, in addition to the overhead of the initial population, is likely to lead to fragmentation of the index and degraded performance (depending on how the index is constructed).

Thirdly, if this still has to process 3 million rows of data (and specifically, via an index lookup) then it's going to be very slow even with the data pegged in memory.

Further, the optimizer is never likely to use this index without being forced to (due to the low cardinality).
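Whatever you end up with, EXPLAIN will tell you whether the optimizer actually uses the index (the key column) and roughly how many rows it expects to examine (the rows column):

EXPLAIN SELECT COUNT(*) FROM myTable
WHERE is_future = 1
  AND end_date >= '2013-08-20' AND start_date <= '2013-08-30';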

answered Oct 03 '22 by symcbean