I have the following People table:
| Id | FirstName | Children |
|----|-----------|----------|
| 1 | mark | 4 |
| 2 | paul | 0 |
| 3 | mike | 3 |
Note I have an non-unique index in FirstName and another one in Children.
I need to get the top 10000 first names and the children amount of every person that has any children. So I decided to go for this solution:
SELECT firstName, children FROM people
WHERE children > 0
ORDER BY children DESC
LIMIT 0, 10000
The thing is that it takes 4 seconds to return the results from a table with 2.6 million records. This is the explain:
| ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | EXTRA |
|----|-------------|--------|-------|---------------|----------|---------|--------|------------|-------------|
| 1 | SIMPLE | people | range | children | children | 4 | (null) | 2677610 | Using where |
As I see it, the range is telling me the index is being scanned and compared to a value (in this case this is the children > 0). I'd say that should be fast enough. Then, my guess is that after getting all those matching index elements, the DBMS fetches the firstName from the table by internally joining the values in the index with the ones in the table.
If I translate the previous paragraph into SQL I would get something like this:
SELECT firstName, children FROM people
JOIN (
SELECT id FROM people
WHERE children > 0
ORDER BY children DESC
LIMIT 0, 10000
) s
ON people.id = s.id
ORDER BY children DESC
The explain for the previous SQL statement is:
| ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | EXTRA |
|----|-------------|------------|--------|---------------|----------|---------|--------|---------|---------------------------------|
| 1 | PRIMARY | <derived2> | ALL | (null) | (null) | (null) | (null) | 10000 | Using temporary; Using filesort |
| 1 | PRIMARY | p | eq_ref | PRIMARY | PRIMARY | 4 | s.id | 1 | |
| 2 | DERIVED | people | range | children | children | 4 | (null) | 2687462 | Using where; Using index |
To my surprise, this query performs a few times faster than the first one. However, the more I increment the LIMIT X the bigger this difference becomes (EG: For LIMIT 1000000, 10000 the second query is still under 1 second and the first one exceeds 20 seconds). This leads me to the following questions:
Additional notes:
I'm pretty sure that you can fix the performance of the first query by having an index on children, firstname
. This is a covering index for the query, so it should eliminate accesses to the data pages.
The first execution plan says that the index is being used for the where
. The limit
is applied last, so it seems that it is getting the firstname
for all the rows before the limit
is applied. That seems weird, but it is consistent with the performance you are seeing.
In the second version, 10000 ids are being read. Assuming they are primary keys, then the data page look ups should be quite fast -- and controlled explicitly by the limit. This may suggest why this version is faster, although it seems like a bit of a mystery. Mostly, though, I would expect the index on children, firstname
to improve the first version of the query.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With