Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SELECT DISTINCT statement in MySQL is taking 10 minutes

I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:

SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);

However, the select statement is taking around 10 minutes, so something is clearly afoot.

One significant factor is that the table gtfsstop_times is huge. (~250 million records)

Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:

gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows

The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.

Can anybody see a way of making this run faster? Or indeed, at all!

Does it matter that the stoppoints table is in a different schema?

EDIT: EXPLAIN SELECT... returns this:

enter image description here

like image 614
Carlos P Avatar asked Apr 15 '13 15:04

Carlos P


People also ask

Does SELECT distinct take longer?

Very few queries may perform faster in SELECT DISTINCT mode, and very few will perform slower (but not significantly slower) in SELECT DISTINCT mode but for the later case it is likely that the application may need to examine the duplicate cases, which shifts the performance and complexity burden to the application.

Which is faster SELECT or SELECT distinct?

DISTINCT is used to filter unique records out of all records in the table. It removes the duplicate rows. SELECT DISTINCT will always be the same, or faster than a GROUP BY.

How do I SELECT distinct data in MySQL?

To get unique or distinct values of a column in MySQL Table, use the following SQL Query. SELECT DISTINCT(column_name) FROM your_table_name; You can select distinct values for one or more columns. The column names has to be separated with comma.

Does distinct take the first?

Conclusion. Bare usage of DISTINCT will return the first occurrence. However, it can work either way by sorting the initial results first before conducting the distinction step.


2 Answers

It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?

It looks like atcoCode is a unique key for your stoppoints table. Is that right?

If so, try this:

SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
  FROM `transportdata`.stoppoints` AS sp
  JOIN ( 
     SELECT DISTINCT st.fk_atco_code AS atcoCode
       FROM `vehicledata`.gtfsroutes AS route
       JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
       JOIN `vehicledata`.gtfsstop_times AS st  ON trip.trip_id = st.trip_id
       WHERE route.agency_id BETWEEN 1 AND 4
  ) ids ON sp.atcoCode = ids.atcoCode

This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.

(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)

This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.

like image 171
O. Jones Avatar answered Oct 26 '22 10:10

O. Jones


Having 250M records, I would shard the gtfsstop_times table on one column. Then each sharded table can be joined in a separate query that can run parallel in separate threads, you'll only need to merge the result sets.

like image 28
András Hummer Avatar answered Oct 26 '22 08:10

András Hummer