SELECT DISTINCT statement in MySQL is taking 10 minutes

Tags:

mysql

I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:

SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);

However, the select statement is taking around 10 minutes, so something is clearly afoot.

One significant factor is that the table gtfsstop_times is huge (~250 million records).

Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:

gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
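
The index claim above can be checked rather than assumed. A minimal sketch, using only the table and column names from the query: SHOW INDEX lists the indexes defined on a table, and EXPLAIN reports which index (if any) the optimizer picks for each table and in what join order. No output is shown because it depends on the actual schema.

```sql
-- List the indexes defined on the two largest tables in the join.
SHOW INDEX FROM `vehicledata`.gtfsstop_times;
SHOW INDEX FROM `vehicledata`.gtfstrips;

-- Ask the optimizer for its plan without running the query.
EXPLAIN
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints AS sp
INNER JOIN `vehicledata`.gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips AS trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes AS route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys AS agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
```

In the EXPLAIN result, a NULL in the `key` column means no index is used for that table, and "Using temporary" in the `Extra` column usually indicates that DISTINCT is being resolved through a temporary table.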

The server has 22 GB of memory, I've set the InnoDB buffer pool to 8 GB, and I'm using MySQL 5.6.

Can anybody see a way of making this run faster? Or indeed, at all!

Does it matter that the stoppoints table is in a different schema?

EDIT: I ran EXPLAIN SELECT... on the query; the output was posted as an image, which is not reproduced here.

asked Apr 15 '13 15:04

2 Answers

It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?

It looks like atcoCode is a unique key for your stoppoints table. Is that right?

If so, try this:

SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
  FROM `transportdata`.stoppoints AS sp
  JOIN ( 
     SELECT DISTINCT st.fk_atco_code AS atcoCode
       FROM `vehicledata`.gtfsroutes AS route
       JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
       JOIN `vehicledata`.gtfsstop_times AS st  ON trip.trip_id = st.trip_id
       WHERE route.agency_id BETWEEN 1 AND 4
  ) ids ON sp.atcoCode = ids.atcoCode;

This does a few things. It eliminates a table (agency) which you don't seem to need, because route.agency_id already carries the value you are filtering on. It changes the search on agency_id from IN (1,2,3,4) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from the outer query, where it has to handle four columns for every joined row, to a subquery where it only has to handle the ID values.

(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)

This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
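
Another way to keep DISTINCT away from the wide rows is to drop it altogether and test for existence instead. This is a sketch of an alternative, not part of the query above. It returns the same stop points only under the same assumptions: atcoCode is unique in stoppoints, and every route.agency_id refers to an existing agency. Whether it is faster on MySQL 5.6 depends on the indexes and the plan, so compare the EXPLAIN output of both.

```sql
-- Return each stop point once if at least one stop_time row for it
-- belongs to a trip on a route run by agencies 1 to 4.
SELECT sp.atcoCode, sp.name, sp.longitude, sp.latitude
  FROM `transportdata`.stoppoints AS sp
 WHERE EXISTS (
       SELECT 1
         FROM `vehicledata`.gtfsstop_times AS st
         JOIN `vehicledata`.gtfstrips  AS trip  ON trip.trip_id = st.trip_id
         JOIN `vehicledata`.gtfsroutes AS route ON route.route_id = trip.route_id
        WHERE st.fk_atco_code = sp.atcoCode   -- correlate with the outer stop point
          AND route.agency_id IN (1, 2, 3, 4)
 );
```

Because the subquery can stop at the first matching row for each stop point, it never builds the full joined result that DISTINCT would otherwise have to deduplicate.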

answered Oct 26 '22 10:10

O. Jones

With 250M records, I would shard the gtfsstop_times table on one column. Each shard can then be joined in a separate query, and those queries can run in parallel on separate threads; you only need to merge the result sets afterwards.
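
A rough sketch of what the per-shard queries could look like. The shard table names (gtfsstop_times_0, gtfsstop_times_1, and so on) and the number of shards are hypothetical; the shard column is left open, as it is above. A single MySQL statement is not split across threads in 5.6, so each query has to be issued on its own connection by the client.

```sql
-- Connection 1: stop codes found in the first (hypothetical) shard.
SELECT DISTINCT st.fk_atco_code
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times_0 AS st ON st.trip_id = trip.trip_id
 WHERE route.agency_id IN (1, 2, 3, 4);

-- Connection 2: the same query against the second (hypothetical) shard.
SELECT DISTINCT st.fk_atco_code
  FROM `vehicledata`.gtfsroutes AS route
  JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
  JOIN `vehicledata`.gtfsstop_times_1 AS st ON st.trip_id = trip.trip_id
 WHERE route.agency_id IN (1, 2, 3, 4);

-- ...and so on, one connection per shard.
```

The client then takes the union of the returned ID lists and looks the stop points up in stoppoints. Unless the table is sharded on fk_atco_code, the same stop code can come back from more than one shard, so the merge step has to remove duplicates.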

answered Oct 26 '22 08:10

András Hummer

Tags: performance, mysql

Asked by: Carlos P