Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching the 5 closest locations to a zip code - what way should I go?

What I want:

  1. The user passes in a zip code or city name
  2. I search my database for the 5 closest locations
  3. Display the 5 closest locations near that position to the user

What I have so far:

Let's say a table of places with the following content:

(about 16000 rows)

CREATE TABLE `locations` (
 `locationID` int(11) NOT NULL AUTO_INCREMENT,
 `name` varchar(150) NOT NULL,
 `firstname` varchar(100) DEFAULT NULL,
 `lastname` varchar(100) DEFAULT NULL,
 `street` varchar(100) NOT NULL,
 `city` varchar(100) NOT NULL,
 `state` varchar(100) NOT NULL,
 `zipcode` varchar(10) NOT NULL,
 `phone` varchar(20) NOT NULL,
 `web` varchar(255) DEFAULT NULL,
 `machine` enum('Unbekannt','Foo','Bar') DEFAULT 'Unbekannt',
 `surface` enum('Unbekannt','Foo','Bar','') DEFAULT 'Unbekannt',
 PRIMARY KEY (`locationID`)
) ENGINE=InnoDB AUTO_INCREMENT=25 DEFAULT CHARSET=utf8
  1. ID
  2. name
  3. zip code
  4. city

Now I've got a second table with all towns of the world:

(about 3.4 million rows)

CREATE TABLE `geoData` (
 `geoID` int(11) NOT NULL AUTO_INCREMENT,
 `countryCode` char(2) NOT NULL,
 `zipCode` varchar(20) NOT NULL,
 `name` varchar(180) NOT NULL,
 `state` varchar(100) NOT NULL,
 `stateCode` varchar(20) NOT NULL,
 `county` varchar(100) NOT NULL,
 `countyCode` varchar(20) NOT NULL,
 `community` varchar(100) NOT NULL,
 `communityCode` varchar(20) NOT NULL,
 `lat` mediumint(6) NOT NULL,
 `lon` mediumint(6) NOT NULL,
 PRIMARY KEY (`lon`,`lat`,`geoID`) USING BTREE,
 KEY `geoID` (`geoID`)
) ENGINE=InnoDB AUTO_INCREMENT=16482 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (lat)
(PARTITION p0 VALUES LESS THAN (-880000) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (-860000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (-840000) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (-820000) ENGINE = InnoDB,
PARTITION p4 VALUES LESS THAN (-800000) ENGINE = InnoDB,
PARTITION p5 VALUES LESS THAN (-780000) ENGINE = InnoDB,
PARTITION p6 VALUES LESS THAN (-760000) ENGINE = InnoDB,
PARTITION p7 VALUES LESS THAN (-740000) ENGINE = InnoDB,
PARTITION p8 VALUES LESS THAN (-720000) ENGINE = InnoDB,
PARTITION p9 VALUES LESS THAN (-700000) ENGINE = InnoDB,
PARTITION p10 VALUES LESS THAN (-680000) ENGINE = InnoDB,
PARTITION p11 VALUES LESS THAN (-660000) ENGINE = InnoDB,
PARTITION p12 VALUES LESS THAN (-640000) ENGINE = InnoDB,
PARTITION p13 VALUES LESS THAN (-620000) ENGINE = InnoDB,
PARTITION p14 VALUES LESS THAN (-600000) ENGINE = InnoDB,
PARTITION p15 VALUES LESS THAN (-580000) ENGINE = InnoDB,
PARTITION p16 VALUES LESS THAN (-560000) ENGINE = InnoDB,
PARTITION p17 VALUES LESS THAN (-540000) ENGINE = InnoDB,
PARTITION p18 VALUES LESS THAN (-520000) ENGINE = InnoDB,
PARTITION p19 VALUES LESS THAN (-500000) ENGINE = InnoDB,
PARTITION p20 VALUES LESS THAN (-480000) ENGINE = InnoDB,
PARTITION p21 VALUES LESS THAN (-460000) ENGINE = InnoDB,
PARTITION p22 VALUES LESS THAN (-440000) ENGINE = InnoDB,
PARTITION p23 VALUES LESS THAN (-420000) ENGINE = InnoDB,
PARTITION p24 VALUES LESS THAN (-400000) ENGINE = InnoDB,
PARTITION p25 VALUES LESS THAN (-380000) ENGINE = InnoDB,
PARTITION p26 VALUES LESS THAN (-360000) ENGINE = InnoDB,
PARTITION p27 VALUES LESS THAN (-340000) ENGINE = InnoDB,
PARTITION p28 VALUES LESS THAN (-320000) ENGINE = InnoDB,
PARTITION p29 VALUES LESS THAN (-300000) ENGINE = InnoDB,
PARTITION p30 VALUES LESS THAN (-280000) ENGINE = InnoDB,
PARTITION p31 VALUES LESS THAN (-260000) ENGINE = InnoDB,
PARTITION p32 VALUES LESS THAN (-240000) ENGINE = InnoDB,
PARTITION p33 VALUES LESS THAN (-220000) ENGINE = InnoDB,
PARTITION p34 VALUES LESS THAN (-200000) ENGINE = InnoDB,
PARTITION p35 VALUES LESS THAN (-180000) ENGINE = InnoDB,
PARTITION p36 VALUES LESS THAN (-160000) ENGINE = InnoDB,
PARTITION p37 VALUES LESS THAN (-140000) ENGINE = InnoDB,
PARTITION p38 VALUES LESS THAN (-120000) ENGINE = InnoDB,
PARTITION p39 VALUES LESS THAN (-100000) ENGINE = InnoDB,
PARTITION p40 VALUES LESS THAN (-80000) ENGINE = InnoDB,
PARTITION p41 VALUES LESS THAN (-60000) ENGINE = InnoDB,
PARTITION p42 VALUES LESS THAN (-40000) ENGINE = InnoDB,
PARTITION p43 VALUES LESS THAN (-20000) ENGINE = InnoDB,
PARTITION p44 VALUES LESS THAN (0) ENGINE = InnoDB,
PARTITION p45 VALUES LESS THAN (20000) ENGINE = InnoDB,
PARTITION p46 VALUES LESS THAN (40000) ENGINE = InnoDB,
PARTITION p47 VALUES LESS THAN (60000) ENGINE = InnoDB,
PARTITION p48 VALUES LESS THAN (80000) ENGINE = InnoDB,
PARTITION p49 VALUES LESS THAN (100000) ENGINE = InnoDB,
PARTITION p50 VALUES LESS THAN (120000) ENGINE = InnoDB,
PARTITION p51 VALUES LESS THAN (140000) ENGINE = InnoDB,
PARTITION p52 VALUES LESS THAN (160000) ENGINE = InnoDB,
PARTITION p53 VALUES LESS THAN (180000) ENGINE = InnoDB,
PARTITION p54 VALUES LESS THAN (200000) ENGINE = InnoDB,
PARTITION p55 VALUES LESS THAN (220000) ENGINE = InnoDB,
PARTITION p56 VALUES LESS THAN (240000) ENGINE = InnoDB,
PARTITION p57 VALUES LESS THAN (260000) ENGINE = InnoDB,
PARTITION p58 VALUES LESS THAN (280000) ENGINE = InnoDB,
PARTITION p59 VALUES LESS THAN (300000) ENGINE = InnoDB,
PARTITION p60 VALUES LESS THAN (320000) ENGINE = InnoDB,
PARTITION p61 VALUES LESS THAN (340000) ENGINE = InnoDB,
PARTITION p62 VALUES LESS THAN (360000) ENGINE = InnoDB,
PARTITION p63 VALUES LESS THAN (380000) ENGINE = InnoDB,
PARTITION p64 VALUES LESS THAN (400000) ENGINE = InnoDB,
PARTITION p65 VALUES LESS THAN (420000) ENGINE = InnoDB,
PARTITION p66 VALUES LESS THAN (440000) ENGINE = InnoDB,
PARTITION p67 VALUES LESS THAN (460000) ENGINE = InnoDB,
PARTITION p68 VALUES LESS THAN (480000) ENGINE = InnoDB,
PARTITION p69 VALUES LESS THAN (500000) ENGINE = InnoDB,
PARTITION p70 VALUES LESS THAN (520000) ENGINE = InnoDB,
PARTITION p71 VALUES LESS THAN (540000) ENGINE = InnoDB,
PARTITION p72 VALUES LESS THAN (560000) ENGINE = InnoDB,
PARTITION p73 VALUES LESS THAN (580000) ENGINE = InnoDB,
PARTITION p74 VALUES LESS THAN (600000) ENGINE = InnoDB,
PARTITION p75 VALUES LESS THAN (620000) ENGINE = InnoDB,
PARTITION p76 VALUES LESS THAN (640000) ENGINE = InnoDB,
PARTITION p77 VALUES LESS THAN (660000) ENGINE = InnoDB,
PARTITION p78 VALUES LESS THAN (680000) ENGINE = InnoDB,
PARTITION p79 VALUES LESS THAN (700000) ENGINE = InnoDB,
PARTITION p80 VALUES LESS THAN (720000) ENGINE = InnoDB,
PARTITION p81 VALUES LESS THAN (740000) ENGINE = InnoDB,
PARTITION p82 VALUES LESS THAN (760000) ENGINE = InnoDB,
PARTITION p83 VALUES LESS THAN (780000) ENGINE = InnoDB,
PARTITION p84 VALUES LESS THAN (800000) ENGINE = InnoDB,
PARTITION p85 VALUES LESS THAN (820000) ENGINE = InnoDB,
PARTITION p86 VALUES LESS THAN (840000) ENGINE = InnoDB,
PARTITION p87 VALUES LESS THAN (860000) ENGINE = InnoDB,
PARTITION p88 VALUES LESS THAN (880000) ENGINE = InnoDB,
PARTITION p89 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
  1. ID
  2. city
  3. zip code
  4. latitude
  5. longitude

based on this article and some other reading up on that matter I have a stored procedure which is giving me n locations/zip codes of the closest towns near a point (latitude/longitude).

My stored procedure:

    BEGIN
    DECLARE _deg2rad DOUBLE DEFAULT PI()/1800000;

    SET @my_lat := _my_lat,
        @my_lon := _my_lon,
        @deg2dist := 0.0111325,  
        @start_deg := _start_dist / @deg2dist,  
        @max_deg := _max_dist / @deg2dist,
        @cutoff := @max_deg / SQRT(2),  
        @dlat := @start_deg,  
        @lon2lat := COS(_deg2rad * @my_lat),
        @iterations := 0;        

    SET @sql = CONCAT(
        "SELECT COUNT(*) INTO @near_ct
            FROM geoData
            WHERE lat    BETWEEN @my_lat - @dlat
                             AND @my_lat + @dlat   
              AND lon    BETWEEN @my_lon - @dlon
                             AND @my_lon + @dlon");
    PREPARE _sql FROM @sql;
    MainLoop: LOOP
        SET @iterations := @iterations + 1;
        SET @dlon := ABS(@dlat / @lon2lat);  
        SET @dlon := IF(ABS(@my_lat) + @dlat >= 900000, 3600001, @dlon);  
        EXECUTE _sql;
        IF ( @near_ct >= _limit OR         
             @dlat >= @cutoff ) THEN       
            LEAVE MainLoop;
        END IF;
        SET @dlat := LEAST(2 * @dlat, @cutoff);   
    END LOOP MainLoop;
    DEALLOCATE PREPARE _sql;

    SET @dlat := IF( @dlat >= @max_deg OR @dlon >= 1800000,
                @max_deg,
                GCDist(ABS(@my_lat), @my_lon,
                       ABS(@my_lat) - @dlat, @my_lon - @dlon) );
    SET @dlon := IFNULL(ASIN(SIN(_deg2rad * @dlat) /
                             COS(_deg2rad * @my_lat))
                            / _deg2rad 
                        , 3600001);    


    IF (ABS(@my_lon) + @dlon < 1800000 OR    
        ABS(@my_lat) + @dlat <  900000) THEN 
        SET @sql = CONCAT(
            "SELECT *,
                    @deg2dist * GCDist(@my_lat, @my_lon, lat, lon) AS dist
                FROM geoData
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat   
                  AND lon BETWEEN @my_lon - @dlon
                              AND @my_lon + @dlon   
                HAVING dist <= ", _max_dist, "
                ORDER BY dist
                LIMIT ", _limit
                        );
    ELSE
        SET @west_lon := IF(@my_lon < 0, @my_lon, @my_lon - 3600000);
        SET @east_lon := @west_lon + 3600000;
        SET @sql = CONCAT(
            "( SELECT *,
                    @deg2dist * GCDist(@my_lat, @west_lon, lat, lon) AS dist
                FROM geoData
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat 
                  AND lon BETWEEN @west_lon - @dlon
                              AND @west_lon + @dlon   
                HAVING dist <= ", _max_dist, " )
            UNION ALL
            ( SELECT *,
                    @deg2dist * GCDist(@my_lat, @east_lon, lat, lon) AS dist
                FROM geoData
                WHERE lat BETWEEN @my_lat - @dlat
                              AND @my_lat + @dlat   
                  AND lon BETWEEN @east_lon - @dlon
                              AND @east_lon + @dlon   
                HAVING dist <= ", _max_dist, " )
            ORDER BY dist
            LIMIT ", _limit
                        );
    END IF;

    PREPARE _sql FROM @sql;
    EXECUTE _sql;
    DEALLOCATE PREPARE _sql;
END

My Problem:

I would like to pass in a zip code or name of a town and start my search from there. So my thought was I request this information and look up my table of all towns/zip codes from the world. After that, I have the information of lat/lon if only one result was found or I would ask the user to select the right choice in that case of having multiple results.

After that, I start searching for nearest towns close to my current position. Let's say I want a list of 50 towns/cities. And with that, I'd go and look up and see if the table containing the locations matches 5 results in there.

On second thought, this sounds like a bad idea...

Approach 1:

I read up on stored procedures, sql and monster queries and try to get the following:

Passing in a zip code/city name I would look that up, take my lat/lon from the huge table (possible as the function in mysql) and with that given I'd look for the nearest towns and join right then and there the locations table and get my 5 closest locations.

Questions:

  • How would I avoid having several matches for the same name of a city/zip code?
  • Does it sound possible to do so with a simple join in order to get the 5 closest locations?

Approach 2:

Get all the lat/lon values of my locations and then run the procedure on this table instead. And just use the huge table in order to retrieve my current position?

With that, I would need to gather all the lat/lon of my locations though. But it might be the best way.

But having the huge database of all cities/zip codes just to get the locations seems like a bit of an overkill. I would hope there is an alternative then maybe... somehow...

Approach 3

To be honest, this function I want seems like written a million times before. So why should I bother reinventing the wheel? But I have no clue how to find the right articles or books in order to accomplish my goal.

Has anyone else of you an idea for the best practice for something like that?

like image 287
floGalen Avatar asked May 25 '18 13:05

floGalen


2 Answers

First some comments...

I've seen dozens (not millions) of implementation here and on other forums; yours is better than most.

According to one data source (which I happen to have downloaded) there are about 3.2 million cities in the world.

For performance, you need to avoid checking all 3M rows. You have made a good start with the growing bounding box. Note that you should have

INDEX(lat, lon),
INDEX(lon, lat)

The Optimizer will choose between those and the first query (with the COUNT(*)) will see that as 'covering'. It will be a stripe around the globe or a wedge; a definite improvement over 3M rows. The worst latitude (+34 degrees) has 96K cities in it. (1 degree = 69 miles / 111 km.) For a tenth of a degree, 34.4 is the worst, with 10K cities.

(Yes, I enjoy this kind of data puzzle.)

And, I see that you handle the dateline and poles. I don't think you can improve on having them as a special case.

(I have only glanced at the formulas and constants.)

Geohash and Z-order indexing help. But they have a hiccup in that you need to check up to 4 areas around the target -- It's like not realizing that the integers 199999 and 200000 are really close to each other, in spite of the first digit of each is different.

"User passes in zip code or city name" -- that's a point query into one of two simple tables. (Except that there can be dups -- over 320 each of "san jose" and "san antonio". Pretty far down the list is the first non-Spanish name: "victoria", with only 144 cities.)

Second, my implementation... (It has some similarities to yours.)

http://mysql.rjweb.org/doc.php/latlng

This improves on performance by using PARTITIONing to keeping the bounding box down to roughly a square, instead of a stripe or wedge. If you are looking for the 5 nearest, my algorithm will rarely touch more than a few dozen rows, and those rows will be 'clustered' in a small number of blocks, thereby keeping the number of disk hits very low.

A critical thing in my design is to have all the necessary columns in the one table. Once you have found the nearest 5, you can go off to other tables to get ancillary things (phone number, etc).

As for zip codes, turn them into lat/lon before starting the search for the 5 nearest.

A join inside the algorithm is very likely to destroy the performance.

like image 169
Rick James Avatar answered Oct 19 '22 18:10

Rick James


16K rows isn't really that much.

I have a cities table with 3.1M rows (data taken from https://www.maxmind.com/de/free-world-cities-database). I have created a "fake" locations table with 16K distinct random cityIds and some dummy data. I use one column with POINT data type instead of latitude and longitude. And this is what I get from quite simple queries on MySQL 5.7.18:

select l.*, c.*, st_distance(point(-0.127758, 51.507351), c.geoPoint) dist
from locations l
join cities c using (cityId)
order by dist
limit 5

Execution time is ~ 70ms.

That can be improved with a subquery:

select l.*, c.*, x.dist
from (
    select l.locationId, st_distance(point(-0.127758, 51.507351), c.geoPoint) dist
    from locations l
    join cities c using (cityId)
    order by dist
    limit 5
) x
join locations l using(locationId)
join cities c using(cityId)

Execution time: ~ 40ms

If you store the geoPoint (redundantly) in the locations table, you can avoid the join with the cities table.

select l.*, st_distance(point(-0.127758, 51.507351), l.geoPoint) dist
from locations l
order by dist
limit 5

Execution time: ~ 17ms

You can still join the cities table to the subquery without losing performance.

Note that all those queries will calculate the distance for all 16K rows and sort them. But the performance might though be enough for you.

If that is not fast enough or the locations table will grow over the time or if you want to search in the big table, you can still do something similar as you do with your procedure using a SPATIAL INDEX on geoPoint and MBRWithin() or MBRContains().

The algorithm:

  • Define a small polygon around the users location.
  • Increase the size of the polygon in a loop until it contains at least 5 locations.
  • Use the locations within the polygon to select 5 nearest.

Note that depending on what type of polygon you use, you might need to increase the size once again after you've found one with 5 locations. For example - if you use a square (simple implementation), you should double the size (increase the length by factor sqrt(2)), to be absolutly sure that you don't miss a location outside the square, which is closer than the 5th location within the square. This is because a square is not a circle. But if you use an octagon, you might say - this is circle enough - and skip the last step.

This might be not the best algorithm. But it is quite simple to implement and should scale well enough.

like image 2
Paul Spiegel Avatar answered Oct 19 '22 19:10

Paul Spiegel