How does an application perform a proximity search? For example, a user types in a postal code, then the application lists all the businesses within 20 miles ordered by proximity.
I want to build something like that in PHP and MySQL. Is this approach correct?
Is this OK? In step 3, I'm going to calculate the proximity for every query. Is it better to have a PROXIMITY table that lists the distance between every business and a few reference locations?
Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search. For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick".
Proximity is the search technique used to find two words next to, near, or within a specified distance of each other within a document. Using such search operators may result in more satisfactory results that are more relevant to the research needs than by just typing in desired keywords.
Proximity searching allow you to specify how close a search term appears in relation to another term contained in the of resources you find. Proximity operators are shorthand notations used during a search that usually has a number to indicate how close search terms should appear.
To do a proximity search, put all key terms that you want in the query within quotation marks, followed by the tilde ~ and the numerical value that represents the proximity you wish to search.
If there are enough records for speed to matter, here's a way to index them ahead of time.
Define a grid of bins about 20 miles on a side. Store the bin number with each store's record. At search time, compute the numbers of all bins that intersect a 20-mile radius from your search point. Then retrieve all stores in any of those bins, and proceed as before.
We use this to do many thousands of points. It is important if you are performing this in SQL to have an index on the Latitude and Longitude column. We tried doing this in SQL 2008 with spatial indexes but we really didn't see the performance increase we expected. Though if you want to calculate within a certain distance from a ZIP you need to think about if you are going to use the ZIP centroid or a polygon representation of the ZIP code.
Haversine forumla is a good place to start.
We have not had performance problems calculating the distance on the fly, we do calculate it ahead of time for some applications where we know the points ahead of time and there are going to be millions of records.
SELECT
[DistanceRadius]=
69.09 *
DEGREES(
ACOS(
SIN( RADIANS(latitude) )*SIN( RADIANS(@ziplat) )
+
COS( RADIANS(latitude) )*COS( RADIANS(@ziplat) )
*
COS( RADIANS(longitude - (@ziplon)) )
)
)
,*
FROM
table
) sub
WHERE
sub.DistanceRadius < @radius
We do this for about 1200 locations. I would just use the Haversine formula on the fly although depending on you application, it might be better to store it in PHP instead of SQL. (Our implementation is in .net so your milage may vary).
Really our biggest drawback with the way we implemented it, is that every calculation (up until recently) had to be calculated on the data tier which was painfully slow (when I say slow, I really mean non-instantaneous it took a second or so), but that was due to the fact that it had to calculate the distance for all 1200 locations based on the supplied zip code.
Depending on the route you choose, there are ways of speeding up the number distance calculations, by looking at the longitude and latitude and removing the ones outside of a predefined range (for example if you are looking at all address within 20 miles there is a longitude range you can calculate which all addresses have to fall in to be 20 miles away.) That can speed up you query if need be.
We actually looked at storing all possible combinations in our database. In reality it sounds like it could be a large data store, but it's really not in the big scope of things. With indexes it can be quite fast, and you don't have to worry about algorithm optimization etc. We decided against it, because we had the equation in C#, and it allowed us to cache the information necessary to do all the calculations in the business tier. Either will work just fine, it's just a matter of what your preference is.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With