Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Levenshtein search

I work on a site which sells let's say stuff and offers a "vendors search". On this search you enter your city, or postal code, or region and a distance (in km or miles) then the site gives you a list of vendors.

To do that, I have a database with the vendors. In the form to save these vendors, you enter their full address and when you click on the save button, a request to google maps is made in order to get their latitude and longitude.

When someone does a search, I look on a table where I store all the search terms and their lat/lng. This table looks like

+--------+-------+------+
| term   | lat   | lng  |
+--------+-------+------+

So the first query is something very simple

select lat, lng from my_search_table where term = "the term"

If I find a result, I then search with a nice method for all the vendors in the range the visitor wants and print the result on a map.

If I don't find a result, I search with a levenshtein function because people writing bruxelle or bruxeles instead of bruxelles is something really common and I don't want to make a request to google maps all the time (I also have a "how many time searched" column in my table to get some stats)

So I request my_search_time with no where clause and loop through all results to get the smallest levensthein distance. If the smallest result is greater than 2, I request coordinates from google maps.

Here is my problem. For some countries (we have several sites all around the world), my_search_table has 15-20k+ entries... and php doesn't (really) like looping on such data (which I perfectly understand) and my request falls under the php timeout. I could increase this timeout but the problem will be the same in a few months.

So I tried a levensthein MySQL function (found on stackoverflow btw) but it's also very slow.

So my question is "is there any way to make this search fast even on very large datasets ?"

like image 390
bmichotte Avatar asked Mar 02 '13 09:03

bmichotte


2 Answers

My suggestion is based on three things:

  • First, your data set is big. That means - it's: big enough to reject the idea of "select all" + "run levenshtein() in PHP application"
  • Second, you have control over your database. So you can adjust some architecture-related things
  • Finally, performance of SELECT queries is the most important thing, while performance for adding new data doesn't matter.

The thing is you can not perform fast levenshtein search because levenshtein itself is very slow. I mean, calculating levenshtein distance is a slow thing. Thus, you'll not be able to resolve the issue with only "smart search". You'll have to prepare some data.

Possible solution will be: create some group index and assign it during adding/updating data. That means - you'll store additional column which will store some hash (numeric, for example). When adding new data, you'll:

  • Perform search with levenshtein distance (for that you may either use your application or that function which you've (already mentioned) over all records in your table against inserted data
  • Set group index for new row to value of index which found rows in previous step have.
  • If nothing found, set some new group index value (it' the first row and there are no similar rows yet) - which will be different from any group index values that already present in table

To search desired rows, you'll need just select rows with same group index value. That means: your select queries will be very fast. But - yes, this will cause extremely huge overhead when adding/changing your data. Thus, it isn't applicable for case, when performance of updating/inserting matters.

like image 104
Alma Do Avatar answered Sep 28 '22 19:09

Alma Do


You could try MySQL function SOUNDS LIKE

SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term"
like image 30
david strachan Avatar answered Sep 28 '22 19:09

david strachan