I am currently using similar_text to compare a string against a list of ~50,000 strings. It works, but because of the sheer number of comparisons it's very slow: it takes around 11 minutes to process ~500 unique strings.
Before running this I check the database to see whether the string has been processed in the past, so every run after the initial one is close to instant.
I'm sure using levenshtein would be somewhat faster, and the LevenshteinDistance function someone posted in the manual comments looks interesting. Am I missing something that could make this significantly faster?
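For context, PHP's built-in levenshtein is implemented in C, while a LevenshteinDistance function pasted from the manual comments runs as interpreted code, so the built-in should normally win. The algorithm itself is the classic dynamic-programming edit distance; here is a sketch in Python for illustration (the original context is PHP):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    if len(a) < len(b):  # iterate over the longer string in the outer loop
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Note that even this is O(m*n) per pair, so comparing ~500 strings against ~50,000 is still millions of quadratic comparisons; the real wins come from avoiding comparisons altogether.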
In the end, both levenshtein and similar_text were too slow given the number of strings to check, even with lots of pre-checks and only falling back to them as a last resort.
As an experiment, I ported some of the code to C# to see how much faster it would be than interpreted code. It ran in about 3 minutes with the same dataset.
Next I added an extra field to the table and used the double metaphone PECL extension to generate a key for each row. The results were good, although since some strings included numbers this caused duplicate keys. I suppose I could then have run each collision through the functions above, but decided not to.
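The idea behind the keying approach is: compute a phonetic key per row once, index it, and only run expensive comparisons between strings that share a key. Double metaphone isn't in Python's standard library, so this sketch (Python for illustration) substitutes a simplified Soundex as the key function; like metaphone, it ignores digits, which is exactly why strings differing only in numbers collide on the same key:

```python
from collections import defaultdict

def soundex(word: str) -> str:
    """Simplified Soundex key: a crude phonetic stand-in for double metaphone."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    letters = [c for c in word.upper() if c.isalpha()]  # digits are ignored
    if not letters:
        return ""
    key, prev = letters[0], codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:  # drop vowels, collapse adjacent repeats
            key += code
        prev = code
    return (key + "000")[:4]

def bucket_by_key(strings):
    """Group strings by key so expensive comparisons stay inside one bucket."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[soundex(s)].append(s)
    return buckets
```

Within each bucket you could then break ties with an edit-distance comparison, which is the fallback step described above but declined here.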
In the end I opted for the simplest approach: MySQL's full-text search, which worked very well. There are occasional mistakes, but they are easy to detect and correct, and it runs very fast, in around 3-4 seconds.
Perhaps you could short-circuit some checks by first testing for an exact match (and before that, whether the lengths are even close), and only falling through to the expensive similar_text call when the cheap checks fail.
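That short-circuit pattern might look like the following sketch (Python for illustration, with difflib.SequenceMatcher standing in for PHP's similar_text; the 0.85 threshold is an arbitrary assumption):

```python
from difflib import SequenceMatcher

def best_match(needle: str, haystack: list[str], threshold: float = 0.85):
    """Run cheap checks first; only fall back to the expensive similarity call."""
    if needle in haystack:  # exact match: skip similarity scoring entirely
        return needle, 1.0
    best, best_score = None, 0.0
    for candidate in haystack:
        # Cheap length filter: ratio() is at most 2*min(m, n) / (m + n),
        # so a large length mismatch can never reach the threshold.
        m, n = len(needle), len(candidate)
        if 2 * min(m, n) / (m + n) < threshold:
            continue
        score = SequenceMatcher(None, needle, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

The length filter is a true upper bound on SequenceMatcher.ratio, so it skips hopeless candidates without changing the result.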
As @jason noted, an O(N^3) algorithm is never going to be a good choice.