Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What does cost mean in php's levenshtein function to compare strings?

I'm studying php's levenshtein function to create a search in a small redis instance to get matches even if there are typos in the submitted search term. While the most of it is quite self explaining, I'm struggling to get how to best use the three different cost parameters.

int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

There's a short explanation in the documentation

A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.

But that doesn't solve my not understanding. Can someone explain how I can use cost parameters to improve the results/performance?

like image 681
baao Avatar asked Jul 03 '15 19:07

baao


2 Answers

In Machine Learning, a cost function is the function that you're trying to minimize to achieve the best result. When the machine execute, for instance, Steps A, B and C, the Cost Function will calculate how much it 'costed' to execute those steps. The term cost is connected to a mathematical function that will evaluate the performance of your result.

When computer, for instance, execute the Steps B, C and A, the cost function will tell you if you got a better or worst result than the previous step.

Reading the documentation, you can see that

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.

That is your cost function: Minimize the distance in terms of replace, delete and insert.

Each time the algorithm has to perform one of those tasks, it adds up one cost of that operation. Those last 3 parameters lets you decide the value of each operation. At the end of the comparison, you will get a final value, namely cost of the function. If that value is smaller than a defined threshold, it's the same as assuming the function returned true. In case the result is bigger than your threshold, it means it costs more than what you allow for one string be equal to another.

like image 176
Marco Aurélio Deleu Avatar answered Oct 01 '22 06:10

Marco Aurélio Deleu


I dont know how it could help with what you are looking, but I can explain how those costs work.

By default, the cost of insert, replace and delete operations have value of 1.

It means

String A: hello String B: helloo

levenshtein($stringA, $stringB) = 1 because it takes 1 "insert" operation to make string A == string B. Also, because the costs of insert is 1. That why levenshtein return 1. If you set "insert costs" to 2, you get 2.

Same idea applies to replace and delete operations.

Remember that lower levenshtein number, the less operations it take to make string A == string B.

like image 33
Michael Nguyen Avatar answered Oct 01 '22 06:10

Michael Nguyen