Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to calucate the edit distance between a regexp and a string?

If so, please explain how.

Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."

For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ and xyz.

If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.

How can you find the shortest distance between a regexp and a non-matching string?

The above is the Damerau–Levenshtein distance algorithm.

like image 363
blunders Avatar asked Oct 20 '10 02:10

blunders


People also ask

What is the edit distance between two strings?

The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance. Vladimir Levenshtein is a Russian mathematician who published this notion in 1966.

Is Levenshtein distance same as edit distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

How do you find the distance between two strings?

There are several ways to measure the distance between two strings. The simplest one is to use hamming distance to find the number of mismatch between two strings. However, the two strings must have the same length.


1 Answers

You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.

If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.

like image 103
Eamon Nerbonne Avatar answered Oct 15 '22 22:10

Eamon Nerbonne