Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Edit distance between two regular expression

I faced this question in an interview:

Given two regular expression, compute the edit distance between them. The edit distance being defined as the smallest edit distance between any two strings generated by the two regular expressions respectively.

Formally, we are looking for d(L1,L2) = min { d(x,y) | x from L1, y from L2 }, where L1 and L2 are the languages generated by the two regular expressions.

I was not able to solve it during interviews. Even now I don't have any clue how to solve it. Any ideas?

I think this is same as http://www.spoj.com/problems/AMR10B/

like image 386
Quixotic Avatar asked Apr 30 '15 08:04

Quixotic


People also ask

How do you find the edit distance between two strings?

Delete 'm'th character of str1 and compute edit distance between 'm-1' characters of str1 and 'n' characters of str2. For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2.

What is the edit distance between?

Edit Distance or Levenstein distance (the most common) is a metric to calculate the similarity between a pair of sequences. The distance between two sequences is measured as the number of edits (insertion, deletion, or substitution) that are required to convert one sequence to another.

What is normalized edit distance?

Abstract: The normalized edit distance is one of the distances derived from the edit distance. It is useful in some applications because it takes into account the lengths of the two strings compared. The normalized edit distance is not defined in terms of edit operations but rather in terms of the edit path.

Is edit distance NP complete?

The main results presented here are: The problem of edit distance with moves is NP-complete. A “recursive” sequence of moves can be simulated with at most a constant factor increase by a non-recursive sequence.


1 Answers

There's finite state machines that represent the two languages. Let's say the first language has states S[1], S[2], ..., S[N1] and transitions c: S[i]->S[j] (meaning state i goes to state j under input character c), and T[1], T[2], ... T[N2] for the second language (with its own set of transitions).

Now, you can construct the weighted multi-graph with nodes being pairs of states, and edges between pairs (S[i1], T[i2]) -> (S[j1], T[j2]) if any of these four cases hold:

  • There's c: S[i1] -> S[j1] and i2 = j2. This has weight 1
  • There's c: T[i2] -> T[j2] and i1 = j1. This has weight 1
  • There's c: S[i1] -> S[j1] and c: T[i2] -> T[j2]. This has weight 0
  • There's c: S[i1] -> S[j1] and d: T[i2] -> T[j2]. This has weight 1

Then, finding the lowest weight path from the pair of start states to any pair of accepting states gives you the minimal edit distance.

like image 158
Paul Hankin Avatar answered Oct 19 '22 20:10

Paul Hankin