
Levenshtein distance in T-SQL

I am interested in a T-SQL algorithm for calculating Levenshtein distance.

Alexander Prokofyev asked Feb 18 '09 11:02



1 Answer

I implemented the standard Levenshtein edit distance function in T-SQL with several optimizations that improve its speed over the other versions I'm aware of. The improvement is significant when the two strings share characters at their start (shared prefix) or at their end (shared suffix), and when the strings are large and a max edit distance is provided. For example, when the inputs are two very similar 4000-character strings and a max edit distance of 2 is specified, this is almost three orders of magnitude faster than the edit_distance_within function in the accepted answer, returning the answer in 0.073 seconds (73 milliseconds) vs 55 seconds.

It is also memory efficient, using space proportional to the length of the longer input string plus some constant space. It uses a single nvarchar "array" representing one column of the matrix and does all computations in place in that array, plus a few helper int variables.
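The nvarchar "array" idea can be illustrated in isolation. This is only a minimal sketch of the technique, not part of the function itself: each character of the string stores one integer via NCHAR, UNICODE reads it back, and STUFF overwrites a single element in place. It assumes every stored value fits in a single UTF-16 code unit (0 to 65535).

```sql
-- Sketch: treat an nvarchar as a 1-based array of small integers.
DECLARE @v0 nvarchar(4000) = N''
      , @j int = 1

-- build a 5-element "array" holding the values 1..5, one per character
WHILE (@j <= 5) BEGIN
    SELECT @v0 = @v0 + NCHAR(@j)
    SELECT @j = @j + 1
END

-- read element 3
SELECT UNICODE(SUBSTRING(@v0, 3, 1)) AS element3_before    -- 3

-- overwrite element 3 with 42, leaving the other elements untouched
SELECT @v0 = STUFF(@v0, 3, 1, NCHAR(42))
SELECT UNICODE(SUBSTRING(@v0, 3, 1)) AS element3_after     -- 42
```

The trade-off is that every read and write goes through a string function, but it avoids a table variable, which is what keeps the function a simple scalar UDF.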

Optimizations:

  • skips processing of shared prefix and/or suffix
  • early return if larger string starts or ends with entire smaller string
  • early return if difference in sizes guarantees max distance will be exceeded
  • uses only a single array representing a column in the matrix (implemented as nvarchar)
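  • when a max distance is given, only a band of cells around the diagonal is evaluated, so time complexity goes from O(len1*len2) to roughly O(max*min(len1,len2)), i.e. linear in the shorter string for a fixed max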
  • when a max distance is given, early return as soon as max distance bound is known not to be achievable
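A few calls that exercise these shortcuts may make them more concrete. This sketch assumes the function from this answer has already been created as dbo.Levenshtein in the current database and that the default case-insensitive collation is in effect. The expected results in the comments were traced by hand through the function body.

```sql
-- length difference (3) already exceeds @max (1): returns NULL before any matrix work
SELECT dbo.Levenshtein(N'abc', N'abcdef', 1) AS too_far              -- NULL

-- the longer string ends with the entire shorter one: the suffix loop consumes
-- all of 'def' and the function returns the remaining length of the longer string
SELECT dbo.Levenshtein(N'def', N'abcdef', NULL) AS suffix_only       -- 3

-- no max given (NULL): the full distance is computed
SELECT dbo.Levenshtein(N'kitten', N'sitting', NULL) AS full_distance -- 3

-- same strings with @max = 2: the true distance (3) exceeds the bound
SELECT dbo.Levenshtein(N'kitten', N'sitting', 2) AS bounded          -- NULL
```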

Here is the code (updated 1/20/2014 to speed it up a bit more):

    -- =============================================
    -- Computes and returns the Levenshtein edit distance between two strings, i.e. the
    -- number of insertion, deletion, and substitution edits required to transform one
    -- string to the other, or NULL if @max is exceeded. Comparisons use the case-
    -- sensitivity configured in SQL Server (case-insensitive by default).
    --
    -- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
    -- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
    -- with some additional optimizations.
    -- =============================================
    CREATE FUNCTION [dbo].[Levenshtein](
        @s nvarchar(4000)
      , @t nvarchar(4000)
      , @max int
    )
    RETURNS int
    WITH SCHEMABINDING
    AS
    BEGIN
        DECLARE @distance int = 0     -- return variable
              , @v0 nvarchar(4000)    -- running scratchpad for storing computed distances
              , @start int = 1        -- index (1 based) of first non-matching character between the two strings
              , @i int, @j int        -- loop counters: i for s string and j for t string
              , @diag int             -- distance in cell diagonally above and left if we were using an m by n matrix
              , @left int             -- distance in cell to the left if we were using an m by n matrix
              , @sChar nchar          -- character at index i from s string
              , @thisJ int            -- temporary storage of @j to allow SELECT combining
              , @jOffset int          -- offset used to calculate starting value for j loop
              , @jEnd int             -- ending value for j loop (stopping point for processing a column)
              -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
              , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
              , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
              , @lenDiff int          -- difference in length between the two strings

        -- if strings of different lengths, ensure shorter string is in s. This can result in a little
        -- faster speed by spending more time spinning just the inner loop during the main processing.
        IF (@sLen > @tLen) BEGIN
            SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
            SELECT @s = @t, @sLen = @tLen
            SELECT @t = @v0, @tLen = @i
        END
        SELECT @max = ISNULL(@max, @tLen)
             , @lenDiff = @tLen - @sLen
        IF @lenDiff > @max RETURN NULL

        -- suffix common to both strings can be ignored
        WHILE (@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
            SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

        IF (@sLen = 0) RETURN @tLen

        -- prefix common to both strings can be ignored
        WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1))
            SELECT @start = @start + 1
        IF (@start > 1) BEGIN
            SELECT @sLen = @sLen - (@start - 1)
                 , @tLen = @tLen - (@start - 1)

            -- if all of shorter string matches prefix and/or suffix of longer string, then
            -- edit distance is just the delete of additional characters present in longer string
            IF (@sLen <= 0) RETURN @tLen

            SELECT @s = SUBSTRING(@s, @start, @sLen)
                 , @t = SUBSTRING(@t, @start, @tLen)
        END

        -- initialize v0 array of distances
        SELECT @v0 = '', @j = 1
        WHILE (@j <= @tLen) BEGIN
            SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
            SELECT @j = @j + 1
        END

        SELECT @jOffset = @max - @lenDiff
             , @i = 1
        WHILE (@i <= @sLen) BEGIN
            SELECT @distance = @i
                 , @diag = @i - 1
                 , @sChar = SUBSTRING(@s, @i, 1)
                 -- no need to look beyond window of upper left diagonal (@i) + @max cells
                 -- and the lower right diagonal (@i - @lenDiff) - @max cells
                 , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
                 , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
            WHILE (@j <= @jEnd) BEGIN
                -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
                SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                     , @thisJ = @j
                SELECT @distance =
                    CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                   -- match, no change
                         ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag -- substitution
                                       WHEN @left < @distance THEN @left                   -- insertion
                                       ELSE @distance                                      -- deletion
                                  END
                    END
                SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                     , @diag = @left
                     , @j = CASE WHEN (@distance > @max) AND (@thisJ = @i + @lenDiff) THEN @jEnd + 2 ELSE @thisJ + 1 END
            END
            SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
        END
        RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
    END
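As a usage sketch, the function can drive a simple fuzzy lookup. The table variable @Names and its sample values are hypothetical and exist only for illustration. The query assumes dbo.Levenshtein has been created as above. No expected output is shown because the result for some rows depends on the collation in effect.

```sql
-- Hypothetical sample data for illustration only
DECLARE @Names TABLE (Name nvarchar(100))
INSERT INTO @Names (Name)
VALUES (N'Jonathan'), (N'Jonathon'), (N'Johnathan'), (N'Nathan'), (N'Alexander')

-- Return every name within 2 edits of the search term, closest first.
-- CROSS APPLY evaluates the function once per row so the result can be
-- both filtered and sorted without repeating the call.
SELECT n.Name, d.Distance
FROM @Names AS n
CROSS APPLY (SELECT dbo.Levenshtein(n.Name, N'Jonathan', 2) AS Distance) AS d
WHERE d.Distance IS NOT NULL
ORDER BY d.Distance
```

Passing a small max here matters for speed: rows that are clearly too far away return NULL early instead of filling the whole matrix.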

As mentioned in the comments of this function, the case sensitivity of the character comparisons follows the collation that is in effect. By default, SQL Server uses a collation that results in case-insensitive comparisons. One way to make the function always case sensitive would be to add a specific collation where characters are compared. However, I have not thoroughly tested this, especially for side effects when the database is using a non-default collation. Note that the suffix-stripping WHILE loop compares characters in the same way, so it would presumably need the same COLLATE clause. This is how the two lines in the prefix loop and the main loop would be changed to force case-sensitive comparisons:

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)

and

            SELECT @distance =
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag    -- match, no change
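The effect of an explicit COLLATE clause on a single comparison can be checked on its own before touching the function. This small sketch compares the same two characters under a case-insensitive and a case-sensitive collation.

```sql
-- The same comparison under two explicit collations
SELECT CASE WHEN N'a' = N'A' COLLATE SQL_Latin1_General_CP1_CI_AS THEN 1 ELSE 0 END AS ci_equal  -- 1
     , CASE WHEN N'a' = N'A' COLLATE SQL_Latin1_General_CP1_CS_AS THEN 1 ELSE 0 END AS cs_equal  -- 0
```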
hatchet - done with SOverflow answered Sep 24 '22 08:09