I am interested in an algorithm in T-SQL for calculating the Levenshtein distance.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1), where M and N are the lengths of the two words, and walking through that matrix with two nested for loops, performing a small calculation at each cell.
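To make that concrete, here is a minimal sketch of the classic full-matrix approach in T-SQL. The function name dbo.Levenshtein_FullMatrix and the use of a table variable as the matrix are illustrative assumptions only; this naive version also ignores trailing spaces (because LEN does) and is far slower than the optimized function presented further down.

-- Minimal sketch of the classic (M+1)x(N+1) matrix approach. A table variable stands in
-- for the matrix; cell (i, j) holds the distance between the first i characters of @s
-- and the first j characters of @t.
CREATE FUNCTION dbo.Levenshtein_FullMatrix(@s nvarchar(100), @t nvarchar(100))
RETURNS int
AS
BEGIN
    DECLARE @m int = LEN(@s), @n int = LEN(@t);
    DECLARE @d TABLE (i int NOT NULL, j int NOT NULL, dist int, PRIMARY KEY (i, j));
    DECLARE @i int = 0, @j int = 1;

    -- row 0 and column 0: the distance to or from the empty string is just the length
    WHILE @i <= @m BEGIN INSERT @d VALUES (@i, 0, @i); SET @i += 1; END
    WHILE @j <= @n BEGIN INSERT @d VALUES (0, @j, @j); SET @j += 1; END

    -- two nested loops fill the rest of the matrix one cell at a time
    SET @i = 1;
    WHILE @i <= @m
    BEGIN
        SET @j = 1;
        WHILE @j <= @n
        BEGIN
            INSERT @d (i, j, dist)
            SELECT @i, @j,
                   CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                        THEN diag.dist                               -- characters match: no edit needed
                        ELSE 1 + (SELECT MIN(v)                      -- cheapest of deletion, insertion, substitution
                                  FROM (VALUES (up.dist), (lft.dist), (diag.dist)) AS c(v))
                   END
            FROM @d AS up
            JOIN @d AS lft  ON lft.i  = @i     AND lft.j  = @j - 1   -- cell to the left
            JOIN @d AS diag ON diag.i = @i - 1 AND diag.j = @j - 1   -- cell diagonally above-left
            WHERE up.i = @i - 1 AND up.j = @j;                       -- cell above
            SET @j += 1;
        END
        SET @i += 1;
    END

    RETURN (SELECT dist FROM @d WHERE i = @m AND j = @n);
END

Each cell (i, j) ends up holding the distance between the first i characters of @s and the first j characters of @t, so the bottom-right cell is the answer.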
The SOUNDEX() function returns a four-character code used to evaluate the similarity of two expressions; the code is based on how the string sounds when spoken.
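For example, pairing SOUNDEX() with DIFFERENCE(), which compares the SOUNDEX codes of two strings and returns a score from 0 (weak similarity) to 4 (strong similarity):

SELECT SOUNDEX('Smith')             AS SmithCode,  -- S530
       SOUNDEX('Smyth')             AS SmythCode,  -- S530: both spellings map to the same phonetic code
       DIFFERENCE('Smith', 'Smyth') AS Score;      -- 4: the two codes match on all four characters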
Used as a metric, the Levenshtein distance can boost the accuracy of an NLP model by verifying each named entity in the entry. The vector search solution does a good job of finding the most similar entry, as defined by the vectorization.
Approximate (fuzzy) string matching is the technique of finding strings that match a pattern approximately rather than exactly; users and reviewers often capture names inaccurately.
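As a sketch of how that verification step might look with the Levenshtein function presented below, a captured name can be checked against candidate rows returned by a search. The variable and table names here (@extractedEntity, dbo.VectorSearchCandidates) are hypothetical:

DECLARE @extractedEntity nvarchar(4000) = N'Jon Smith';    -- name as captured by the user/reviewer or NLP model

SELECT c.CandidateId,
       c.CandidateName,
       dbo.Levenshtein(@extractedEntity, c.CandidateName, 2) AS EditDistance
FROM   dbo.VectorSearchCandidates AS c                      -- hypothetical table of candidate matches
WHERE  dbo.Levenshtein(@extractedEntity, c.CandidateName, 2) IS NOT NULL  -- NULL means more than 2 edits away
ORDER BY EditDistance;                                      -- closest names (fewest edits) first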
I implemented the standard Levenshtein edit distance function in T-SQL with several optimizations that improve its speed over the other versions I'm aware of. In cases where the two strings share characters at their start (common prefix) or at their end (common suffix), and when the strings are large and a max edit distance is provided, the improvement in speed is significant. For example, when the inputs are two very similar 4000-character strings and a max edit distance of 2 is specified, this is almost three orders of magnitude faster than the edit_distance_within function in the accepted answer, returning the answer in 0.073 seconds (73 milliseconds) vs 55 seconds. It's also memory efficient, using space equal to the larger of the two input strings plus some constant space: it uses a single nvarchar "array" representing one column of the matrix and does all computations in place in that, plus some helper int variables.
Optimizations:
- A prefix and/or suffix common to both strings is trimmed away before the main loop, since those characters cannot contribute to the distance.
- If the strings differ in length, the shorter one is placed in @s so that most of the time is spent spinning the inner loop.
- When a max edit distance is supplied, each row only examines the diagonal band of cells that could still yield a result within that max, and NULL is returned as soon as the max is provably exceeded.
- All computation happens in place in a single nvarchar "array" representing one column of the conceptual matrix, plus a few helper int variables, so memory use stays proportional to the larger input string.
Here is the code (updated 1/20/2014 to speed it up a bit more):
-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and substitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the
-- case-sensitivity configured in SQL Server (case-insensitive by default).
--
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0   -- return variable
          , @v0 nvarchar(4000)  -- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two strings
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int           -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int           -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar        -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int        -- offset used to calculate starting value for j loop
          , @jEnd int           -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))  -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))  -- length of larger string
          , @lenDiff int        -- difference in length between the two strings

    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END

    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1))
        SELECT @start = @start + 1

    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END

        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance =
                   CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                   -- match, no change
                        ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag -- substitution
                                      WHEN @left < @distance THEN @left                   -- insertion
                                      ELSE @distance                                      -- deletion
                                 END
                   END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = CASE WHEN (@distance > @max) AND (@thisJ = @i + @lenDiff) THEN @jEnd + 2 ELSE @thisJ + 1 END
        END

        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END

    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END
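A quick call pattern, assuming the function has been created as above (expected results shown in comments):

SELECT dbo.Levenshtein(N'kitten', N'sitting', NULL) AS Distance,       -- 3 (no max supplied)
       dbo.Levenshtein(N'kitten', N'sitting', 2)    AS WithinTwoEdits; -- NULL, since 3 exceeds @max = 2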
As noted in the header comments of this function, the case sensitivity of the character comparisons follows the collation that is in effect, and SQL Server's default collation results in case-insensitive comparisons. One way to make this function always case sensitive is to add a specific collation to the two places where strings are compared. However, I have not thoroughly tested this, especially for side effects when the database uses a non-default collation. Here is how the two lines would be changed to force case-sensitive comparisons:
-- prefix common to both strings can be ignored WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)
and
SELECT @distance = CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag --match, no change
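To see the effect of the added COLLATE clause in isolation (this assumes the database default is a case-insensitive collation):

SELECT CASE WHEN N'a' = N'A'                                      THEN 1 ELSE 0 END AS DefaultComparison, -- 1 under the default case-insensitive collation
       CASE WHEN N'a' = N'A' COLLATE SQL_Latin1_General_Cp1_CS_AS THEN 1 ELSE 0 END AS CaseSensitive;     -- 0: 'a' and 'A' no longer compare equal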