Levenshtein distance in T-SQL

1 Answer

I implemented the standard Levenshtein edit distance function in T-SQL with several optimizations that improve the speed over the other versions I'm aware of. The improvement is significant when the two strings share characters at their start (shared prefix) or at their end (shared suffix), or when the strings are large and a max edit distance is provided. For example, when the inputs are two very similar 4000 character strings and a max edit distance of 2 is specified, this is almost three orders of magnitude faster than the edit_distance_within function in the accepted answer, returning the answer in 0.073 seconds (73 milliseconds) vs 55 seconds.

It's also memory efficient, using space equal to the larger of the two input strings plus some constant space. It uses a single nvarchar "array" representing a column of the distance matrix, and does all computations in place in that array, plus some helper int variables.
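The nvarchar "array" idea can be tried on its own. The following is a minimal sketch, not part of the function: the variable names and the values 5 and 42 are chosen only for illustration. Each character position holds one integer, written with NCHAR, read back with UNICODE, and overwritten in place with STUFF.

```sql
DECLARE @v0 nvarchar(4000) = N''   -- the "array": one character per element
      , @j int = 1;

-- store the integers 1..5, one per character position
WHILE (@j <= 5) BEGIN
    SELECT @v0 = @v0 + NCHAR(@j);
    SELECT @j = @j + 1;
END

-- overwrite element 3 with the value 42, in place
SELECT @v0 = STUFF(@v0, 3, 1, NCHAR(42));

-- read elements back as integers
SELECT UNICODE(SUBSTRING(@v0, 2, 1)) AS element_2   -- 2
     , UNICODE(SUBSTRING(@v0, 3, 1)) AS element_3;  -- 42
```

This works for edit distances because NCHAR accepts values up to 65535, and no distance between two nvarchar(4000) inputs can exceed 4000.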

Optimizations:

skips processing of shared prefix and/or suffix
early return if larger string starts or ends with entire smaller string
early return if difference in sizes guarantees max distance will be exceeded
uses only a single array representing a column in the matrix (implemented as nvarchar)
when a max distance is given, time complexity goes from O(len1*len2) to O(min(len1,len2)), i.e. linear, treating the max distance as a constant
when a max distance is given, early return as soon as max distance bound is known not to be achievable
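To make the prefix and suffix skipping concrete, here is a small standalone sketch that uses the same two loops as the function on a pair of made-up literals. It uses LEN for brevity, which is an assumption that the inputs have no trailing spaces; the function itself uses a datalength-based calculation instead.

```sql
DECLARE @s nvarchar(4000) = N'prefix_cat_suffix'    -- illustrative inputs only
      , @t nvarchar(4000) = N'prefix_cart_suffix'
      , @sLen int, @tLen int, @start int = 1;

SELECT @sLen = LEN(@s), @tLen = LEN(@t);   -- LEN ignores trailing spaces; fine for these literals

-- drop the shared suffix by shrinking both lengths
WHILE (@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
    SELECT @sLen = @sLen - 1, @tLen = @tLen - 1;

-- drop the shared prefix by advancing the start index
WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1))
    SELECT @start = @start + 1;

-- what is left is all the main loop would have to process
SELECT SUBSTRING(@s, @start, @sLen - (@start - 1)) AS s_core   -- a
     , SUBSTRING(@t, @start, @tLen - (@start - 1)) AS t_core;  -- ar
```

Only the one and two character cores remain, so the distance calculation runs on those instead of on the full 17 and 18 character strings.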

Here is the code (updated 1/20/2014 to speed it up a bit more):

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and substitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case
-- sensitivity configured in SQL Server (case-insensitive by default).
--
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0    -- return variable
          , @v0 nvarchar(4000)   -- running scratchpad for storing computed distances
          , @start int = 1       -- index (1 based) of first non-matching character between the two strings
          , @i int, @j int       -- loop counters: i for s string and j for t string
          , @diag int            -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int            -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar         -- character at index i from s string
          , @thisJ int           -- temporary storage of @j to allow SELECT combining
          , @jOffset int         -- offset used to calculate starting value for j loop
          , @jEnd int            -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int         -- difference in length between the two strings

    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1))
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance =
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    -- match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag  -- substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                       -- deletion
                              END
                END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = CASE WHEN (@distance > @max) AND (@thisJ = @i + @lenDiff) THEN @jEnd + 2 ELSE @thisJ + 1 END
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END
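A short usage sketch, assuming the function has been created in the dbo schema exactly as posted. The input strings are only the usual textbook example, and the expected values in the comments come from tracing the function by hand rather than from a test run.

```sql
-- Assumes dbo.Levenshtein exists as defined above.
SELECT dbo.Levenshtein(N'kitten', N'sitting', NULL) AS unbounded    -- 3
     , dbo.Levenshtein(N'kitten', N'sitting', 2)    AS capped_at_2  -- NULL, since the distance exceeds @max
     , dbo.Levenshtein(N'kitten', N'kitten', NULL)  AS identical;   -- 0
```

Passing NULL for @max means "no limit". Passing a small @max is what enables the early exits and the narrow processing window, at the cost of getting NULL back instead of the actual distance when the limit is exceeded.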

As mentioned in the comments of this function, the case sensitivity of the character comparisons follows the collation that's in effect. By default, SQL Server's collation is one that results in case-insensitive comparisons. One way to modify this function to always be case sensitive would be to add a specific collation to the places where strings are compared. However, I have not thoroughly tested this, especially for side effects when the database is using a non-default collation. Note that the suffix-trimming loop makes the same kind of character comparison, so it would presumably need the same COLLATE clause as well. This is how the two lines would be changed to force case-sensitive comparisons:

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)

and

            SELECT @distance =
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag                    -- match, no change
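The effect of the COLLATE clause on a comparison can be checked in isolation. This is a minimal sketch using two literals; the case-insensitive collation name is included only for contrast and is not used in the function.

```sql
SELECT CASE WHEN N'a' = N'A' COLLATE SQL_Latin1_General_CP1_CI_AS THEN 1 ELSE 0 END AS case_insensitive  -- 1
     , CASE WHEN N'a' = N'A' COLLATE SQL_Latin1_General_CP1_CS_AS THEN 1 ELSE 0 END AS case_sensitive;   -- 0
```

With the case-sensitive collation applied, 'a' and 'A' no longer compare as equal, so a change of case would count as a substitution in the distance.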

answered Sep 24 '22 08:09

hatchet - done with SOverflow

Tags:

tsql

levenshtein-distance

edit-distance

Alexander Prokofyev
