Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trying to write an implementation of the Levenshtein metric with matrices

Tags:

haskell

The following is what I've hacked together so far using "matrix memory" in implementing Левенште́йн. I'm trying to use Haskell for just about everything now so that I really learn it. Concepts I haven't really grasped yet include monad transformers, the state monad (working on it), and lenses.

import Data.Matrix
import Control.Monad.State
import Control.Applicative


type RecState = Int

-- Set up the first row
setLeftCol :: String -> Matrix Int -> Maybe (Matrix Int)
setLeftCol str mat = let strLength = length str + 1
                     in foldr helper (Just mat) [1..strLength]
  where
    helper :: Int -> Maybe (Matrix Int) -> Maybe (Matrix Int)
    helper value matrixMon = (\m -> safeSet (value-1) (value,1) m) =<< matrixMon

-- Encapsulate a transposition in a Maybe context
transposeM :: Matrix a -> Maybe (Matrix a)
transposeM mat = Just (transpose mat)

-- Set up the first column
setTopRow  :: String -> Matrix Int -> Maybe (Matrix Int)
setTopRow str mat = let mat' = return mat
                    in mat' >>= transposeM >>= (setLeftCol str) >>= transposeM

-- Generate coordinates
coords :: Int -> Int -> [(Int,Int)]
coords width height = [(x,y) | x <- [1..(width+1)], y <- [1..(height+1)]]

safeFst :: Maybe (Int,Int) -> Maybe Int
safeFst tuple = case tuple of
                  Just (x,y) -> Just x
                  Nothing    -> Nothing

safeSnd :: Maybe (Int,Int) -> Maybe Int
safeSnd tuple = case tuple of
                  Just (x,y) -> Just y
                  Nothing    -> Nothing

distance :: Matrix Int -> State RecState (Matrix Int)
distance matrix = do
  index <- get
  let coordinate = coordinates !! index
      i = fst coordinate
      j = snd coordinate
  if index == size then
    put matrix
    return $ getElem i j matrix
  else do
    put (index + 1)
    let ch1 = w1 !! (i - 1)
        ch2 = w2 !! (j - 1)
        cost = if ch1 /= ch2 then 1 else 0
        entry1 = (getElem (i - 1) j matrix) + 1
        entry2 = (getElem i (j - 1) matrix) + 1
        entry3 = (getElem (i - 1) (j - 1) matrix) + cost
    return $ distance $ setElem (minimum [entry1,entry2,entry3]) coordinate matrix


-- Compute the Levenshtein distance on two strings.
levenshtein :: String -> String -> Int
levenshtein "" "" = 0
levenshtein "" w2 = length w2
levenshtein w1 "" = length w1
levenshtein w1 w2 = let lenW1 = length w1
                        lenW2 = length w2
                        size = lenW1 * lenW2
                        matrix = Just $ zero (lenW1 + 1) (lenW2 + 1)
                        matrix' = matrix >>= setLeftCol w1 >>= setTopRow w2
                        coordinates = coords lenW1 lenW2
                    in execState (distance <$> matrix') (lenW1 + 2)

showResults :: Show r => r -> IO ()
showResults = putStrLn . show

showLevenshtein :: String -> String -> IO ()
showLevenshtein = showResults . levenshtein

My first question is how do I organize the distance function with levenshtein? I first put it in a where clause following the line beginning with in execState.... However, I found that neither size nor coordinates were accessible within this function, as they're defined in the original let statement of levenshtein.

Also feel free to comment on any other ideas I've tried here.

like image 532
bjd2385 Avatar asked Mar 06 '23 03:03

bjd2385


1 Answers

There's a formula for solving dynamic programming problems like this in Haskell.

  1. Write the solution in terms of a recursive formula
  2. Abstract over the recursive calls by rewriting a function a -> b as (a -> b) -> (a -> b) with no recursive calls.
  3. Redirect the recursive calls to memoization through some point in memory - a let binding, a list, an array, a memotrie, etc.

For levenshtien distance an array is appropriate.

Recursive formula

Start by writing the levenshtien distance formula recursively in terms of itself

lev :: Eq a => [a] -> [a] -> (Int, Int) -> Int
lev a b (0, 0) = 0
lev a b (0, j) = j
lev a b (i, 0) = i
lev a b (i, j) = (lev a b (i-1, j) + 1) `min` (lev a b (i, j-1) + 1) `min` (lev a b (i-1, j-1) + if a !! (i - 1) == b !! (j - 1) then 0 else 1)

The levenshtien distance for two strings is the distance calculated all the way to the last characters

levenshtien :: Eq a => [a] -> [a] -> Int
levenshtien a b = lev a b upperBound
  where
    upperBound = (length a, length b)

Abstact over recursive calls

Then replace the recursive calls to calls to some other function f that somehow implements the rest of levenshtien distance.

lev' :: Eq a => [a] -> [a] -> ((Int, Int) -> Int) -> (Int, Int) -> Int
lev' a b f (0, 0) = 0
lev' a b f (0, j) = j
lev' a b f (i, 0) = i
lev' a b f (i, j) = (f (i-1, j) + 1) `min` (f (i, j-1) + 1) `min` (f (i-1, j-1) + if a !! (i - 1) == b !! (j - 1) then 0 else 1)

You can recover lev from lev' by using fix, which is defined as fix f = let x = f x in x

import Data.Function

lev :: Eq a => [a] -> [a] -> (Int, Int) -> Int
lev a b = fix (lev' a b)

Memoization via arrays

Finally you'll need a way to memoize the intermediate results in an array. I find the following to be an easier way to build an array than the functions in Data.Array.

import Data.Array

buildArray :: Ix i => (i, i) -> (i -> e) -> Array i e
buildArray bounds f = listArray bounds (f <$> range bounds)

We can memoize a function in an array by building an array holding some of the results and use the stored value from the array if the argument is in the array, and using the original function if it's not.

memoArray :: Ix i => (i, i) -> (i -> e) -> (i -> e)
memoArray bounds f = \i -> if inRange bounds i then arr ! i else f i
  where
    arr = buildArray bounds f

We can fix a function with some of its values memoized in an array by fixing the function composed with memoing some of its values.

fixArray :: Ix i => (i, i) -> ((i -> e) -> i -> e) -> (i -> e)
fixArray bounds f = fix (memoArray bounds . f)

Putting it all together

Finally we can rewrite levenshtien in terms of lev' and fixArray, memorizing all the important bits that would be re-used along the way.

levenshtien :: Eq a => [a] -> [a] -> Int
levenshtien a b = fixArray ((1, 1), upperBound) (lev' a b) upperBound
  where
    upperBound = (length a, length b)

Further improvements

  • get rid of the quadratic list accesses !! by replacing the lists with arrays
  • get rid of the quadratic memory usage by strictly folding up a one-dimensional array
like image 53
Cirdec Avatar answered Apr 30 '23 12:04

Cirdec