In the source of Data/ByteString.hs
it says that the function findSubstrings
has been deprecated in favor of breakSubstring
. However I think the findSubstrings
which was implemented using the KMP algorithm is much more efficient than the algorithm used in breakSubstring
which is a naive one. Anybody has any idea why this has been done ?
Here's the old implementation:
{-# DEPRECATED findSubstrings "findSubstrings is deprecated in favour of breakSubstring." #-}
{-
{- This function uses the Knuth-Morris-Pratt string matching algorithm. -}
findSubstrings pat@(PS _ _ m) str@(PS _ _ n) = search 0 0
where
patc x = pat `unsafeIndex` x
strc x = str `unsafeIndex` x
-- maybe we should make kmpNext a UArray before using it in search?
kmpNext = listArray (0,m) (-1:kmpNextL pat (-1))
kmpNextL p _ | null p = []
kmpNextL p j = let j' = next (unsafeHead p) j + 1
ps = unsafeTail p
x = if not (null ps) && unsafeHead ps == patc j'
then kmpNext Array.! j' else j'
in x:kmpNextL ps j'
search i j = match ++ rest -- i: position in string, j: position in pattern
where match = if j == m then [(i - j)] else []
rest = if i == n then [] else search (i+1) (next (strc i) j + 1)
next c j | j >= 0 && (j == m || c /= patc j) = next c (kmpNext Array.! j)
| otherwise = j
-}
And here's the new naive one:
findSubstrings :: ByteString -- ^ String to search for.
-> ByteString -- ^ String to seach in.
-> [Int]
findSubstrings pat str
| null pat = [0 .. length str]
| otherwise = search 0 str
where
STRICT2(search)
search n s
| null s = []
| pat `isPrefixOf` s = n : search (n+1) (unsafeTail s)
| otherwise = search (n+1) (unsafeTail s)
The reason for the change was that the implementation of KMP was actually more inefficient than the naive version, which is implemented with an eye on performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With