Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is correct way to implement splitWith from "Real World Haskell?"

Tags:

haskell

I've been working my way through real world haskell and trying to do the exercises. I managed to implement a working version of splitWith from Chapter 4.5 exercise 2. I feel like this is not a very haskell way of doing things. Having to implement a new function with an accumulator seems very roundabout. Is there a more idiomatic way to do this, such as with fold? I looked at the documentation for foldl but I was left scratching my head as to how.

splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith _ [] = []
splitWith f a  = splitWithAcc f a []
  where 
    splitWithAcc :: (a -> Bool) -> [a] -> [[a]] -> [[a]]
    splitWithAcc f xs acc
      | null xs     = acc
      | f $ head xs = splitWithAcc f (dropWhile f xs) (acc ++ [takeWhile f xs])
      | otherwise   = splitWithAcc f (tail xs) acc

CLARIFICATION

Here is the text of the exercise:

Write a function splitWith that acts similarly to words but takes a predicate and a list of any type, and then splits its input list on every element for which the predicate returns False:

like image 921
Michael Barton Avatar asked Oct 15 '13 16:10

Michael Barton


1 Answers

Recursion is your friend, but I would do it a bit differently. First, I would make my condition be True when I split, rather than making it False. Second, I would utilize a handy function from Data.List called break

> :t break
break :: (a -> Bool) -> [a] -> ([a], [a])
> break (== ' ') "This is a test"
("This", " is a test")

And I'd define my function using it as

splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' cond [] = []
splitWith' cond xs = first : splitWith' cond (safeTail rest)
    where
        (first, rest) = break cond xs
        -- Need this function to handle an empty list
        safeTail [] = []
        safeTail (_:ys) = ys

Or, if you want to write it as confusing as possible

splitWith'' :: (a -> Bool) -> [a] -> [[a]]
splitWith'' _ [] = []
splitWith'' cond xs = uncurry (:) $ fmap (splitWith'' cond . safeTail) $ break cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys

This works because fmap over 2-tuples applies the function to the second element of the tuple. Then it uncurries : and applies it to the first and rest.

Update

If you want it to split when the predicate is false, you can either use span instead of break, or just define it as

splitWithWeird cond xs = splitWith' (not . cond) xs

although the second will obviously incur a slightly smaller overhead (unless the compiler can optimize it away)

Update 2

If you want to handle repeating chars, there's an easy, quick fix if it suits your needs:

> filter (not . null) $ splitWithWeird (/= ' ') "This  is   a    test"
["This","is","a","test"]

With such an easy fix, we might be tempted to build it in to the algorithm itself:

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond [] = []
splitWithWeird cond xs = filter (not . null) $ first : splitWithWeird cond (safeTail rest)
    where
        (first, rest) = span cond xs
        safeTail [] = []
        safeTail (_:ys) = ys

But this would be a bad idea. Since this is a recursive function, you're adding in a call to filter (not . null) at each level, so at each split location in the function. All of these have to scan the entire list before returning, so there's extra checks that have to be performed. It'd be better to define it as a separate function so that filter (not . null) is only called once:

splitWithWeird' :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird' cond xs = filter (not . null) $ splitWithWeird cond xs

Or, if you wanted it built in to the algorithm:

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond xs = filter (not . null) $ splitWithHelper cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys
        splitWithHelper cond [] = []
        splitWithHelper cond xs =
            let (first, rest) = span cond xs
            in first : splitWithHelper cond (safeTail rest)

Which is really just doing the same thing internally as defining two functions. Notice that I had to use the additional let ... in ... statement here (I don't like nesting wheres) because (first, rest) = span cond xs belongs to splitWithHelper, not to splitWithWeird. If you leave it in the where clause, the algorithm won't work.

Update 3

Not wanting to leave only a non-ideal solution here, I've gone ahead and written an algorithm for splitting on a subsequence instead of just on a condition or element. It does use the first function from Control.Arrow, but only to make the code significantly more compact.

import Control.Arrow (first)

isPrefixOf :: Eq a => [a] -> [a] -> Bool
isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys

splitSubseq :: Eq a => [a] -> [a] -> [[a]]
splitSubseq sub [] = []
splitSubseq sub xs = initial : splitSubseq sub rest
    where
        lsub = length sub
        splitter [] = ([], [])
        splitter yss@(y:ys)
            | isPrefixOf sub yss = ([], drop lsub yss)
            | otherwise = first (y :) $ splitter ys
        (initial, rest) = splitter xs

I'm not saying that this is an efficient solution, but it should be fairly easy to follow. First, I defined a function isPrefixOf that returns a True if the second list starts with the first list.

I wanted to keep the same pattern of recursion (first : recursive rest), so I wrote splitter to take the place of span or break, and this is where isPrefixOf comes in. If the subsequence is the prefix of the list, then it returns ([], restAfterSubsequence), otherwise it stores the first character of the list, then repeats this operation starting with the next element. My use of first here is simply so that I could write this function recursively and succinctly. It just applies (y :) to the first element of the return value of splitter. The second element of the tuple returned from splitter is just the rest of the input that has yet to be consumed.

And in case you're interested, here are the performance statistics for this algorithm (compiled with --make -O2, i5 quad):

main = print $ sum $ take (10 ^ 7) $ map length $ splitSubseq " " $ cycle "Testing "

70000000
   6,840,052,808 bytes allocated in the heap
       2,032,868 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     13114 colls,     0 par    0.06s    0.07s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0004s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.68s  (  3.74s elapsed)
  GC      time    0.06s  (  0.07s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.74s  (  3.81s elapsed)

And then de-embedding the summation and length:

main = print $ sum $ take (10 ^ 7) $ map length $ repeat "Testing"

70000000
     240,052,572 bytes allocated in the heap
          12,812 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       458 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.09s  (  0.09s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.11s  (  0.09s elapsed)

So we can see that this only takes out about 0.1 seconds of time, leaving us with approximately 3.64 seconds for this algorithm to split a string consisting of "Testing " repeated 10 million times, all with a tiny amount of memory used. The only downside is that this algorithm actually slows down significantly when compiled with -threaded and run with more cores.

like image 65
bheklilr Avatar answered Sep 28 '22 08:09

bheklilr