I've been working my way through real world haskell and trying to do the exercises. I managed to implement a working version of splitWith from Chapter 4.5 exercise 2. I feel like this is not a very haskell way of doing things. Having to implement a new function with an accumulator seems very roundabout. Is there a more idiomatic way to do this, such as with fold? I looked at the documentation for foldl but I was left scratching my head as to how.
splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith _ [] = []
splitWith f a = splitWithAcc f a []
where
splitWithAcc :: (a -> Bool) -> [a] -> [[a]] -> [[a]]
splitWithAcc f xs acc
| null xs = acc
| f $ head xs = splitWithAcc f (dropWhile f xs) (acc ++ [takeWhile f xs])
| otherwise = splitWithAcc f (tail xs) acc
CLARIFICATION
Here is the text of the exercise:
Write a function splitWith that acts similarly to words but takes a predicate and a list of any type, and then splits its input list on every element for which the predicate returns False:
Recursion is your friend, but I would do it a bit differently. First, I would make my condition be True when I split, rather than making it False. Second, I would utilize a handy function from Data.List
called break
> :t break
break :: (a -> Bool) -> [a] -> ([a], [a])
> break (== ' ') "This is a test"
("This", " is a test")
And I'd define my function using it as
splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' cond [] = []
splitWith' cond xs = first : splitWith' cond (safeTail rest)
where
(first, rest) = break cond xs
-- Need this function to handle an empty list
safeTail [] = []
safeTail (_:ys) = ys
Or, if you want to write it as confusing as possible
splitWith'' :: (a -> Bool) -> [a] -> [[a]]
splitWith'' _ [] = []
splitWith'' cond xs = uncurry (:) $ fmap (splitWith'' cond . safeTail) $ break cond xs
where
safeTail [] = []
safeTail (_:ys) = ys
This works because fmap
over 2-tuples applies the function to the second element of the tuple. Then it uncurries :
and applies it to the first and rest.
Update
If you want it to split when the predicate is false, you can either use span
instead of break
, or just define it as
splitWithWeird cond xs = splitWith' (not . cond) xs
although the second will obviously incur a slightly smaller overhead (unless the compiler can optimize it away)
Update 2
If you want to handle repeating chars, there's an easy, quick fix if it suits your needs:
> filter (not . null) $ splitWithWeird (/= ' ') "This is a test"
["This","is","a","test"]
With such an easy fix, we might be tempted to build it in to the algorithm itself:
splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond [] = []
splitWithWeird cond xs = filter (not . null) $ first : splitWithWeird cond (safeTail rest)
where
(first, rest) = span cond xs
safeTail [] = []
safeTail (_:ys) = ys
But this would be a bad idea. Since this is a recursive function, you're adding in a call to filter (not . null)
at each level, so at each split location in the function. All of these have to scan the entire list before returning, so there's extra checks that have to be performed. It'd be better to define it as a separate function so that filter (not . null)
is only called once:
splitWithWeird' :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird' cond xs = filter (not . null) $ splitWithWeird cond xs
Or, if you wanted it built in to the algorithm:
splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond xs = filter (not . null) $ splitWithHelper cond xs
where
safeTail [] = []
safeTail (_:ys) = ys
splitWithHelper cond [] = []
splitWithHelper cond xs =
let (first, rest) = span cond xs
in first : splitWithHelper cond (safeTail rest)
Which is really just doing the same thing internally as defining two functions. Notice that I had to use the additional let ... in ...
statement here (I don't like nesting wheres) because (first, rest) = span cond xs
belongs to splitWithHelper
, not to splitWithWeird
. If you leave it in the where clause, the algorithm won't work.
Update 3
Not wanting to leave only a non-ideal solution here, I've gone ahead and written an algorithm for splitting on a subsequence instead of just on a condition or element. It does use the first
function from Control.Arrow
, but only to make the code significantly more compact.
import Control.Arrow (first)
isPrefixOf :: Eq a => [a] -> [a] -> Bool
isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys
splitSubseq :: Eq a => [a] -> [a] -> [[a]]
splitSubseq sub [] = []
splitSubseq sub xs = initial : splitSubseq sub rest
where
lsub = length sub
splitter [] = ([], [])
splitter yss@(y:ys)
| isPrefixOf sub yss = ([], drop lsub yss)
| otherwise = first (y :) $ splitter ys
(initial, rest) = splitter xs
I'm not saying that this is an efficient solution, but it should be fairly easy to follow. First, I defined a function isPrefixOf
that returns a True if the second list starts with the first list.
I wanted to keep the same pattern of recursion (first : recursive rest
), so I wrote splitter
to take the place of span
or break
, and this is where isPrefixOf
comes in. If the subsequence is the prefix of the list, then it returns ([], restAfterSubsequence)
, otherwise it stores the first character of the list, then repeats this operation starting with the next element. My use of first
here is simply so that I could write this function recursively and succinctly. It just applies (y :)
to the first element of the return value of splitter
. The second element of the tuple returned from splitter
is just the rest of the input that has yet to be consumed.
And in case you're interested, here are the performance statistics for this algorithm (compiled with --make -O2
, i5 quad):
main = print $ sum $ take (10 ^ 7) $ map length $ splitSubseq " " $ cycle "Testing "
70000000
6,840,052,808 bytes allocated in the heap
2,032,868 bytes copied during GC
42,900 bytes maximum residency (2 sample(s))
22,636 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 13114 colls, 0 par 0.06s 0.07s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0004s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.68s ( 3.74s elapsed)
GC time 0.06s ( 0.07s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.74s ( 3.81s elapsed)
And then de-embedding the summation and length:
main = print $ sum $ take (10 ^ 7) $ map length $ repeat "Testing"
70000000
240,052,572 bytes allocated in the heap
12,812 bytes copied during GC
42,900 bytes maximum residency (2 sample(s))
22,636 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 458 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.09s ( 0.09s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.11s ( 0.09s elapsed)
So we can see that this only takes out about 0.1 seconds of time, leaving us with approximately 3.64 seconds for this algorithm to split a string consisting of "Testing "
repeated 10 million times, all with a tiny amount of memory used. The only downside is that this algorithm actually slows down significantly when compiled with -threaded
and run with more cores.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With