I am working on an assignment in Haskell to prepare for tests. The current task asks me to tokenize a string according to the following rules: running "tokenize str separate remove" should output a list of strings. Every character in "str" that appears in the string "separate" should become a one-character string of its own. Every character in "str" that appears in the string "remove" should be removed. Characters that appear in neither "separate" nor "remove" should be bundled together.
The example shows that
tokenize "a + b* 12-def" "+-*" " "
should output
["a", "+", "b", "*", "12", "-", "def"]
My current code is below:
tokenize :: String -> String -> String -> [String]
tokenize [] _ _ = []
tokenize [x] _ _ = [[x]]
tokenize (x:xs) a b | x `elem` a = [x] : tokenize xs a b
                    | x `elem` b = tokenize xs a b
                    | otherwise = (x:head rest) : tail rest
                        where
                            rest = tokenize xs a b
It works to some extent. The problem is that each operator in the example is bundled with the token preceding it,
like this:
["a+","b*","12-","def"]
despite the operators being in the separate string.
First off, tokenize [x] _ _ is probably not what you want, because tokenize "a" "" "a" ends up being ["a"] when it should probably be []. Second, don't call the separator and removal lists Strings. They are just [Char]s. There is no difference underneath, because type String = [Char], but the point of a synonym is to make a semantic meaning clearer, and you are not really using your Strings as Strings, so your function isn't worthy of it. Additionally, you should shuffle the arguments to tokenize seps rems str, because that makes currying easier. Finally, you probably want to use Data.Set instead of [Char], but I won't use it here to stay closer to the question.
The issue itself is | otherwise = (x:head rest) : tail rest, which tacks any unspecial character onto the next token, even if that token is supposedly a separator. In your case, an example of this is when head rest = "+" and x = 'a', and you join them so you have "a+". You need to guard further.
(Also: your indentation is messed up. A where clause binds over the entire equation, so its definitions are visible throughout all the guards. It should be indented so that this is clear.)
tokenize :: [Char] -> [Char] -> String -> [String]
tokenize _ _ "" = []
tokenize seps rems (x:xs)
  | x `elem` rems = rest
  | x `elem` seps = [x]:rest
  -- Pattern guard: if rest has a single-char token on top and that token is a sep...
  | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
  -- Otherwise, if rest has a token on top (which isn't a sep), grow it
  | (growing:rest') <- rest = (x:growing):rest'
  -- Or else make a new token (when rest = [])
  | otherwise = [x]:rest
  where rest = tokenize seps rems xs
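As a quick check of the argument order suggested earlier, here is a small sketch that repeats the definition so it loads on its own and then fixes the separator and removal lists by partial application. The name tokenizeArith is hypothetical and only serves as an illustration.

```haskell
-- Same definition as above, repeated so this snippet loads on its own.
tokenize :: [Char] -> [Char] -> String -> [String]
tokenize _ _ "" = []
tokenize seps rems (x:xs)
  | x `elem` rems = rest
  | x `elem` seps = [x] : rest
  | ([sep]:_) <- rest, sep `elem` seps = [x] : rest
  | (growing:rest') <- rest = (x : growing) : rest'
  | otherwise = [x] : rest
  where rest = tokenize seps rems xs

-- Hypothetical helper: supply the separators and removals once,
-- leaving only the input string as the remaining argument.
tokenizeArith :: String -> [String]
tokenizeArith = tokenize "+-*" " "
```

In GHCi, tokenizeArith "a + b* 12-def" evaluates to ["a","+","b","*","12","-","def"]. Note that a removed character does not end the current token in this version, so tokenizeArith "a b" gives ["ab"]. The task statement does not say whether that is intended.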
You may also use filter:
tokenize seps rems = tokenize' . filter (not . flip elem rems)
  where tokenize' "" = []
        tokenize' (x:xs)
          | x `elem` seps = [x]:rest
          | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
          | (growing:rest') <- rest = (x:growing):rest'
          | otherwise = [x]:rest
          where rest = tokenize' xs
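For completeness, here is a rough sketch of the Data.Set variant mentioned at the start of this answer. It assumes the containers package is available (it ships with GHC). The name tokenizeSet and the local helper go are made up for this example, and the logic is the filter version with the membership tests swapped for set lookups.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Hypothetical Set-based variant: same algorithm, but membership tests
-- are Set lookups instead of linear scans over a list.
tokenizeSet :: Set Char -> Set Char -> String -> [String]
tokenizeSet seps rems = go . filter (`Set.notMember` rems)
  where
    go "" = []
    go (x:xs)
      -- A separator always becomes its own one-character token.
      | x `Set.member` seps = [x] : rest
      -- The next token is a separator, so x starts a new token.
      | ([sep]:_) <- rest, sep `Set.member` seps = [x] : rest
      -- The next token is an ordinary one, so x is prepended to it.
      | (growing:rest') <- rest = (x : growing) : rest'
      -- Nothing follows: x is the last token.
      | otherwise = [x] : rest
      where rest = go xs
```

It would be called as tokenizeSet (Set.fromList "+-*") (Set.fromList " ") "a + b* 12-def". For separator lists this short the difference to elem is unlikely to matter, so treat it as a matter of taste.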