Recursive tokenizer in Haskell

Question

I am working on an assignment in Haskell to prepare for tests. The current task asks me to tokenize a string according to the following rules: running "tokenize str separate remove" should output a list of strings. Every character in "str" that appears in the string "separate" should become a one-character string of its own. Every character in "str" that appears in the string "remove" should be removed. Characters that appear in neither "separate" nor "remove" should be bundled together.

The example shows that

tokenize "a + b* 12-def"   "+-*"   " "

should output

["a", "+", "b", "*", "12", "-", "def"]

My current code is below:

tokenize :: String -> String -> String -> [String]
tokenize [] _ _  = []
tokenize [x] _ _ = [[x]]
tokenize (x:xs) a b     | x `elem` a = [x] : tokenize xs a b
                        | x `elem` b = tokenize xs a b
                        | otherwise = (x:head rest) : tail rest
                                where
                                        rest = tokenize xs a b

It works to some extent; the problem is that the operators in the example are bundled with the characters preceding them,

like this

["a+","b*","12-","def"]

despite the operators being in the separate string.

HTNW · Accepted Answer

First off, tokenize [x] _ _ is probably not what you want, because tokenize "a" "" "a" ends up being ["a"] when it should probably be []. Second, don't call the separator and removal lists Strings. They are just [Char]s. There is no difference underneath, because type String = [Char], but the point of a synonym is to make a semantic meaning clearer, and you are not really using your Strings as Strings, so your function isn't worthy of it. Additionally, you should shuffle the arguments to tokenize seps rems str, because that makes currying easier. Finally, you probably want to use Data.Set instead of [Char], but I won't use it here to stay closer to the question.

The issue itself is | otherwise = (x:head rest) : tail rest, which tacks any unspecial character onto the next token, even if that token is supposedly a separator. In your case, an example of this is when head rest = "+" and x = 'a', and you join them so you have "a+". You need to guard further.
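To see the bundling on a small input, here is the question's definition copied under the made-up name tokenizeOld (only the name and the layout are new), with a hand evaluation of "a+" in the comments. Treat it as a sketch for following the recursion, not as something to build on.

```haskell
-- The question's original definition, renamed tokenizeOld for this sketch.
-- Argument order is the question's: input, separators, characters to remove.
tokenizeOld :: String -> String -> String -> [String]
tokenizeOld [] _ _  = []
tokenizeOld [x] _ _ = [[x]]
tokenizeOld (x:xs) a b
  | x `elem` a = [x] : tokenizeOld xs a b
  | x `elem` b = tokenizeOld xs a b
  | otherwise  = (x : head rest) : tail rest
  where
    rest = tokenizeOld xs a b

-- Hand evaluation of: tokenizeOld "a+" "+" " "
--   x = 'a', xs = "+"
--   rest = tokenizeOld "+" "+" " " = ["+"]     (the [x] equation fires)
--   'a' is in neither list, so the otherwise guard applies:
--   ('a' : head rest) : tail rest = ('a' : "+") : [] = ["a+"]
```

The singleton equation also accounts for the first point above: tokenizeOld "a" "" "a" returns ["a"], because [x] matches before either guard is checked.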

(Also: your indentation is messed up. A where clause binds over the entire equation, so rest is visible in all of the guards. It should be indented so that this is clear.)

tokenize :: [Char] -> [Char] -> String -> [String]
tokenize _ _ "" = []
tokenize seps rems (x:xs)
  | x `elem` rems                      = rest
  | x `elem` seps                      = [x]:rest
  -- Pattern guard: if rest has a single-char token on top and that token is a sep...
  | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
  -- Otherwise, if rest has a token on top (which isn't a sep), grow it
  | (growing:rest') <- rest            = (x:growing):rest'
  -- Or else make a new token (when rest = [])
  | otherwise                          = [x]:rest
  where rest = tokenize seps rems xs
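Going back to the two side remarks at the top (argument order and Data.Set), here is a rough sketch of how they could look together. It assumes the containers package is available; tokenizeSet and arithmeticTokens are names invented for this illustration, and the guards are the same ones as above with elem swapped for Set.member.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Hypothetical Set-based variant of the tokenizer above.
-- Same guards, but membership tests go through Set.member.
tokenizeSet :: Set Char -> Set Char -> String -> [String]
tokenizeSet _ _ "" = []
tokenizeSet seps rems (x:xs)
  | x `Set.member` rems                      = rest
  | x `Set.member` seps                      = [x] : rest
  | ([sep]:_) <- rest, sep `Set.member` seps = [x] : rest
  | (growing:rest') <- rest                  = (x:growing) : rest'
  | otherwise                                = [x] : rest
  where rest = tokenizeSet seps rems xs

-- With the input string last, the two character sets can be fixed once
-- by partial application and the resulting function reused.
arithmeticTokens :: String -> [String]
arithmeticTokens = tokenizeSet (Set.fromList "+-*") (Set.fromList " ")
```

With these definitions, arithmeticTokens "a + b* 12-def" gives ["a","+","b","*","12","-","def"], and map arithmeticTokens ["1+2", "x*y"] tokenizes each string with the same sets.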

You may also use filter:

tokenize seps rems = tokenize' . filter (not . flip elem rems)
  where tokenize' "" = []
        tokenize' (x:xs)
          | x `elem` seps                      = [x]:rest
          | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
          | (growing:rest') <- rest            = (x:growing):rest'
          | otherwise                          = [x]:rest
          where rest = tokenize' xs
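If you would rather lean on library functions than on pattern guards, one possible alternative is to let span collect each run of ordinary characters. This is only a sketch under the same assumption as the filter version above (removed characters are dropped first, so they do not split a token); the name tokenizeSpan is invented here.

```haskell
-- Hypothetical span-based variant; tokenizeSpan is not part of the answer's code.
tokenizeSpan :: [Char] -> [Char] -> String -> [String]
tokenizeSpan seps rems = go . filter (`notElem` rems)
  where
    go "" = []
    go s@(x:xs)
      -- A separator always becomes a one-character token.
      | x `elem` seps = [x] : go xs
      -- Otherwise take the longest run of non-separator characters.
      -- The run is non-empty because x itself is not a separator.
      | otherwise     = let (tok, rest) = span (`notElem` seps) s
                        in tok : go rest
```

For the example from the question, tokenizeSpan "+-*" " " "a + b* 12-def" yields ["a","+","b","*","12","-","def"]. Note that, as with the versions above, a removed character between two ordinary characters does not start a new token: tokenizeSpan "+" " " "a b+c" yields ["ab","+","c"].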

Tags:

haskell

greenbottle

1 Answers

HTNW
