I am trying to do some simple text processing in Haskell, and I am wondering what might be the best way to go about this in an FP language. I looked at the parsec library, but it seems much more sophisticated than what I am looking for as a new Haskeller. What would be the best way to strip all the punctuation from a corpus of text? My naive approach was to write a function like this:
removePunc str = [c | c <- str, c /= '.',
                                c /= '?',
                                c /= '!',
                                c /= '-',
                                c /= ';',
                                c /= '\'',
                                c /= '\"']
A possibly more efficient method (O(log n) per lookup rather than O(n), where n is the number of punctuation characters) is to use a Set (from Data.Set):
import qualified Data.Set as S
punctuation = S.fromList ".?!,-;'\""
removePunc = filter (`S.notMember` punctuation)
You must construct the set outside the function so that it is computed only once and shared across all calls, since the overhead of creating the set is much larger than the simple linear-time notElem test others have suggested.
Note: this is such a small case that the extra overhead of a Set might outweigh its asymptotic benefit over the list, so if you are looking for absolute performance, profile both versions.
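For reference, here is a minimal self-contained sketch of the Set approach with type signatures added. The exact contents of the punctuation set are an assumption for illustration; adjust them to whatever you actually want to strip.

```haskell
import qualified Data.Set as S

-- Top-level constant, so the set is built at most once and then
-- shared by every call to removePunc.
-- The characters listed here are an assumption for this example.
punctuation :: S.Set Char
punctuation = S.fromList ".?!,-;'\""

-- Keep only the characters that are not members of the set.
removePunc :: String -> String
removePunc = filter (`S.notMember` punctuation)
```

In GHCi, removePunc "Hello, world!" should give "Hello world".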
You can simply write your code:
removePunc = filter (`notElem` ".?!-;\'\"")
or
removePunc = filter (flip notElem ".?!-;\'\"")
You can group your characters in a String and use notElem:
[c | c <- str, c `notElem` ".?!,-;"]
or in a more functional style:
filter (\c -> c `notElem` ".?!,") str
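As a rough sketch, the two styles can be put side by side as complete definitions. The names punctuationChars, removePuncComp and removePuncFilter are made up for this example, and the character list is an assumption.

```haskell
-- Characters to strip (an assumed list, for illustration only).
punctuationChars :: String
punctuationChars = ".?!,-;'\""

-- List-comprehension style: keep each c that is not in the list.
removePuncComp :: String -> String
removePuncComp str = [c | c <- str, c `notElem` punctuationChars]

-- filter style, written point-free with an operator section.
removePuncFilter :: String -> String
removePuncFilter = filter (`notElem` punctuationChars)
```

Both should behave identically; for example, either one applied to "Wait... what?!" gives "Wait what".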