
What approach for simple text processing in Haskell?

Tags:

haskell

nlp

I am trying to do some simple text processing in Haskell, and I am wondering what the best way to go about this in an FP language might be. I looked at the parsec module, but it seems much more sophisticated than what I am looking for as a new Haskeller. What would be the best way to strip all the punctuation from a corpus of text? My naive approach was to write a function like this:

removePunc str = [c | c <- str, c /= '.',
                                 c /= '?',
                                 c /= '!',
                                 c /= '-',
                                 c /= ';',
                                 c /= '\'',
                                 c /= '\"']

turtle asked Jul 11 '12 01:07


3 Answers

A possibly more efficient method (O(log n) per lookup rather than O(n), where n is the number of punctuation characters) is to use a Set (from Data.Set):

import qualified Data.Set as S

punctuation = S.fromList ".?!,-;'\""

removePunc = filter (`S.notMember` punctuation)

You must construct the set outside the function so that it is computed only once and shared across all calls: building the set costs far more than a single linear-time notElem test of the kind the other answers suggest.

Note: this is such a small case that the overhead of a Set might outweigh its asymptotic benefit over a list, so if absolute performance matters, profile both versions.
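
For reference, here is a minimal, self-contained sketch of the Set approach with explicit type signatures. The exact set of characters is an assumption made for illustration (it is not a definitive list of punctuation), so adjust it to your corpus.

```haskell
import qualified Data.Set as S

-- Characters to strip. This particular set is an assumption for
-- illustration; extend or shrink it to suit the corpus.
punctuation :: S.Set Char
punctuation = S.fromList ".?!,-;'\""

-- Keep only the characters that are not members of the set.
-- 'punctuation' is a top-level constant, so it is built once and
-- shared by every call to 'removePunc'.
removePunc :: String -> String
removePunc = filter (`S.notMember` punctuation)
```

Loaded into GHCi, `removePunc "Hello, world! It's fine."` evaluates to `"Hello world Its fine"`. Data.Set lives in the containers package, which ships with GHC, but a Cabal or Stack project still has to list it as a dependency.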


huon answered Nov 09 '22 12:11


You can simply write your code as:

removePunc = filter (`notElem` ".?!-;\'\"")

or

removePunc = filter (flip notElem ".?!-;\'\"")
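
As a rough sketch of how this might slot into a small text-processing pipeline, the version below adds a type signature and composes the function with `words` from the Prelude. The `tokens` helper is hypothetical (it is not part of the answer above) and only illustrates one way to go from a cleaned string to a list of words.

```haskell
-- List-based version from the answer, with an explicit type signature.
removePunc :: String -> String
removePunc = filter (`notElem` ".?!-;'\"")

-- Hypothetical helper (an assumption, not from the original answer):
-- strip punctuation first, then split on whitespace.
tokens :: String -> [String]
tokens = words . removePunc
```

In GHCi, `tokens "Is this it? Yes; it's done."` gives `["Is","this","it","Yes","its","done"]`. Note that dropping '-' glues hyphenated words together (`removePunc "well-known"` is `"wellknown"`), which may or may not be what you want for a corpus.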

Ronson answered Nov 09 '22 10:11


You can group your characters in a String and use notElem:

[c | c <- str, c `notElem` ".?!,-;"]

or in a more functional style:

filter (\c -> c `notElem` ".?!,-;") str
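
Both expressions above assume a `str` already in scope. A possible way to package them, sketched here with hypothetical names (`stripChars` and `stripChars'` are not from the original answer), is to take the unwanted characters as an argument so the same function can be reused with different character lists:

```haskell
-- Hypothetical generalisation: the characters to drop are passed in.
-- List-comprehension form.
stripChars :: [Char] -> String -> String
stripChars unwanted str = [c | c <- str, c `notElem` unwanted]

-- The same function written with filter; it should agree with
-- 'stripChars' on every input.
stripChars' :: [Char] -> String -> String
stripChars' unwanted = filter (`notElem` unwanted)
```

For example, `stripChars ".?!,-;" "Wait, what?"` evaluates to `"Wait what"`, and partially applying it (`stripChars ".?!,-;"`) gives back a `removePunc`-style function.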

Daniel answered Nov 09 '22 12:11