
Word splitting with regular expressions in Haskell

Tags: regex, haskell

Several packages provide regular expressions in Haskell (e.g. Text.Regex.Base, Text.Regex.Posix). Most of the ones I have seen so far support only a subset of the regex syntax I know. For example, I am used to splitting a sentence into words with the following regex:

\\w+

None of the Haskell packages I have tried so far support this (neither the ones mentioned above nor Text.Regex.TDFA). I know that with Posix, [[:word:]]+ would have the same effect, but I would like to use the variant shown above.

This leaves me with three questions:

  1. Is there any package that achieves this?
  2. If there is, why is a different syntax in common use?
  3. What are the advantages and disadvantages of each?
beyeran asked Dec 07 '11 14:12




1 Answer

'\w' is a Perl-style escape that PCRE supports. In Haskell you can get PCRE through my regex-pcre package or the pcre-light library. If your input is a String (a list of Char), the 'words' function in the standard Prelude may be enough; if your input is an ASCII ByteString, the 'words' function in Data.ByteString.Char8 may work. There may be a UTF-8 library with word splitting, but I cannot quickly find one.
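As a rough illustration of the non-regex route, the sketch below compares 'words' from the Prelude, 'words' from Data.ByteString.Char8, and a small hand-written splitter. The helpers isWordChar and wordRuns are hypothetical names introduced only for this example. They assume that '\w' means ASCII letters, digits and underscore, which is the usual PCRE default. Nothing here uses a regex library.

```haskell
import Data.Char (isAlphaNum, isAscii)
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper (not from any library): True for the characters
-- assumed to be matched by \w, i.e. ASCII letters, digits and underscore.
isWordChar :: Char -> Bool
isWordChar c = c == '_' || (isAscii c && isAlphaNum c)

-- Hypothetical helper: collect maximal runs of word characters,
-- roughly what repeatedly matching \w+ would return.
wordRuns :: String -> [String]
wordRuns s =
  case dropWhile (not . isWordChar) s of
    []   -> []
    rest -> let (w, remaining) = span isWordChar rest
            in w : wordRuns remaining

main :: IO ()
main = do
  let sentence = "Hello, world! It's 2011."
  -- Prelude 'words' splits on whitespace only, so punctuation stays attached.
  print (words sentence)             -- ["Hello,","world!","It's","2011."]
  -- The hand-written splitter drops punctuation, as \w+ would.
  print (wordRuns sentence)          -- ["Hello","world","It","s","2011"]
  -- ByteString variant of 'words', suitable for ASCII input.
  print (B.words (B.pack sentence))  -- ["Hello,","world!","It's","2011."]
```

Note that 'words' and '\w+' are not equivalent: the former keeps punctuation attached to each token, while the latter splits "It's" into two matches. Which one is preferable depends on the input.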

Chris Kuklewicz answered Oct 07 '22 02:10
