
Grouping in Haskell regular expressions

Tags:

regex

haskell

How can I extract a string using regular expressions in Haskell?

let x = "xyz abc" =~ "(\\w+) \\w+" :: String

That doesn't even get a match.

let x = "xyz abc" =~ "(.*) .*" :: String

That does, but x ends up as "xyz abc". How do I extract only the first regex group, so that x is "xyz"?

sipsorcery asked Apr 08 '11 06:04



2 Answers

I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.

In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
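As a rough illustration of how the annotated result type selects a RegexContext instance, here is a minimal sketch. It assumes the regex-tdfa backend is installed and uses a plain [a-z] pattern that does not depend on Perl syntax; the names input, pat, matched, firstMatch and howMany are made up for this example.

```haskell
import Text.Regex.TDFA ((=~))

input, pat :: String
input = "xyz abc more text"
pat   = "([a-z]+) [a-z]+"

-- Bool context: did the pattern match anywhere?
matched :: Bool
matched = input =~ pat

-- String context: text of the first whole match (groups are not exposed).
firstMatch :: String
firstMatch = input =~ pat

-- Int context: number of non-overlapping matches.
howMany :: Int
howMany = input =~ pat

main :: IO ()
main = do
  print matched     -- True
  print firstMatch  -- "xyz abc"
  print howMany     -- 2
```

The String context is why x in the question comes back as the whole match: that instance never returns capture groups.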

The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is

RegexContext a b (AllTextMatches (Array Int) (MatchText b))

type MatchText source = Array Int (source, (MatchOffset, MatchLength))

which can be used to get a MatchText for everything:

let x :: Array Int (MatchText String)
    x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"

At that point x is an Array Int of matches, each of which is an Array Int of group matches.
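To pull the first group out of that nested array, something like the following sketch should work. It assumes regex-pcre is installed (the pattern uses \w), and the names allMatches and firstGroups are hypothetical.

```haskell
import Data.Array (Array, elems, (!))
import Text.Regex.PCRE ((=~), getAllTextMatches, MatchText)

-- Every match, each one an array of (text, (offset, length)) indexed by group.
allMatches :: Array Int (MatchText String)
allMatches = getAllTextMatches ("xyz abc" =~ "(\\w+) \\w+")

-- Text of capture group 1 for each match; index 0 holds the whole match.
firstGroups :: [String]
firstGroups = [ fst (m ! 1) | m <- elems allMatches ]

main :: IO ()
main = print firstGroups  -- ["xyz"]
```

The snd component of each element carries the offset and length, which is what this instance offers beyond the simpler list-based ones.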

Note that "\w" is Perl syntax, so you need regex-pcre to use it. If you want Unix/POSIX extended regular expressions, use regex-tdfa, which is cross-platform, and avoid regex-posix, which hits each platform's bugs in its implementation of the regex.h library.
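For the POSIX route, a minimal sketch with regex-tdfa might look like this. The bracket expression [[:alnum:]_] is used here as a rough stand-in for Perl's \w, and wordPattern and firstGroup are names invented for the example.

```haskell
import Text.Regex.TDFA ((=~))

-- POSIX bracket expression approximating \w (letters, digits, underscore).
wordPattern :: String
wordPattern = "([[:alnum:]_]+) [[:alnum:]_]+"

-- The 4-tuple context yields (before, matched, after, [captured groups]).
firstGroup :: String -> Maybe String
firstGroup input =
  case (input =~ wordPattern :: (String, String, String, [String])) of
    (_, _, _, g : _) -> Just g
    _                -> Nothing  -- no match: the group list is empty

main :: IO ()
main = do
  print (firstGroup "xyz abc")  -- Just "xyz"
  print (firstGroup "nospace")  -- Nothing
```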

Note that Perl vs POSIX is not just a matter of syntax like "\w". They use very different algorithms and often return different results. The time and space complexity are also very different: for matching against a string of length n, Perl style (regex-pcre) can be O(exp(n)) in time, while POSIX style using regex-tdfa is always O(n) in time.

Chris Kuklewicz answered Sep 23 '22 02:09


Annotate the result type as [[String]]. Then you'll get a list of matches, each being a list of the matched text followed by the captured subgroups.

Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
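To get just "xyz" out of that result, one option is to pattern match on the nested list. This is only a sketch, assuming regex-pcre as in the session above; firstGroup is a hypothetical helper name.

```haskell
import Text.Regex.PCRE ((=~))

-- First capture group of the first match, or Nothing when nothing matches.
firstGroup :: String -> String -> Maybe String
firstGroup pat input =
  case (input =~ pat :: [[String]]) of
    (_ : g : _) : _ -> Just g   -- skip the whole-match text, take group 1
    _               -> Nothing

main :: IO ()
main = print (firstGroup "(\\w+) \\w+" "xyz abc more text")  -- Just "xyz"
```

Returning Maybe avoids a runtime error from head or !! when the input does not match.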
kennytm answered Sep 26 '22 02:09