How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group, so that x is "xyz"?
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g".
This backend provides a Haskell interface for the "posix" c-library that comes with most operating systems, and is provided by include "regex.
I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
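At this point x is an Array Int of matches, each of which is itself an Array Int of group matches. Index 0 of each inner array holds the whole match, and indices 1 and up hold the capture groups.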
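To make the nested-array shape concrete, here is a minimal sketch of pulling one group's text out of every match. It assumes regex-tdfa rather than regex-pcre, so the pattern would use POSIX bracket classes instead of "\w", and groupTexts is a hypothetical helper name, not part of any library.

```haskell
import Data.Array (Array, elems, (!))
import Text.Regex.TDFA ((=~), getAllTextMatches, MatchText)

-- Hypothetical helper: the text of capture group g from every match.
-- Group 0 is the whole match; groups 1.. are the parenthesised subgroups.
-- Indexing a group the pattern does not define raises an array bounds error.
groupTexts :: Int -> String -> String -> [String]
groupTexts g pat input = [ fst (m ! g) | m <- elems allMatches ]
  where
    -- Outer array: one entry per match. Inner array: one entry per group,
    -- each holding (matched text, (offset, length)).
    allMatches :: Array Int (MatchText String)
    allMatches = getAllTextMatches (input =~ pat)
```

With this sketch, groupTexts 1 "([[:alnum:]_]+) [[:alnum:]_]+" "xyz abc more text" should give ["xyz","more"].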
Note that "\w" is Perl syntax, so you need regex-pcre to use it. If you want Unix/Posix extended regular expressions, you should use regex-tdfa, which is cross-platform, and avoid regex-posix, which exposes each platform's bugs in its implementation of the regex.h library.
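As a rough illustration of the Posix route, the sketch below spells the word class as a POSIX bracket expression and runs it through regex-tdfa. The names wordPattern and allWords are made up for this example.

```haskell
import Text.Regex.TDFA ((=~), getAllTextMatches)

-- POSIX bracket expression standing in for Perl's \w
-- (letters, digits, and underscore).
wordPattern :: String
wordPattern = "[[:alnum:]_]+"

-- Hypothetical helper: every maximal run of word characters in the input.
allWords :: String -> [String]
allWords input = getAllTextMatches (input =~ wordPattern)
```

Here allWords "xyz abc" should give ["xyz","abc"].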
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.
Annotate the result with the type [[String]]. Then you'll get a list of matches, each being a list of the matched text followed by the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
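If only the first group of the first match is wanted, as in the question, one option is the 4-tuple result type (text before, matched text, text after, subgroups). This is a sketch under the assumption that regex-tdfa is used, so the pattern is written with POSIX classes; the same shape should work with Text.Regex.PCRE and "\\w". firstGroup is a hypothetical helper name.

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: first capture group of the first match, if any.
-- Returns Nothing when the pattern does not match, and also when the
-- pattern matches but contains no capture groups.
firstGroup :: String -> String -> Maybe String
firstGroup pat input =
  case input =~ pat :: (String, String, String, [String]) of
    (_, _, _, g : _) -> Just g
    _                -> Nothing
```

For example, firstGroup "([[:alnum:]_]+) [[:alnum:]_]+" "xyz abc" should give Just "xyz".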