
Multiline Matching in Haskell Posix

I can't seem to find decent documentation on Haskell's POSIX regex implementation, specifically the module Text.Regex.Posix.

Can anyone point me in the right direction for doing multiline matching on a string?

A snippet for the curious:

> extractToken body = body =~ "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>" :: String

I'm trying to extract the source of Wikipedia pages, but this method clearly falls over when more than one line is involved.

Ian Elliott asked Jan 23 '23 12:01


2 Answers

You may need to import Text.Regex.Base.RegexLike for access to makeRegexOpts and friends.

extractToken body = match regex body where
    regex = makeRegexOpts (defaultCompOpt - compNewline) defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

Since Text.Regex.Posix's defaultCompOpt = compExtended + compNewline, and compNewline is the flag that stops . from matching a newline, that is equivalent to

extractToken body = match regex body where
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

To pull out just the first group, use one of the other RegexContext instances as the result type. One possibility is

extractToken body = head groups where
    (preMatch, inMatch, postMatch, groups) =
        match regex body :: (String, String, String, [String])
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

ephemient answered Jan 31 '23 19:01

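A hedged follow-up to the makeRegexOpts answer above: head throws an exception when nothing matches, so one option is to pattern-match on the group list and return a Maybe instead. This is only a sketch; textareaRegex and extractTokenMaybe are names made up for illustration, and it assumes the regex-posix and regex-base packages are available.

```haskell
import Text.Regex.Posix
import Text.Regex.Base.RegexLike (makeRegexOpts, match)

-- Compiled without compNewline, so '.' is allowed to match '\n'.
-- (textareaRegex is a hypothetical name, not part of any library.)
textareaRegex :: Regex
textareaRegex =
  makeRegexOpts compExtended defaultExecOpt
    "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

-- Return the first capture group, or Nothing when the pattern does not match.
-- (extractTokenMaybe is a hypothetical name used for this sketch.)
extractTokenMaybe :: String -> Maybe String
extractTokenMaybe body =
  case match textareaRegex body :: (String, String, String, [String]) of
    (_, _, _, firstGroup : _) -> Just firstGroup
    _                         -> Nothing
```

With this, a page without the textarea yields Nothing rather than a runtime error. Keep in mind that POSIX matching is leftmost-longest, so (.*) extends to the last </textarea> in the input if there are several.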

If you need more flexibility or better performance than POSIX regexes offer, you may need to use a PCRE backend instead.

pcre-light and regex-pcre are both fine.
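
For comparison, here is a rough sketch of the same extraction with regex-pcre, assuming that package (and the PCRE C library it binds to) is installed. It relies on compDotAll so that . matches newlines, and on PCRE's lazy quantifier (.*?) so the capture stops at the first </textarea>. The names textareaRegexPcre and extractTokenPcre are hypothetical.

```haskell
import Data.Bits ((.|.))
import Text.Regex.PCRE
import Text.Regex.Base.RegexLike (makeRegexOpts, match)

-- compDotAll makes '.' match newlines; '(.*?)' is a lazy (non-greedy) capture.
-- (textareaRegexPcre is a hypothetical name, not part of any library.)
textareaRegexPcre :: Regex
textareaRegexPcre =
  makeRegexOpts (defaultCompOpt .|. compDotAll) defaultExecOpt
    "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*?)</textarea>"

-- Return the first capture group, or Nothing when the pattern does not match.
-- (extractTokenPcre is a hypothetical name used for this sketch.)
extractTokenPcre :: String -> Maybe String
extractTokenPcre body =
  case match textareaRegexPcre body :: (String, String, String, [String]) of
    (_, _, _, firstGroup : _) -> Just firstGroup
    _                         -> Nothing
```

The lazy quantifier is the main practical difference from the POSIX version: when the input contains several textareas, only the contents of the first matching one are captured.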

Don Stewart answered Jan 31 '23 19:01