Indices of all matches of a regex

Tags:

haskell

I am trying to match all the occurrences of a regex and get the indices as a result. The example from Real World Haskell says I can do

string =~ regex :: [(Int, Int)]

However, this is broken since the regex library has been updated since the publication of RWH. (See All matches of regex in Haskell and "=~" raise "No instance for (RegexContext Regex [Char] [String])"). What is the correct way to do this?

Update:

I found matchAll which might give me what I want. I have no idea how to use it, though.


asked Feb 06 '14 05:02

Code-Apprentice

1 Answer

The key to using matchAll is the type annotation :: Regex when creating the regex:

import Text.Regex
import Text.Regex.Base

re = makeRegex "[^aeiou]" :: Regex
test = matchAll re "the quick brown fox"

This returns a list of arrays, one per match. Element 0 of each array is the (offset, length) pair for the whole match, so to get a list of those pairs, index each array at 0:

import Data.Array ((!))

matches = map (!0) $ matchAll re "the quick brown fox"
-- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]
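If the matched text is needed as well as its position, each (offset, length) pair can be used to cut a substring out of the input. The following is only a sketch built on the matchAll approach above; the helper names matchSpans and slice are made up for illustration and are not part of the regex libraries.

```haskell
import Data.Array ((!))
import Text.Regex.Base (makeRegex, matchAll)
import Text.Regex.Posix (Regex)

-- | Offset and length of every whole match (hypothetical helper name).
matchSpans :: String -> String -> [(Int, Int)]
matchSpans pat str = map (! 0) (matchAll re str)
  where
    -- The annotation selects the POSIX backend's Regex type.
    re = makeRegex pat :: Regex

-- | Cut the matched text out of the input (hypothetical helper name).
slice :: String -> (Int, Int) -> String
slice str (off, len) = take len (drop off str)

main :: IO ()
main = do
  let text  = "the quick brown fox"
      spans = matchSpans "qu|ox" text
  print spans                     -- [(4,2),(17,2)]
  print (map (slice text) spans)  -- ["qu","ox"]
```

Using take and drop on a String is linear in the offset, which is fine for short inputs; for large inputs a packed string type would probably be a better fit.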

The =~ operator still works, but the available result types have changed since RWH was published. Use the predefined types MatchOffset and MatchLength and the special type constructor AllMatches:

import Text.Regex.Posix

re = "[^aeiou]"
text = "the quick brown fox"

test1 = text =~ re :: Bool
  -- True

test2 = text =~ re :: String
  -- "t"

test3 = text =~ re :: (MatchOffset,MatchLength)
  -- (0,1)

test4 = text =~ re :: AllMatches [] (MatchOffset, MatchLength)
  -- (not showable)

test4' = getAllMatches $ (text =~ re :: AllMatches [] (MatchOffset, MatchLength))
  -- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]

See the docs for Text.Regex.Base.Context for more details on what contexts are available.
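As a rough illustration of choosing a context, the annotations can be hidden behind small wrapper functions whose signatures fix the result type. This sketch assumes the regex-posix package; allSpans and allTexts are hypothetical names, while AllTextMatches and getAllTextMatches are the text-returning counterparts of AllMatches and getAllMatches.

```haskell
import Text.Regex.Posix

-- | Offset and length of every match (hypothetical helper name).
-- The signature pins the context to AllMatches [] (MatchOffset, MatchLength).
allSpans :: String -> String -> [(MatchOffset, MatchLength)]
allSpans text re = getAllMatches (text =~ re)

-- | Text of every match (hypothetical helper name).
-- Here the context is AllTextMatches [] String.
allTexts :: String -> String -> [String]
allTexts text re = getAllTextMatches (text =~ re)

main :: IO ()
main = do
  print (allSpans "the quick brown fox" "[aeiou]+")  -- [(2,1),(5,2),(12,1),(17,1)]
  print (allTexts "the quick brown fox" "[aeiou]+")  -- ["e","ui","o","o"]
```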

UPDATE: I believe the type constructor AllMatches was introduced to resolve the ambiguity that arises when a regex has subexpressions, for example:

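-- foo is used at two result types below; in a compiled module it needs a
-- polymorphic type signature because of the monomorphism restriction
foo = "axx ayy" =~ "a(.)([^a])"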

test1 = getAllMatches $ (foo :: AllMatches [] (MatchOffset, MatchLength))
  -- [(0,3),(4,3)]
  -- returns the locations of "axx" and "ayy" but no subexpression info

test2 = foo :: MatchArray
  -- array (0,2) [(0,(0,3)),(1,(1,1)),(2,(2,1))]
  -- returns only the match with "axx"

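Both are essentially a list of offset-length pairs, but they mean different things: the first has one pair per match, while the second describes a single match, with the whole match at index 0 followed by one pair per subexpression.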
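To get both kinds of information at once, that is every match together with its subexpressions, one option is to go back to matchAll and keep each whole array instead of only element 0. This is a sketch under that assumption; allGroupSpans is a hypothetical name.

```haskell
import Data.Array (elems)
import Text.Regex.Base (makeRegex, matchAll)
import Text.Regex.Posix (Regex)

-- | For every match, the (offset, length) of the whole match followed by
-- the (offset, length) of each subexpression (hypothetical helper name).
allGroupSpans :: String -> String -> [[(Int, Int)]]
allGroupSpans pat str = map elems (matchAll re str)
  where
    re = makeRegex pat :: Regex

main :: IO ()
main = mapM_ print (allGroupSpans "a(.)([^a])" "axx ayy")
  -- [(0,3),(1,1),(2,1)]
  -- [(4,3),(5,1),(6,1)]
```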


answered Oct 11 '22 11:10

ErikR
