Indices of all matches of a regex

Tags: regex, haskell

I am trying to match all the occurrences of a regex and get the indices as a result. The example from Real World Haskell says I can do

string =~ regex :: [(Int, Int)]

However, this no longer works because the regex library has changed since RWH was published. (See All matches of regex in Haskell and "=~" raise "No instance for (RegexContext Regex [Char] [String])"). What is the correct way to do this?

Update:

I found matchAll which might give me what I want. I have no idea how to use it, though.

Code-Apprentice asked Feb 06 '14 05:02




1 Answer

The key to using matchAll is to add the type annotation :: Regex when creating the regex:

import Text.Regex
import Text.Regex.Base

re = makeRegex "[^aeiou]" :: Regex
test = matchAll re "the quick brown fox"

This returns a list of arrays, one per match. Each array is indexed by subexpression number, with index 0 holding the whole match. To get a list of (offset, length) pairs, take element 0 of each array:

import Data.Array ((!))

matches = map (!0) $ matchAll re "the quick brown fox"
-- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]
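The two snippets above can be combined into one self-contained program. This is only a sketch: it assumes the regex-posix package is installed, and matchSpans is a hypothetical helper name, not part of any library.

```haskell
import Data.Array ((!))
import Text.Regex.Posix

-- | Hypothetical helper: the (offset, length) of every whole match
-- of the pattern in the input string.
matchSpans :: String -> String -> [(MatchOffset, MatchLength)]
matchSpans pat str = map (! 0) (matchAll re str)
  where
    -- The annotation tells makeRegex which backend's Regex type to build.
    re = makeRegex pat :: Regex

main :: IO ()
main = print (matchSpans "[^aeiou]" "the quick brown fox")
```

Compiling the pattern inside the helper keeps the :: Regex annotation in one place, so callers only deal with plain strings.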

The =~ operator still works, but the available result types have changed since RWH. Use the predefined types MatchOffset and MatchLength together with the type constructor AllMatches:

import Text.Regex.Posix

re = "[^aeiou]"
text = "the quick brown fox"

test1 = text =~ re :: Bool
  -- True

test2 = text =~ re :: String
  -- "t"

test3 = text =~ re :: (MatchOffset,MatchLength)
  -- (0,1)

test4 = text =~ re :: AllMatches [] (MatchOffset, MatchLength)
  -- (not showable)

test4' = getAllMatches $ (text =~ re :: AllMatches [] (MatchOffset, MatchLength))
  -- [(0,1),(1,1),(3,1),(4,1),(7,1),(8,1),(9,1),(10,1),(11,1),(13,1),(14,1),(15,1),(16,1),(18,1)]

See the docs for Text.Regex.Base.Context for more details on what contexts are available.
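Since the results are (offset, length) pairs rather than start and end indices, a small conversion step may be useful. The sketch below needs only base. toRange and slice are hypothetical helper names, and plain Int pairs stand in for (MatchOffset, MatchLength), which are type synonyms for Int in regex-base.

```haskell
-- | Hypothetical helper: turn an (offset, length) pair into a
-- half-open (start, end) index range.
toRange :: (Int, Int) -> (Int, Int)
toRange (off, len) = (off, off + len)

-- | Hypothetical helper: extract the substring that an
-- (offset, length) pair refers to.
slice :: (Int, Int) -> String -> String
slice (off, len) = take len . drop off

main :: IO ()
main = do
  let text  = "the quick brown fox"
      spans = [(0, 1), (4, 5)]  -- sample pairs, not produced by a regex here
  print (map toRange spans)          -- [(0,1),(4,9)]
  print (map (`slice` text) spans)   -- ["t","quick"]
```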

UPDATE: I believe the type constructor AllMatches was introduced to resolve the ambiguity that arises when a regex has subexpressions, e.g.:

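text = "axx ayy"
re = "a(.)([^a])"

-- Each use is annotated separately: a single unannotated top-level binding
-- for the match result cannot take two different result types.
test1 = getAllMatches (text =~ re :: AllMatches [] (MatchOffset, MatchLength))
  -- [(0,3),(4,3)]
  -- returns the locations of "axx" and "ayy" but no subexpression info

test2 = text =~ re :: MatchArray
  -- array (0,2) [(0,(0,3)),(1,(1,1)),(2,(2,1))]
  -- returns only the match with "axx"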

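Both results are collections of offset-length pairs, but they mean different things: the first lists the whole-match location of every match, while the second lists the whole match and each subexpression for the first match only.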
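If both kinds of information are wanted at once (every match, each with its subexpressions), matchAll already provides it, because it returns one array per match. A minimal sketch, assuming the regex-posix package is installed; allSubmatchSpans is a hypothetical name:

```haskell
import Data.Array (elems)
import Text.Regex.Posix

-- | Hypothetical helper: for every match, the (offset, length) of the
-- whole match followed by the (offset, length) of each subexpression.
allSubmatchSpans :: String -> String -> [[(MatchOffset, MatchLength)]]
allSubmatchSpans pat str = map elems (matchAll re str)
  where
    re = makeRegex pat :: Regex

main :: IO ()
main = mapM_ print (allSubmatchSpans "a(.)([^a])" "axx ayy")
```

Each inner list starts with the whole match, so taking the head of every inner list recovers the same pairs that the AllMatches context gives.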

ErikR answered Oct 11 '22 11:10