Scala regexps: how to return matches as array or list

Tags: regex, scala
Is there a simple way to return regex matches as an array?
Here is what I tried in Scala 2.7.7:

val s = """6 1 2"""
val re = """(\d+)\s(\d+)\s(\d+)""".r
for (m <- re.findAllIn (s)) println (m) // prints "6 1 2"
re.findAllIn (s).toList.length // 3? No! It returns 1!

But I then tried:

s match {
  case re (m1, m2, m3) => println (m1)
}

And this works fine! m1 is 6, m2 is 1, etc.

Then I found something that added to my confusion:

val mit = re.findAllIn (s)
println (mit.toString)
println (mit.length)
println (mit.toString)

That prints:

non-empty iterator
1
empty iterator

The "length" call somehow modifies the state of the iterator. What is going on here?

Dfr asked Jan 14 '10 17:01


2 Answers

OK, first of all, understand that findAllIn returns an Iterator. An Iterator is a consume-once mutable object: traversing it (with length, toList, a for loop, and so on) uses it up, which is why your second toString call reports an empty iterator. Read up on iterators if you are not familiar with them. If you want the result to be reusable, convert the result of findAllIn into a List, and only use that list.
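As a rough illustration of the consume-once behaviour, here is a minimal sketch that assumes a reasonably recent Scala (2.13 or 3) rather than the 2.7.7 from the question. The names count, exhausted and matches are only for this example.

```scala
val re = """(\d+)\s(\d+)\s(\d+)""".r
val s = "6 1 2"

// Calling length walks the iterator to its end, so nothing is left afterwards.
val it = re.findAllIn(s)
val count = it.length          // 1: the whole string is a single match
val exhausted = !it.hasNext    // true: the iterator has been used up

// Converting to a List once gives a value that can be inspected repeatedly.
val matches = re.findAllIn(s).toList
println(matches.length)        // prints 1
println(matches)               // prints List(6 1 2)
```

The list holds one element because the regex matches the whole string once; the three numbers are groups inside that single match.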

Now, it seems you want all matching groups, not all matches. The method findAllIn returns every match of the full regex that can be found in the string. For example:

scala> val s = """6 1 2, 4 1 3"""
s: java.lang.String = 6 1 2, 4 1 3

scala> val re = """(\d+)\s(\d+)\s(\d+)""".r
re: scala.util.matching.Regex = (\d+)\s(\d+)\s(\d+)

scala> for(m <- re.findAllIn(s)) println(m)
6 1 2
4 1 3

See that there are two matches, and neither of them includes the ", " in the middle of the string, since that's not part of any match.

If you want the groups, you can get them like this:

scala> val s = """6 1 2"""
s: java.lang.String = 6 1 2

scala> re.findFirstMatchIn(s)
res4: Option[scala.util.matching.Regex.Match] = Some(6 1 2)

scala> res4.get.subgroups
res5: List[String] = List(6, 1, 2)
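Calling get on the Option throws if nothing matches. A possible way to avoid that, sketched here for a recent Scala version, is to map over the Option and fall back to an empty list. The helper name groupsOf is hypothetical.

```scala
val re = """(\d+)\s(\d+)\s(\d+)""".r

// Hypothetical helper: groups of the first match, or Nil when there is no match.
def groupsOf(s: String): List[String] =
  re.findFirstMatchIn(s).map(_.subgroups).getOrElse(Nil)

println(groupsOf("6 1 2"))      // prints List(6, 1, 2)
println(groupsOf("no digits"))  // prints List()
```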

Or, using findAllIn, like this:

scala> val s = """6 1 2"""
s: java.lang.String = 6 1 2

scala> for(m <- re.findAllIn(s).matchData; e <- m.subgroups) println(e)
6
1
2

The matchData method will make an Iterator that returns Match instead of String.
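If you would rather collect the groups than print them, something along these lines should work in a recent Scala version. It is only a sketch, and the name groupsPerMatch is made up for the example.

```scala
val re = """(\d+)\s(\d+)\s(\d+)""".r
val s = "6 1 2, 4 1 3"

// One inner list of groups per match of the full regex.
val groupsPerMatch: List[List[String]] =
  re.findAllIn(s).matchData.map(_.subgroups).toList

println(groupsPerMatch)  // prints List(List(6, 1, 2), List(4, 1, 3))
```

Calling toList at the end also sidesteps the consume-once problem, since the result is an ordinary immutable list.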

Daniel C. Sobral answered Nov 11 '22 16:11


There is a difference between how unapplySeq interprets multiple groups and how findAllIn does. findAllIn scans your pattern over the string and returns each substring that matches (advancing by the length of the match if it succeeds, or by one character if it fails).

So, for example:

scala> val s = "gecko 6 1 2 3 4 5"
scala> re.findAllIn(s).toList
res3: List[String] = List(6 1 2, 3 4 5)

On the other hand, unapplySeq requires the pattern to match the entire string.

scala> re.unapplySeq(s)
res4: Option[List[String]] = None
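For contrast, here is a small sketch (assuming a recent Scala version) of unapplySeq on a string that the pattern covers completely, next to the string above that it does not. The names exact, padded and viaMatch are illustrative only.

```scala
val re = """(\d+)\s(\d+)\s(\d+)""".r

// The pattern covers the whole input, so the groups come back.
val exact = re.unapplySeq("6 1 2")

// Extra text around the digits means the whole-string match fails.
val padded = re.unapplySeq("gecko 6 1 2 3 4 5")

// A pattern match on the regex goes through the same unapplySeq.
val viaMatch = "6 1 2" match {
  case re(a, b, c) => List(a, b, c)
  case _           => Nil
}

println(exact)   // prints Some(List(6, 1, 2))
println(padded)  // prints None
```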

So, if you want to pull out the groups from a string that your regex matches exactly, use unapplySeq. If you want to find the substrings that look like your regex pattern, use findAllIn. If you want to do both, chain them yourself:

scala> re.findAllIn(s).flatMap(text => re.unapplySeq(text).elements )
res5: List[List[String]] = List(List(6, 1, 2), List(3, 4, 5))
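The elements method used above belongs to the 2.7-era collections and was later renamed iterator. On a recent Scala version, one way to write the same chaining is with collect and the regex as an extractor. This is a sketch, not the only option, and parsed is just an example name.

```scala
val re = """(\d+)\s(\d+)\s(\d+)""".r
val s = "gecko 6 1 2 3 4 5"

// findAllIn yields each matching substring; the extractor then splits it into groups.
val parsed: List[List[String]] =
  re.findAllIn(s).collect { case re(a, b, c) => List(a, b, c) }.toList

println(parsed)  // prints List(List(6, 1, 2), List(3, 4, 5))
```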
Rex Kerr answered Nov 11 '22 17:11