Scala Regular Expressions (string delimited by double quotes)

I am new to Scala. I am trying to match a string delimited by double quotes, and I am a bit puzzled by the following behavior:

If I do the following:

val stringRegex = """"([^"]*)"(.*$)"""
val regex = stringRegex.r
val tidyTokens = Array[String]("1", "\"test\"", "'c'", "-23.3")
tidyTokens.foreach {
    token => if (token.matches (stringRegex)) println (token + " matches!")
}

I get

"test" matches!

Otherwise, if I do the following:

tidyTokens.foreach {
    token => token match {
        case regex(token) => println (token + " matches!")
        case _ => println ("No match for token " + token)
    }
}

I get

No match for token 1
No match for token "test"
No match for token 'c'
No match for token -23.3

Why doesn't "test" match in the second case?

gbgnv asked Feb 27 '13 18:02


1 Answer

Take your regular expression:

 "([^"]*)"(.*$)

When compiled with .r, this string yields a Regex object which, if it matches its input string, must yield 2 captured strings: one for the ([^"]*) group and the other for the (.*$) group. Your code

  case regex(token) => ...

ought to reflect this, so you probably want

  case regex(token, otherStuff) => ...

or just

  case regex(token, _) => ...

Why? The case regex(matchedCaptures...) syntax works because regex has an unapplySeq method, which lets it act as an extractor. case regex(token) => ... translates (roughly) to:

 case List(token) => ...

Where List(token) is what regex.unapplySeq( inputString ) returns:

 regex.unapplySeq("\"test\"") // Returns Some(List("test", ""))

Your regex does match the string "test", but in the case clause the extractor's unapplySeq method returns a list of 2 strings, because the regex has two capturing groups. A pattern with a single binder, regex(token), only matches a list of exactly one element, so the match falls through to the default case. That's unfortunate, but the compiler can't help you here because regular expressions are compiled from strings at runtime.
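To make the arity point concrete, here is a minimal sketch that reuses the regex from the question. The names TwoGroupDemo and describe are made up for illustration; the only library calls are .r and Regex.unapplySeq.

```scala
object TwoGroupDemo {
  // Same pattern as in the question: two capturing groups.
  val regex = """"([^"]*)"(.*$)""".r

  // Hypothetical helper: binds both groups and ignores the second with `_`.
  def describe(token: String): String = token match {
    case regex(inner, _) => inner + " matches!"
    case _               => "No match for token " + token
  }

  def main(args: Array[String]): Unit = {
    // The extractor returns one string per capturing group,
    // i.e. Some(List("test", "")) for the quoted token.
    println(regex.unapplySeq("\"test\""))
    Array("1", "\"test\"", "'c'", "-23.3").foreach(t => println(describe(t)))
  }
}
```

With two binders in the pattern, only the quoted token should take the first branch; the other three tokens still fall through to the default case.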

One alternative would be to use a non-capturing group:

 val stringRegex = """"([^"]*)"(?:.*$)"""
 //                             ^^

Then your code would work, because regex will now be an extractor object whose unapplySeq method returns only a single captured group:

 tidyTokens foreach { 
    case regex(token) => println (token + " matches!")
    case t => println ("No match for token " + t)
 }
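If you are unsure how many binders a pattern needs, one option is to ask the underlying java.util.regex matcher for its group count. This is only a sketch: NonCapturingDemo, groupCount and firstGroup are names invented here, while Regex.pattern and Matcher.groupCount are the standard library calls.

```scala
import scala.util.matching.Regex

object NonCapturingDemo {
  val twoGroups: Regex = """"([^"]*)"(.*$)""".r    // original pattern
  val oneGroup: Regex  = """"([^"]*)"(?:.*$)""".r  // second group made non-capturing

  // Hypothetical helper: number of capturing groups the pattern declares,
  // which is also the number of binders a `case r(...)` pattern must have.
  def groupCount(r: Regex): Int = r.pattern.matcher("").groupCount

  // With a single capturing group, a single binder is enough.
  def firstGroup(token: String): Option[String] = token match {
    case oneGroup(inner) => Some(inner)
    case _               => None
  }

  def main(args: Array[String]): Unit = {
    println(groupCount(twoGroups))
    println(groupCount(oneGroup))
    println(firstGroup("\"test\""))
  }
}
```

The non-capturing group still takes part in matching, so text after the closing quote is accepted; it just no longer shows up in the extracted list.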

Have a look at the tutorial on Extractor Objects for a better understanding of how apply / unapply / unapplySeq work.
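As a rough illustration of the mechanism, here is a hand-written extractor that behaves similarly to the two-group regex without using a regex at all. Quoted and QuotedDemo are hypothetical names, and the parsing is deliberately simplistic; the point is only that the number of binders in the pattern has to equal the length of the sequence unapplySeq returns.

```scala
object Quoted {
  // Hypothetical extractor: for a string starting with a double quote and
  // containing a closing quote, yields the quoted text and whatever follows it.
  def unapplySeq(s: String): Option[Seq[String]] =
    if (s.length >= 2 && s.head == '"') {
      val close = s.indexOf('"', 1)
      if (close > 0) Some(Seq(s.substring(1, close), s.substring(close + 1)))
      else None
    } else None
}

object QuotedDemo {
  // Two binders: matches, because unapplySeq returns a two-element sequence.
  def twoBinders(token: String): String = token match {
    case Quoted(inner, _) => inner + " matches!"
    case _                => "No match for token " + token
  }

  // One binder: never matches, for the same reason as in the question.
  def oneBinder(token: String): String = token match {
    case Quoted(inner) => inner + " matches!"
    case _             => "No match for token " + token
  }

  def main(args: Array[String]): Unit = {
    println(twoBinders("\"test\""))
    println(oneBinder("\"test\""))
  }
}
```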

Faiz answered Sep 27 '22 21:09