
Swift regex matching fails when source contains Unicode characters

I'm trying to do a simple regex match using NSRegularExpression, but I'm having some problems matching the string when the source contains multibyte characters:

let string = "D 9"

// The following matches (any characters)(SPACE)(numbers)(any characters)
let pattern = "([\\s\\S]*) ([0-9]*)(.*)"

let slen : Int = string.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)

var error: NSError? = nil

var regex = NSRegularExpression(pattern: pattern, options: NSRegularExpressionOptions.DotMatchesLineSeparators, error: &error)

var result = regex?.stringByReplacingMatchesInString(string, options: nil, range: NSRange(location:0,
length:slen), withTemplate: "First \"$1\" Second: \"$2\"")

The code above returns "D" and "9", as expected.

If I now change the first line to include a UK 'Pound' currency symbol as follows:

let string = "£ 9"

Then the match doesn't work, even though the ([\\s\\S]*) part of the expression should still match any leading characters.

I understand that the £ symbol will take two bytes, but the leading wildcard match should ignore those, shouldn't it?

Can anyone explain what is going on here please?

asked Apr 20 '15 19:04 by NEIL STRONG



1 Answer

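This can be confusing. The first parameter of stringByReplacingMatchesInString() is bridged from NSString in Objective-C to String in Swift, but the range: parameter is still an NSRange. You therefore have to specify the range in the units used by NSString, which are UTF-16 code units, not UTF-8 bytes. "£ 9" is four bytes in UTF-8 but only three UTF-16 code units, so a length taken from lengthOfBytesUsingEncoding(NSUTF8StringEncoding) runs past the end of the string; "D 9" only worked because the two counts coincide for ASCII text. Use the NSString length instead: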

var result = regex?.stringByReplacingMatchesInString(string,
        options: nil,
        range: NSRange(location:0, length:(string as NSString).length),
        withTemplate: "First \"$1\" Second: \"$2\"")

Alternatively, you can use count(string.utf16) instead of (string as NSString).length.
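
To make the unit mismatch concrete, here is a small sketch that compares the counts for the two strings from the question. It assumes a current Swift toolchain (Swift 5 with Foundation), where count(string.utf16) is spelled string.utf16.count; the variable names are illustrative only.

```swift
import Foundation

let ascii = "D 9"
let pound = "£ 9"

// UTF-8 byte counts: "£" (U+00A3) is encoded as two bytes.
print(ascii.utf8.count)            // 3
print(pound.utf8.count)            // 4

// UTF-16 code unit counts, the unit that NSRange is measured in.
print(ascii.utf16.count)           // 3
print(pound.utf16.count)           // 3

// NSString.length reports the same UTF-16 count.
print((pound as NSString).length)  // 3
```

For "£ 9" the byte count is one larger than the UTF-16 count, which is why a range built from the byte length does not fit the string.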

Full example:

let string = "£ 9"

let pattern = "([\\s\\S]*) ([0-9]*)(.*)"
var error: NSError? = nil
let regex = NSRegularExpression(pattern: pattern,
        options: NSRegularExpressionOptions.DotMatchesLineSeparators,
        error: &error)!

let result = regex.stringByReplacingMatchesInString(string,
    options: nil,
    range: NSRange(location:0, length:(string as NSString).length),
    withTemplate: "First \"$1\" Second: \"$2\"")
println(result)
// First "£" Second: "9"
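
The code above targets Swift 1.x. As a rough sketch of the same fix in current Swift (assuming Swift 5 with Foundation; splitAmount is a hypothetical helper name and not part of the original answer), NSRange(_:in:) can derive the UTF-16 range directly from the string:

```swift
import Foundation

// Hypothetical helper: applies the pattern from the question and
// returns the string produced by the replacement template.
func splitAmount(_ string: String) throws -> String {
    let pattern = "([\\s\\S]*) ([0-9]*)(.*)"
    let regex = try NSRegularExpression(pattern: pattern,
                                        options: [.dotMatchesLineSeparators])

    // NSRange(_:in:) converts a range of String.Index values into
    // UTF-16 offsets, so no manual length calculation is needed.
    let fullRange = NSRange(string.startIndex..<string.endIndex, in: string)

    return regex.stringByReplacingMatches(in: string,
                                          options: [],
                                          range: fullRange,
                                          withTemplate: "First \"$1\" Second: \"$2\"")
}

if let output = try? splitAmount("£ 9") {
    print(output)
    // First "£" Second: "9"
}
```

Building the range from the string itself avoids choosing the wrong length unit in the first place.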

answered Sep 22 '22 00:09 by Martin R