
Surrogate Pair Detection Fails

I'm working on a small side project in F# that involves porting existing C# code, and I seem to have come across a difference in how regular expressions are handled between the two languages. (I'm posting this hoping to find out that I'm just doing something wrong.)

This minor function simply detects surrogate pairs using the regular expression trick outlined here. Here's the current implementation:

open System.Text.RegularExpressions
let isSurrogatePair input =
    Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")

If I then execute it against a known surrogate pair like this:

let result = isSurrogatePair "𠮷野𠮷"
printfn "%b" result

I get false in the FSI window.

If I use the equivalent C#:

public bool IsSurrogatePair(string input)
{
    return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
}

and the same input value, I (correctly) get true back.

Is this a genuine issue, or am I simply doing something wrong in my F# implementation?

Sven Grosen asked Mar 31 '15 02:03


2 Answers

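There appears to be a bug in how F# encodes escaped Unicode characters in string literals.
Here's the output from F# Interactive. Note the last two results: the escapes in the surrogate range come back as 65533 (U+FFFD, the replacement character) instead of 55296 and 55552: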

> "\uD500".[0] |> uint16 ;;
val it : uint16 = 54528us
> "\uD700".[0] |> uint16 ;;
val it : uint16 = 55040us
> "\uD800".[0] |> uint16 ;;
val it : uint16 = 65533us
> "\uD900".[0] |> uint16 ;;
val it : uint16 = 65533us

Fortunately, this workaround works:

> let s = new System.String( [| char 0xD800 |] )
s.[0] |> uint16
;;

val s : System.String = "�"
val it : uint16 = 55296us

Based on that finding, I can construct a corrected (or rather, worked-around) version of isSurrogatePair:

let isSurrogatePair input =
  let chrToStr code = new System.String( [| char code |] )
  let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
  Regex.IsMatch(input, regex)

This version correctly returns true for your input.

I have just filed this issue on GitHub: https://github.com/Microsoft/visualfsharp/issues/338

Fyodor Soikin answered Sep 19 '22 02:09


Seems that this is a legitimate F# bug, no argument there. Just wanted to suggest some alternative workarounds.


Don't embed the problem characters in the string itself; let the regex engine interpret the escapes instead. In a regex pattern, \uXXXX matches the UTF-16 code unit XXXX, so just escape your backslashes or use a verbatim string so that the backslashes reach the regex engine untouched:

Regex.IsMatch(input, "[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]")
// or
Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")

Use the regex engine's built-in support for Unicode blocks:

// high surrogate followed by low surrogate
Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")

or for Unicode general categories (Cs is the surrogate category):

// 2 chars, each of which is half of a surrogate pair
// (can give a false positive if both are, e.g., low surrogates)
Regex.IsMatch(input, @"\p{Cs}{2}")
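
To make the difference between the three patterns above concrete, here is a minimal sketch that runs each of them against a valid pair and against two low surrogates in a row. The helper ofCodes and the value names are made up for illustration; the test strings are built from char codes so that no surrogate escape appears in an F# string literal.

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: build a string from UTF-16 code units without
// putting surrogate escapes into an F# string literal.
let ofCodes (codes: int list) =
    System.String(codes |> List.map char |> List.toArray)

let validPair = ofCodes [ 0xD842; 0xDFB7 ] // high + low surrogate (U+20BB7)
let twoLows = ofCodes [ 0xDC00; 0xDC00 ]   // two low surrogates, not a valid pair

// Verbatim strings: the regex engine, not the F# compiler, reads \uXXXX and \p{...}.
let rangePattern = @"[\uD800-\uDBFF][\uDC00-\uDFFF]"
let blockPattern = @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}"
let categoryPattern = @"\p{Cs}{2}"

let patterns =
    [ "range", rangePattern
      "block", blockPattern
      "category", categoryPattern ]

for (name, pattern) in patterns do
    let onPair = Regex.IsMatch(validPair, pattern)
    let onLows = Regex.IsMatch(twoLows, pattern)
    printfn "%s: validPair=%b twoLows=%b" name onPair onLows
```

The range and block patterns should match only the valid pair, while the category pattern should match both strings, which is the false positive mentioned above. If you need strict pair detection, prefer one of the first two.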

latkin answered Sep 20 '22 02:09