I'm working on a small side project that involves porting existing C# code to F#, and I seem to have come across a difference in how the two languages handle regular expressions. (I'm posting this hoping to find out that I'm just doing something wrong.)
This small function simply detects surrogate pairs using the regular expression trick outlined here. Here's the current implementation:
open System.Text.RegularExpressions

let isSurrogatePair input =
    Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")
If I then execute it against a known surrogate pair like this:
let result = isSurrogatePair "𠮷野𠮷"
printfn "%b" result
I get false in the FSI window.
If I use the equivalent C#:
public bool IsSurrogatePair(string input)
{
    return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
}
With the same input value, I (correctly) get true back.
Is this a genuine bug, or am I simply doing something wrong in my F# implementation?
There appears to be a bug in how F# encodes escaped Unicode characters.
Here's the output from F# Interactive (note the last two results):
> "\uD500".[0] |> uint16 ;;
val it : uint16 = 54528us
> "\uD700".[0] |> uint16 ;;
val it : uint16 = 55040us
> "\uD800".[0] |> uint16 ;;
val it : uint16 = 65533us
> "\uD900".[0] |> uint16 ;;
val it : uint16 = 65533us
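Both surrogate escapes come back as 65533 (U+FFFD, the replacement character) instead of the expected 55296 and 55552. Fortunately, this workaround works: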
> let s = new System.String( [| char 0xD800 |] )
s.[0] |> uint16
;;
val s : System.String = "�"
val it : uint16 = 55296us
Based on that finding, I can construct a corrected (or, rather, worked-around) version of isSurrogatePair:
let isSurrogatePair input =
    let chrToStr code = new System.String( [| char code |] )
    let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
    Regex.IsMatch(input, regex)
This version correctly returns true for your input.
I have just filed this issue on GitHub: https://github.com/Microsoft/visualfsharp/issues/338
This seems to be a legitimate F# bug, no argument there. I just wanted to suggest some alternative workarounds.
Don't embed the problem characters in the string itself; specify them using the regex engine's normal Unicode support instead. The regex pattern to match Unicode code point XXXX is \uXXXX, so just escape your backslashes or use a verbatim string:
Regex.IsMatch(input, "[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]")
// or
Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")
Use the built-in regex support for Unicode blocks:
// high surrogate followed by low surrogate
Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")
or Unicode general categories:
// 2 characters, each of which is half of a surrogate pair
// (maybe could give false-positive if both are, e.g. low-surrogates)
Regex.IsMatch(input, @"\p{Cs}{2}")
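If you would rather avoid regex and string escapes altogether, here is a minimal sketch built on System.Char.IsHighSurrogate and System.Char.IsLowSurrogate. The name containsSurrogatePair is my own (hypothetical), and I am assuming that "a high surrogate immediately followed by a low surrogate, anywhere in the string" is the behavior you want, which mirrors the first pattern above:

```fsharp
open System

// Hypothetical helper (not from the original code): returns true if any two
// adjacent UTF-16 code units form a high surrogate followed by a low surrogate.
let containsSurrogatePair (input: string) =
    input
    |> Seq.pairwise   // yields each adjacent (previous, next) pair of chars
    |> Seq.exists (fun (hi, lo) ->
        Char.IsHighSurrogate(hi) && Char.IsLowSurrogate(lo))

// Build a test string without typing any surrogate escapes in the source.
let sample = Char.ConvertFromUtf32 0x20BB7
printfn "%b" (containsSurrogatePair sample)
```

Since no surrogate code units appear in the source text, the escape-handling bug should not come into play here.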