Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Issue with surrogate unicode characters in F#

I'm working with strings, which could contain surrogate unicode characters (non-BMP, 4 bytes per character).

When I use "\Uxxxxxxxxv" format to specify surrogate character in F# - for some characters it gives different result than in the case of C#. For example:

C#:

string s = "\U0001D11E";
bool c = Char.IsSurrogate(s, 0);
Console.WriteLine(String.Format("Length: {0}, is surrogate: {1}", s.Length, c));

Gives: Length: 2, is surrogate: True

F#:

let s = "\U0001D11E"
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c

Gives: Length: 2, is surrogate: false

Note: Some surrogate characters works in F# ("\U0010011", "\U00100011"), but some of them doesn't work.

Q: Is this is bug in F#? How can I handle allowed surrogate unicode characters in strings with F# (Does F# has different format, or only the way is to use Char.ConvertFromUtf32 0x1D11E)

Update:
s.ToCharArray() gives for F# [| 0xD800; 0xDF41 |]; for C# { 0xD834, 0xDD1E }

like image 718
Vitaliy Avatar asked Apr 12 '12 12:04

Vitaliy


1 Answers

This is a known bug in the F# compiler that shipped with VS2010 (and SP1); the fix appears in the VS11 bits, so if you have the VS11 Beta and use the F# 3.0 compiler, you'll see this behave as expected.

(If the other answers/comments here don't provide you with a suitable workaround in the meantime, let me know.)

like image 148
Brian Avatar answered Sep 30 '22 14:09

Brian