I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) something equally clever for converting UTF-8 to UTF-16.
Specifically I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.
The best UTF-8-to-codepoint code I've seen uses vectorizing techniques, so I wonder whether that's also possible here.
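For what it's worth, one shortcut seems to fall out of the lead bytes alone: only 4-byte sequences (lead bytes 0xF0-0xF4) encode codepoints at or above U+10000, which are exactly the ones that need surrogate pairs. Here is a rough, untested sketch of counting UTF-16 code units that way (the function name is just illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Count the UTF-16 code units a well-formed UTF-8 string will need,
     * without decoding any codepoints. Every non-continuation byte starts
     * a new codepoint (one unit), and every 0xF0-0xF4 lead byte marks a
     * codepoint >= U+10000, which costs one extra unit (a surrogate pair). */
    static size_t utf16_units_needed(const uint8_t *s, size_t len)
    {
        size_t units = 0;
        for (size_t i = 0; i < len; i++) {
            units += (s[i] & 0xC0) != 0x80;  /* lead byte or ASCII byte */
            units += s[i] >= 0xF0;           /* will need a surrogate pair */
        }
        return units;
    }

Since both tests are simple per-byte comparisons, this loop also looks like a natural candidate for the kind of vectorization mentioned above.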
Efficiency is always a speed-versus-size tradeoff. If speed is favored over size, then the most efficient way is simply to guess based on the length of the source string.
There are 4 cases that need to be considered; simply take the worst of them as the final buffer size:

- U+0000-U+007F: 1 UTF-8 byte becomes one 16-bit unit (2 bytes)
- U+0080-U+07FF: 2 UTF-8 bytes become one 16-bit unit (2 bytes)
- U+0800-U+FFFF: 3 UTF-8 bytes become one 16-bit unit (2 bytes)
- U+10000-U+10FFFF: 4 UTF-8 bytes become a surrogate pair (4 bytes)
The worst-case expansion factor occurs when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, bytewise, merely has to be twice as large as the source string. Every other Unicode codepoint takes the same number of bytes, or fewer, when encoded as UTF-16 rather than UTF-8.
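A minimal sketch of that sizing rule (the function name is mine, not from any library): since one 16-bit unit per source byte is the worst case, a UTF-8 string of n bytes never needs more than n UTF-16 code units.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Pessimistic allocation: a well-formed UTF-8 string of n bytes never
     * expands past n UTF-16 code units (2n bytes), since the 2x bytewise
     * growth of the ASCII range is the worst case. */
    static uint16_t *alloc_utf16_buffer(size_t utf8_len)
    {
        return malloc(utf8_len * sizeof(uint16_t));  /* may overallocate */
    }

The unused tail can be trimmed with realloc after the actual conversion if memory matters.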