For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of `u16`), because:

- I need to address 16-bit code units individually (this is a bad idea in general, but JavaScript requires it). This means the type needs to implement `Index<usize>`.
- I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of `u16` that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 for storing unpaired surrogates in UTF-8, but I don't want to use something like this.
- I want to have the usual owned/borrowed types (like `String`/`str` and `CString`/`CStr`) with all or most of the usual methods. I don't want to roll my own string type if I can avoid it.
Also, my strings will always be immutable, behind an `Rc`, and referenced from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have `Rc<Utf16Str>` as the string type, where `Utf16Str` is the unsized string type (which can be defined as just `struct Utf16Str([u16])`). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an `Rc` with an unsized type.
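For concreteness, the interning layer I have in mind looks roughly like this sketch (using `Rc<Vec<u16>>` as a placeholder string type; the `Interner` name and its methods are only illustrative):

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

/// Illustrative interner: owns only weak pointers, so a string's
/// allocation is freed once the last strong Rc outside the table is dropped.
struct Interner {
    table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
}

impl Interner {
    fn new() -> Self {
        Interner { table: HashMap::new() }
    }

    fn intern(&mut self, units: Vec<u16>) -> Rc<Vec<u16>> {
        // Reuse an existing allocation if it is still alive.
        if let Some(weak) = self.table.get(&units) {
            if let Some(strong) = weak.upgrade() {
                return strong;
            }
        }
        // Otherwise allocate a new shared string and remember it weakly.
        let strong = Rc::new(units.clone());
        self.table.insert(units, Rc::downgrade(&strong));
        strong
    }
}
```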
Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of `u8`.
Also, I'm not sure whether the standard library can help me here at all. I looked into `Utf16Units`, and it's just an iterator, not a proper string type. (I also know `OsString` doesn't help: I'm not on Windows, and it doesn't even implement `Index<usize>`.)
Since there are multiple questions here I’ll try to respond separately:
I think the types you want are `[u16]` and `Vec<u16>`.
The default string types `str` and `String` are wrappers around `[u8]` and `Vec<u8>` (not technically true of `str`, which is a primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8.
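As a quick illustration of that invariant, the conversions from raw bytes are checked at the boundary (a generic sketch, nothing project-specific):

```rust
fn main() {
    let bytes = vec![0x68, 0x69];              // the bytes of "hi"
    let s = String::from_utf8(bytes).unwrap(); // checks UTF-8 well-formedness
    assert_eq!(s, "hi");

    // An invalid byte sequence is rejected rather than stored.
    assert!(String::from_utf8(vec![0xFF]).is_err());

    // Borrowed view: &str over the same UTF-8 bytes.
    let view: &str = s.as_str();
    assert_eq!(view.as_bytes(), &[0x68, 0x69]);
}
```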
Similarly, you could have `Utf16Str` and `Utf16String` types wrapping `[u16]` and `Vec<u16>` that preserve a well-formed-UTF-16 invariant, namely that there is no unpaired surrogate.
But as you note in your question, JavaScript strings can contain unpaired surrogates. That's because JavaScript strings are not strictly UTF-16; they really are arbitrary sequences of `u16` with no additional invariant.

With no invariant to maintain, I don't think wrapper types are all that useful.
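In other words, a plain `Vec<u16>` already gives you indexing, slicing, and the ability to hold lone surrogates; a rough sketch:

```rust
fn main() {
    // "a" followed by a lone high surrogate: a legal JS string value,
    // but not well-formed UTF-16.
    let js_string: Vec<u16> = vec![0x0061, 0xD800];

    // Index<usize> comes for free on vectors and slices.
    assert_eq!(js_string[1], 0xD800);

    // The borrowed form is just &[u16].
    let borrowed: &[u16] = &js_string;
    assert_eq!(borrowed.len(), 2);

    // Lossy conversion back to a Rust String replaces the lone
    // surrogate with U+FFFD.
    assert_eq!(String::from_utf16_lossy(borrowed), "a\u{FFFD}");
}
```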
rust-encoding supports UTF-16-LE and UTF-16-BE based on bytes. You probably want UTF-16 based on `u16`s instead.
`std::str::Utf16Units` is indeed not a string type. It is the iterator returned by the `str::utf16_units()` method, which converts a Rust string to UTF-16 (neither LE nor BE). You can call `.collect()` on that iterator to get a `Vec<u16>`, for example.
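For instance (note: in current stable Rust the same conversion is exposed as `str::encode_utf16`, which is what this sketch assumes):

```rust
fn main() {
    let s = "héllo";

    // Convert the UTF-8 &str into a vector of native-endian UTF-16
    // code units (u16 values, no byte-order concerns).
    let units: Vec<u16> = s.encode_utf16().collect();

    assert_eq!(units[0], 'h' as u16);
    assert_eq!(units.len(), 5); // each of these chars fits in one code unit

    // Round-trip back to a String.
    assert_eq!(String::from_utf16(&units).unwrap(), s);
}
```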
The only safe way to obtain `Rc<[u16]>` is to coerce from `Rc<[u16; N]>`, whose size is known at compile time, which is obviously impractical. I wouldn't recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of `RcBox`, and transmuting.
If you're going to do it with raw memory allocation, it's better to use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs
Or, if you're willing to take the cost of the extra indirection, `Rc<Vec<u16>>` is safe and much easier.
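A sketch of the safe options (plus, as an aside, the direct `Vec<T>` → `Rc<[T]>` conversion that later Rust versions added, which sidesteps the problem entirely):

```rust
use std::rc::Rc;

fn main() {
    // Coercing from a fixed-size array: safe, but the length must be
    // known at compile time, which is impractical for real strings.
    let fixed: Rc<[u16]> = Rc::new([0x0061u16, 0xD800]);
    assert_eq!(fixed[1], 0xD800);

    // The easy safe route, at the cost of one extra indirection.
    let boxed: Rc<Vec<u16>> = Rc::new(vec![0x0061, 0xD800]);
    assert_eq!(boxed[1], 0xD800);

    // Later Rust (around 1.21) also provides From<Vec<T>> for Rc<[T]>,
    // which removes the need for either workaround.
    let direct: Rc<[u16]> = Rc::from(vec![0x0061u16, 0xD800]);
    assert_eq!(direct.len(), 2);
}
```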