 

Is there a Rust library with a UTF-16 string type? (intended for writing a JavaScript interpreter)

For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of u16), because:

  1. I need to address 16-bit code units individually (this is a bad idea in general, but JavaScript requires it). This means I need the type to implement Index<usize>.

  2. I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of u16 that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 for storing unpaired surrogates in UTF-8, but I don't want to use something like that. (A short demonstration of both requirements follows this list.)
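Both requirements can already be demonstrated with a plain Vec<u16> (a minimal sketch, just for illustration):

    fn main() {
        // A JavaScript-style string: arbitrary u16 code units,
        // here "Hi" followed by a lone (unpaired) surrogate.
        let js_string: Vec<u16> = vec![0x0048, 0x0069, 0xD800];

        // Requirement 1: Vec<u16> implements Index<usize>, so individual
        // code units are directly addressable.
        assert_eq!(js_string[2], 0xD800);

        // Requirement 2: the lone surrogate makes this malformed UTF-16
        // (String::from_utf16 rejects it), yet Vec<u16> stores it fine.
        assert!(String::from_utf16(&js_string).is_err());
    }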

I want to have the usual owned / borrowed types (like String / str and CString / CStr) with all or most of the usual methods. I don't want to roll my own string type (if I can avoid it).

Also, my strings will always be immutable, behind an Rc, and referred to from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have Rc<Utf16Str> as the string type, where Utf16Str is the unsized string type (which could be defined as just struct Utf16Str([u16])). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an Rc with an unsized type.
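A minimal sketch of that interning scheme, assuming Rc<Vec<u16>> as the string type (the Interner type and its API below are made up for illustration; a real interner would also want to purge dead entries from the table):

    use std::collections::HashMap;
    use std::rc::{Rc, Weak};

    /// Maps string contents to weak pointers, so a string is
    /// deallocated once no interpreter value holds it.
    /// Interner::default() creates an empty interner.
    #[derive(Default)]
    struct Interner {
        table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
    }

    impl Interner {
        fn intern(&mut self, units: &[u16]) -> Rc<Vec<u16>> {
            // Vec<u16> borrows as [u16], so lookup by slice works.
            if let Some(rc) = self.table.get(units).and_then(Weak::upgrade) {
                return rc;
            }
            let rc = Rc::new(units.to_vec());
            self.table.insert(units.to_vec(), Rc::downgrade(&rc));
            rc
        }
    }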

Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of u8.

Also, I'm not sure the standard library can help me here. I looked into Utf16Units, and it's just an iterator, not a proper string type. (I also know OsString doesn't help: I'm not on Windows, and it doesn't even implement Index<usize>.)

darque asked Jul 28 '15


1 Answer

Since there are multiple questions here I’ll try to respond separately:


I think the types you want are [u16] and Vec<u16>.

The default string types str and String are wrappers around [u8] and Vec<u8> (not technically true of str, which is a primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8.
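That invariant in action (a small sketch):

    fn main() {
        // str::from_utf8 only yields a &str for well-formed UTF-8 bytes.
        assert!(std::str::from_utf8(&[0x48, 0x69]).is_ok());  // "Hi"
        assert!(std::str::from_utf8(&[0xFF, 0xFE]).is_err()); // not UTF-8
    }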

Similarly, you could have Utf16Str and Utf16String types wrapping [u16] and Vec<u16> that preserve the invariant of being well-formed UTF-16, namely that there is no unpaired surrogate.
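For illustration, such a wrapper might look like this (the Utf16String name and its API are hypothetical; the validity check uses std::char::decode_utf16 from the standard library):

    /// Hypothetical owned UTF-16 string upholding the
    /// "no unpaired surrogate" invariant.
    pub struct Utf16String(Vec<u16>);

    impl Utf16String {
        /// Accepts the code units only if they are well-formed UTF-16,
        /// handing them back otherwise.
        pub fn new(units: Vec<u16>) -> Result<Utf16String, Vec<u16>> {
            let well_formed = std::char::decode_utf16(units.iter().cloned())
                .all(|r| r.is_ok());
            if well_formed {
                Ok(Utf16String(units))
            } else {
                Err(units)
            }
        }
    }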

But as you note in your question, JavaScript strings can contain unpaired surrogates. That’s because JavaScript strings are not strictly UTF-16: they really are arbitrary sequences of u16 with no additional invariant.

With no invariant to maintain, I don’t think wrapper types are all that useful.


rust-encoding supports UTF-16-LE and UTF-16-BE based on bytes. You probably want UTF-16 based on u16s instead.

std::str::Utf16Units is indeed not a string type. It is the iterator returned by the str::utf16_units() method, which converts a Rust string to UTF-16 (not LE or BE). You can call .collect() on that iterator to get a Vec<u16>, for example.
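For example (note that utf16_units() was the unstable name at the time; it has since been stabilized as str::encode_utf16()):

    fn main() {
        // Collect a Rust string’s UTF-16 code units into a Vec<u16>.
        let units: Vec<u16> = "héllo 𝄞".encode_utf16().collect();
        // '𝄞' (U+1D11E) is outside the BMP, so it becomes a surrogate pair.
        assert_eq!(units.len(), 8);
    }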


The only safe way to obtain Rc<[u16]> is to coerce from an Rc<[u16; N]> whose size is known at compile time, which is obviously impractical. I wouldn’t recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of RcBox, and transmuting.
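A sketch of that safe coercion (note that later Rust versions added From<Vec<T>> for Rc<[T]>, which removes the impracticality):

    use std::rc::Rc;

    fn main() {
        // The array's size is part of its type, known at compile time,
        // so the unsized coercion to Rc<[u16]> is safe.
        let fixed: Rc<[u16; 2]> = Rc::new([0x0048, 0x0069]);
        let slice: Rc<[u16]> = fixed;
        assert_eq!(slice.len(), 2);

        // On later Rust (1.21+), From<Vec<T>> makes this practical:
        let from_vec: Rc<[u16]> = Rc::from(vec![0x0048, 0x0069]);
        assert_eq!(slice, from_vec);
    }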

If you’re going to do it with raw memory allocation, it's better to use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs

Or, if you’re willing to accept the cost of the extra indirection, Rc<Vec<u16>> is safe and much easier.
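For instance:

    use std::rc::Rc;

    fn main() {
        // One extra pointer hop (Rc -> Vec -> buffer), but no unsafe code,
        // and unpaired surrogates are stored without complaint.
        let s: Rc<Vec<u16>> = Rc::new(vec![0xD800]);
        let shared = s.clone(); // cheap: only bumps the reference count
        assert_eq!(shared[0], 0xD800);
    }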

Simon Sapin answered Nov 05 '22