Do you know the fastest way to encode and decode UTF8 with some extra information? Here are the interesting cases that occur to me:
1. I just want to encode an opaque buffer with no validation so I can decode it again later. The fastest approach would be to take the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
2. I guess that 99% of the time my UTF8 is actually ASCII, so it makes sense to do a first pass confirming this and only do further processing if it turns out not to be true (see the sketch after this list).
3. The converse of the previous case.
4. A single key in JSON or a database that I guess will be 1 to 20 characters. It would be silly to pay some upfront cost like a vectorized SIMD approach.
5. An HTML document. It's worth paying some upfront cost for the highest throughput.
6. There are more variants along the same lines, like encoding JSON or a URL when you think there are probably no escape characters.
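For case 2, something like this is what I have in mind; isAscii here is just my own sketch, not a library function:
import qualified Data.ByteString as B

-- Every ASCII byte is below 0x80, so an all-ASCII buffer is already
-- valid UTF8 and needs no further transcoding work.
isAscii :: B.ByteString -> Bool
isAscii = B.all (< 0x80)
If the check succeeds, decoding can be a cheap copy; otherwise fall through to a full UTF8 decoder.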
I'm asking this question under the [Haskell] tag since Haskell's strong typing makes some techniques that would be easy in, say, C, hard to implement. Also, there may be special GHC tricks, like using SSE4 instructions on an Intel platform, that would be interesting. But this is a UTF8 issue in general, and good ideas would be helpful in any language.
After some research I propose to implement encode and decode for serialization purposes like so:
import Data.ByteString (ByteString)
import Data.Text (Text)
import Unsafe.Coerce (unsafeCoerce)

myEncode :: Text -> ByteString
myEncode = unsafeCoerce
myDecode :: ByteString -> Text
myDecode = unsafeCoerce
This is a great idea if you enjoy segfaults: unsafeCoerce does not convert anything. Text and ByteString have different heap representations (and before text-2.0, Text stored UTF-16 internally, not UTF8), so reinterpreting one as the other is undefined behavior.
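For contrast, here is roughly what the safe version looks like, using the text package's own Data.Text.Encoding functions:
import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8, decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (UnicodeException, lenientDecode)

-- Encoding can never fail: every Text is valid Unicode.
safeEncode :: Text -> ByteString
safeEncode = encodeUtf8

-- Total decoding: invalid UTF8 is reported in the Left case.
safeDecode :: ByteString -> Either UnicodeException Text
safeDecode = decodeUtf8'

-- Lenient decoding: invalid bytes are replaced with U+FFFD.
safeDecodeLenient :: ByteString -> Text
safeDecodeLenient = decodeUtf8With lenientDecode
encodeUtf8 never fails because every Text is valid Unicode; decodeUtf8' makes the failure case explicit instead of throwing.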
This question touches on a sprawling range of issues. I'm going to interpret it as "In Haskell, how should I convert between Unicode and other character encodings?"
In Haskell, the recommended way to convert to and from Unicode is with the functions in text-icu, which provides some basic functions:
fromUnicode :: Converter -> Text -> ByteString
toUnicode :: Converter -> ByteString -> Text
text-icu is a binding to the International Components for Unicode libraries, which do the heavy lifting for, among other things, encoding and decoding to non-Unicode character sets. Its website gives documentation on conversion in general and some specific information on how its converter implementations operate. Note that different character sets require somewhat different converter implementations.
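A minimal usage sketch, assuming text-icu's open from Data.Text.ICU.Convert (which takes a converter name and an optional fallback flag) and "ISO-8859-1" as an example converter name:
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as ICU

-- Round-trip a Text through Latin-1 and back.
latin1RoundTrip :: IO ()
latin1RoundTrip = do
  conv <- ICU.open "ISO-8859-1" Nothing   -- Nothing: default fallback behavior
  let bytes = ICU.fromUnicode conv (T.pack "héllo")
  print (ICU.toUnicode conv bytes)
Since open is in IO and will fail on an unknown converter name, production code should guard that call.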
ICU can also attempt to automatically detect the character set of an input. "This is, at best, an imprecise operation using statistics and heuristics." No other implementation could "fix" that characteristic. The Haskell bindings do not expose that functionality as of this writing; see #8.
I don't know of any character set conversion procedures written in native Haskell. As the ICU documentation indicates, there is a lot of complexity; after all, this is a rich area of international computing history.
As the ICU FAQ laconically notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specifically about conversions, I'd expect it to be broadly the case here as well. Is your experience otherwise?
unsafeCoerce is not appropriate here.