 

Fast, optimized UTF-8 encode/decode

Do you know the fastest way to encode and decode UTF-8 with some extra information? Here are the interesting cases that occur to me:

Serialization

I just want to encode an opaque buffer with no validation so I can decode it again later. The fastest approach would be to take the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
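Absent such a coercion, a safe baseline for the serialization case is the stock encoder and decoder from Data.Text.Encoding. A minimal sketch; the lenient decoder is my assumption for trusted bytes (it replaces any invalid sequence with U+FFFD instead of throwing):

import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Encoding is total: every Text value has a valid UTF-8 form.
serialize :: Text -> ByteString
serialize = TE.encodeUtf8

-- For bytes we produced ourselves, decoding should never fail; the
-- lenient handler makes that assumption explicit without risking an
-- exception on corrupt input.
deserialize :: ByteString -> Text
deserialize = TE.decodeUtf8With TEE.lenientDecode

On text >= 2.0, where Text is UTF-8 internally, encodeUtf8 is essentially a copy of the underlying buffer, so the gap to a true zero-copy coercion is already small.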

Probably ASCII

I guess that 99% of the time my UTF-8 is actually ASCII, so it makes sense to do a first pass to confirm this and only do further processing if it turns out not to be true.
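A sketch of that two-pass idea using only the public bytestring and text APIs; the choice of decodeLatin1 for the fast path is my assumption, valid because ASCII is a subset of Latin-1 and decodeLatin1 performs no validation:

import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text.Encoding as TE

-- First pass: check that every byte is below 0x80.  If so, the input
-- is pure ASCII, which decodes identically under Latin-1 and UTF-8.
decodeProbablyAscii :: ByteString -> Text
decodeProbablyAscii bs
  | B.all (< 0x80) bs = TE.decodeLatin1 bs  -- validation-free fast path
  | otherwise         = TE.decodeUtf8 bs    -- full decode; throws on bad input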

Probably not ASCII

The converse of the previous case.

Probably short

A single key in JSON or a database that I guess will be 1 to 20 characters. It would be silly to pay some upfront cost, like a vectorized SIMD approach.

Probably long

An HTML document. It's worth paying some upfront cost for the highest throughput.
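One way to reconcile the short and long cases is to dispatch on length, so that tiny inputs never pay the vectorized setup cost. A sketch; the 20-byte cutoff and both path names are my assumptions, with the stock decoder standing in for the real scalar and SIMD implementations:

import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text.Encoding as TE

decodeByLength :: ByteString -> Text
decodeByLength bs
  | B.length bs <= 20 = decodeScalar bs  -- simple loop, no setup cost
  | otherwise         = decodeBulk bs    -- amortizes SIMD setup over length
  where
    -- Placeholders: both currently defer to the stock decoder.
    decodeScalar = TE.decodeUtf8
    decodeBulk   = TE.decodeUtf8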

There are some more variants that are similar, like encoding JSON or a URL when you think there are probably no escape characters.
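For the JSON variant, the same guess-and-verify shape works: scan once for bytes that would need escaping, and splice the buffer through untouched if none are found. A sketch; needsEscape, the byte set, and the elided slow path are all my assumptions:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- True if any byte requires escaping in a JSON string:
-- a double quote (0x22), a backslash (0x5C), or a control byte.
needsEscape :: ByteString -> Bool
needsEscape = B.any (\w -> w == 0x22 || w == 0x5C || w < 0x20)

encodeJsonString :: ByteString -> ByteString
encodeJsonString bs
  | needsEscape bs = escapeSlowly bs             -- rare slow path
  | otherwise      = B.concat ["\"", bs, "\""]   -- common zero-escape path
  where
    escapeSlowly = error "general escaping loop elided from this sketch"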

I'm asking this question under the [Haskell] tag since Haskell's strong typing makes some techniques that would be easy in, say, C, hard to implement. Also, there may be some special GHC tricks, like using SSE4 instructions on an Intel platform, that would be interesting. But this is really a UTF-8 question in general, and good ideas would be helpful in any language.

Update

After some research, I propose to implement encode and decode for serialization purposes like so:

import Data.ByteString (ByteString)
import Data.Text (Text)
import Unsafe.Coerce (unsafeCoerce)

myEncode :: Text -> ByteString
myEncode = unsafeCoerce

myDecode :: ByteString -> Text
myDecode = unsafeCoerce

This is a great idea if you enjoy segfaults ...

asked Sep 30 '22 by Michael Fox


1 Answer

This question implicates a sprawling range of issues. I'm going to interpret it as "In Haskell, how should I convert between Unicode and other character encodings?"

In Haskell, the recommended way to convert between Unicode and other character encodings is with the text-icu package, which provides two basic functions:

fromUnicode :: Converter -> Text -> ByteString
toUnicode :: Converter -> ByteString -> Text

text-icu is a binding to the International Components for Unicode libraries, which do the heavy lifting for, among other things, encoding and decoding to and from non-Unicode character sets. Its website provides documentation on conversion in general and specific information on how its converter implementations operate. Note that different character sets require somewhat different converter implementations.
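For concreteness, a minimal round trip through Data.Text.ICU.Convert might look like this (the converter name "ISO-8859-1" is just an example, and passing Nothing accepts ICU's default fallback behaviour for unmappable characters):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.ICU.Convert as ICU

main :: IO ()
main = do
  -- Converters are named by character set; opening one is an IO action.
  conv <- ICU.open "ISO-8859-1" Nothing
  let bytes = ICU.fromUnicode conv "café"  -- Text -> ByteString
      back  = ICU.toUnicode conv bytes     -- ByteString -> Text
  print (bytes, back)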

ICU can also attempt to detect the character set of an input automatically. "This is, at best, an imprecise operation using statistics and heuristics." No other implementation could "fix" that characteristic. The Haskell bindings do not expose that functionality as of this writing; see issue #8.

I don't know of any character set conversion procedures written in native Haskell. As the ICU documentation indicates, there is a lot of complexity; after all, this is a rich area of international computing history.

Performance

As the ICU FAQ laconically notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specifically about conversions, I'd expect it to be broadly the case here as well. Is your experience otherwise?

unsafeCoerce is not appropriate here. Text and ByteString are distinct types with different runtime representations (historically Text even stored UTF-16 code units internally, not UTF-8 bytes), so coercing one to the other misinterprets memory rather than converting it.

answered Oct 04 '22 by Christian Conkle