
Data.Text vs Data.ByteString.Char8


Can anyone explain the pros and cons of using the Data.Text and Data.ByteString.Char8 data types? Does working with ASCII-only text change these pros and cons? Do their lazy variants change the story as well?

Thomas Eding asked Jan 18 '12 19:01





1 Answer

Data.ByteString.Char8 provides functions to treat ByteString values as sequences of 8-bit characters (ASCII, with any Char above 255 truncated to its low 8 bits), while Data.Text is an independent type supporting the entirety of Unicode.

ByteString and Text are essentially the same as far as representation goes: strict, unboxed arrays with lazy variants based on lists of strict chunks. The main difference is that ByteString stores octets (i.e. Word8s), while Text stores Chars, encoded in UTF-16 (UTF-8 as of text-2.0).
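To make the octets-versus-characters distinction concrete, here is a minimal sketch, assuming the bytestring and text packages are available. The names `truncated` and `encoded` are illustrative only. It packs the same non-ASCII character (U+03BB, written as the escape `\955`) through Data.ByteString.Char8 and through Data.Text followed by an explicit UTF-8 encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8.pack keeps only the low 8 bits of each Char:
-- U+03BB (decimal 955) becomes the single byte 0xBB.
truncated :: B.ByteString
truncated = BC.pack "\955"

-- Text holds the character itself; encoding to bytes is an explicit step.
encoded :: B.ByteString
encoded = TE.encodeUtf8 (T.pack "\955")

main :: IO ()
main = do
  print (B.unpack truncated)        -- [187]
  print (B.unpack encoded)          -- [206,187]
  print (T.length (T.pack "\955"))  -- 1 character
  print (B.length encoded)          -- 2 bytes
```

The Char8 version silently loses information, which is why it is only safe when every character is known to fit in one byte.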

If you're working with ASCII-only text, then using Data.ByteString.Char8 will probably be faster than Text and use less memory (the memory difference mainly applies to UTF-16-based versions of text, before 2.0); however, you should ask yourself whether you're really sure that you're only ever going to work with ASCII. In the vast majority of cases, using Data.ByteString.Char8 over Text is a speed hack: octets aren't characters, and using the correct type should generally take priority over raw speed. You should usually only consider it if you've profiled the program and the string handling is a bottleneck. Text is well-optimised, and the difference will probably be negligible in most cases.

Of course, there are non-speed-related situations in which Data.ByteString.Char8 is warranted. Consider a file containing data that is essentially binary, not text, but separated into lines; using lines is completely reasonable. Additionally, it's entirely conceivable that an integer might be encoded in ASCII decimal in the context of a binary format; using readInt would make perfect sense in that case.
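As a rough illustration of the "almost-binary" case, the sketch below assumes a hypothetical line-oriented format in which each record starts with an ASCII decimal integer followed by arbitrary bytes. The function name `parseLengths` and the sample input are invented for the example; `lines` and `readInt` are the Data.ByteString.Char8 functions mentioned above.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Maybe (mapMaybe)

-- Split the input on '\n' and read the leading decimal integer of each
-- line, skipping lines that do not start with one. The rest of each
-- line is treated as opaque bytes and never decoded as text.
parseLengths :: BC.ByteString -> [Int]
parseLengths = mapMaybe (fmap fst . BC.readInt) . BC.lines

main :: IO ()
main = print (parseLengths (BC.pack "12 abc\n7 \255\254\nnope\n"))
-- prints [12,7]
```

Here the payload bytes 255 and 254 are not valid text in any sense that matters, so keeping the data as a ByteString is the honest choice.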

So, basically:

  1. Data.ByteString.Char8: For pure ASCII situations where performance is paramount, and to handle "almost-binary" data that has some ASCII components.
  2. Data.Text: Text, including any situation where there's the slightest possibility of something other than ASCII being used.
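One common way to combine the two types is to keep raw input as a ByteString and convert to Text at the boundary where it becomes text. The sketch below assumes the input is meant to be UTF-8; `decodeText` is a hypothetical helper name, while `decodeUtf8'` is the checked decoder from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode bytes as UTF-8, reporting failure instead of throwing.
decodeText :: B.ByteString -> Either String T.Text
decodeText bytes =
  case TE.decodeUtf8' bytes of
    Left err -> Left (show err)  -- invalid UTF-8: keep the error message
    Right t  -> Right t

main :: IO ()
main = do
  print (decodeText (B.pack [206, 187]))  -- valid UTF-8 for U+03BB
  print (decodeText (B.pack [255]))       -- 0xFF is never valid UTF-8
```

With this split, functions that manipulate characters take Text and functions that move bytes take ByteString, so a decoding mistake shows up as a type error or a Left value and not as corrupted output.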
ehird answered Sep 26 '22 03:09