
`Data.Text` vs `Data.Vector.Unboxed Char`

Tags:

haskell

Is there any difference in how Data.Text and Data.Vector.Unboxed Char work internally? Why would I choose one over the other?

I always thought it was cool that Haskell defines String as [Char]. Is there a reason that something analogous wasn't done for Text and Vector Char?

There certainly would be an advantage to making them the same.... Text-y and Vector-y tools could be written to be used in both camps. Imagine Ropes of Ints, or Regexes on strings of poker cards.

Of course, I understand that there were probably historical reasons and I understand that most current libraries use Data.Text, not Vector Char, so there are many practical reasons to favor one over the other. But I am more interested in learning about the abstract qualities, not the current state that we happen to be in.... If the whole thing were rewritten tomorrow, would it be better to unify the two?

Edit, with more info-

To put stuff into perspective-

  1. According to this page, http://www.haskell.org/haskellwiki/GHC/Memory_Footprint, GHC uses 2 words (16 bytes on a 64-bit machine) for each boxed Char in your program!

  2. Data.Text is not O(1) indexable; indexing is O(n).

  3. Ropes (binary trees wrapped around text) can also hold strings.... They have better complexity for index/insert/delete, although depending on the number of nodes and balance of the tree, index could be close to that of Text.

This is my takeaway from this-

  1. Text and Vector Char are different internally....

  2. Use String if you don't care about performance.

  3. If performance is important, default to using Text.

  4. If fast indexing of chars is necessary, and you don't mind extra memory overhead (up to 16 bytes per Char when boxed; an unboxed Vector Char stores 4 bytes per Char), use Vector Char.

  5. If you want to insert/delete a lot of data, use Ropes.
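To make the indexing trade-off in the list above concrete, here is a minimal sketch that assumes the `text` and `vector` packages. The helper names (`textToVector`, `vectorToText`, `nthInText`, `nthInVector`) are made up for illustration and are not part of either library.

```haskell
import qualified Data.Text as T
import qualified Data.Vector.Unboxed as VU

-- Hypothetical helpers, not library functions.

-- Copy a Text into an unboxed vector, one Char (code point) per slot.
textToVector :: T.Text -> VU.Vector Char
textToVector = VU.fromList . T.unpack

-- Go the other way; the same caveats as T.pack apply.
vectorToText :: VU.Vector Char -> T.Text
vectorToText = T.pack . VU.toList

-- T.index has to walk the variable-width encoding from the start: O(n).
nthInText :: T.Text -> Int -> Char
nthInText = T.index

-- (VU.!) is a direct array lookup: O(1).
nthInVector :: VU.Vector Char -> Int -> Char
nthInVector = (VU.!)
```

Both lookups are partial and fail on an out-of-range index. The conversion itself is O(n), so it probably only pays off when many random lookups follow it.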

jamshidh asked Dec 19 '13 20:12


2 Answers

It's a fairly bad idea to think of Text as being a list of characters. Text is designed to be thought of as an opaque, user-readable blob of Unicode text. Character boundaries might be defined based on encoding, locale, language, time of month, phase of the moon, coin flips performed by a blinded participant, and migratory patterns of Venezuela's national bird whatever it may be. The same story happens with sorting, up-casing, reversing, etc.

Which is a long way of saying that Text is an abstract type representing human language and goes far out of its way to not behave just the same way as its implementation, be it a ByteString, a Vector UTF16CodePoint, or something totally unique (which is the case).

To clarify this distinction, note that there's no guarantee that unpack . pack witnesses an isomorphism, that the preferred ways of converting between Text and ByteString are in Data.Text.Encoding and that decoding is partial, and that there's a whole sophisticated add-on package, text-icu, littered with complex ways of handling human language strings.
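A small sketch of the first two points, assuming the `text` and `bytestring` packages. The names `loneSurrogate`, `roundTrips` and `decodesCleanly` are hypothetical and only serve the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A lone surrogate is a legal Haskell Char but not a Unicode scalar value.
loneSurrogate :: String
loneSurrogate = "\xD800"

-- pack substitutes U+FFFD for such values, so unpack . pack is not always
-- the identity on String.
roundTrips :: String -> Bool
roundTrips s = T.unpack (T.pack s) == s

-- decodeUtf8' reports malformed input as a Left instead of throwing.
decodesCleanly :: B.ByteString -> Bool
decodesCleanly bytes = either (const False) (const True) (TE.decodeUtf8' bytes)
```

With these definitions, `roundTrips "hello"` holds while `roundTrips loneSurrogate` does not, and `decodesCleanly (B.pack [0xFF])` is `False` because 0xFF can never appear in valid UTF-8.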

You absolutely should use Text if you're dealing with a human language string. You should also handle it carefully, since human language strings are not easily amenable to computer processing. If your string is better thought of as a machine string, you probably should use ByteString.

The pedagogical advantages of type String = [Char] are high, but the practical advantages are quite low.

J. Abrahamson answered Oct 27 '22 18:10


To add to what J. Abrahamson said, it's also worth distinguishing between iterating over runes (roughly character by character, though a rune could also be an ideogram) and iterating over individual Unicode code points. Sometimes you need to know whether the code point you're looking at has been "decorated" by a neighbouring code point.

In the latter case, you then have to distinguish between code points that stand alone (such as letters and ideograms) and those that modify neighbouring text (combining diacritics, which attach to the code point before them; right-to-left code points, which affect what follows; etc.).

Well-implemented Unicode libraries will typically abstract these details away and let you process the text in a more or less character-by-character fashion, but you have to drop certain assumptions that come from thinking in terms of ASCII.

A byte is not a character. A logical unit of text isn't necessarily a "character". Not every code point stands alone; some decorate or annotate a neighbouring code point, or even the rest of the stream until invalidated (right-to-left).
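As a rough illustration, assuming the `text` and `bytestring` packages, the same visible "é" can be one code point or two, and a different number of bytes each time. The names `decomposed`, `precomposed`, `codePoints` and `utf8Bytes` are invented for this sketch.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: T.Text
decomposed = T.pack "e\x0301"

-- The same visible character as the single code point U+00E9.
precomposed :: T.Text
precomposed = T.pack "\x00E9"

-- T.length counts code points, not user-perceived characters.
codePoints :: T.Text -> Int
codePoints = T.length

-- Size of the UTF-8 encoding in bytes.
utf8Bytes :: T.Text -> Int
utf8Bytes = B.length . TE.encodeUtf8
```

Here `codePoints decomposed` is 2 and `utf8Bytes decomposed` is 3, while `codePoints precomposed` is 1 and `utf8Bytes precomposed` is 2. The two values also compare unequal with `==` even though they render identically; treating them as the same text needs normalization, which, as far as I know, is the kind of thing text-icu provides.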

Unicode is hard. There is no one true encoding that will eliminate the difficulty of encapsulating the variety inherent in human language. Data.Text does a respectable job of it though.

To summarize:

The methods of processing are:

  • byte-by-byte - invalid for Unicode in general; only applicable to single-byte encodings such as Latin-1/ASCII
  • code point by code point - works for processing Unicode, but is lower-level than people realize
  • logical rune-by-rune - what you actually want

The types are:

  • String (aka [Char]) - has a limited scope. Best used for teaching Haskell or for legacy use-cases.

  • Text - the preferred way to handle "human" text.

  • ByteString - for byte streams, raw data, binary etc.
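One way to read the split between the types, sketched under the assumption that the bytes are meant to be UTF-8: keep ByteString at the edges of the program, decode once, and do the text work on Text. The names `fromWire`, `toWire` and `shout` are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Decode at the boundary; invalid sequences become U+FFFD instead of
-- raising an exception.
fromWire :: B.ByteString -> T.Text
fromWire = TE.decodeUtf8With lenientDecode

-- Encoding Text as UTF-8 cannot fail.
toWire :: T.Text -> B.ByteString
toWire = TE.encodeUtf8

-- Text-level work happens on Text, never on the raw bytes.
shout :: B.ByteString -> B.ByteString
shout = toWire . T.toUpper . fromWire
```

Whether lenient decoding is the right policy depends on the application; `decodeUtf8'` is the stricter choice when bad input should be rejected rather than patched.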

bitemyapp answered Oct 27 '22 19:10