ByteStrings, Text, and encoding in Haskell

Question

I want to read input text using the IO functions of Data.Text. My quandary concerns encoding discovery. If I don't know the encoding of the text beforehand, how is the IO in Data.Text of any use when the encoding of the text being read differs from the system locale setting? Is there an encoding discovery mechanism somewhere in Data.Text?

I know I might get a bunch of responses that say "use Data.ByteString", but wasn't Data.Text created for the purpose of getting away from the use of Data.ByteString for reading text?

Also, if I must use Data.ByteString, does anyone know what happens when octets 0x80 to 0x9f are read? Are they read in as expected like the rest of the input? They are undefined in ISO-8859-1, and Data.ByteString's IO seems to indicate that input is treated as if the source is ISO-8859-1.

Jon Purdy · Accepted Answer

You’ll want to use ByteString for reading bytes, and then a function such as:

decodeUtf8' :: ByteString -> Either UnicodeException Text

from Data.Text.Encoding to actually decode the raw data and handle any encoding errors. There is no predefined mechanism in text for guessing the encoding, but you can try decoding with several candidate encodings in turn, or use ICU’s character set detection facilities. Unfortunately, that functionality is not currently available in text-icu, so you’ll need to import it yourself.
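As a rough illustration of the "try to decode multiple times" idea, here is a minimal sketch that attempts strict UTF-8 first and falls back to Latin-1 if that fails. The names GuessedEncoding, decodeGuess and readTextGuess are made up for this example and are not part of text. The choice of Latin-1 as the fallback is also just an assumption. It never fails, because every byte maps to a code point, so it may well produce the wrong characters for input that is really in some other encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Which decoder produced the result (hypothetical type for this sketch).
data GuessedEncoding = Utf8 | Latin1Fallback
  deriving (Eq, Show)

-- | Try strict UTF-8; on a decoding error, fall back to Latin-1.
-- The fallback always succeeds, so treat its result as a guess only.
decodeGuess :: B.ByteString -> (GuessedEncoding, T.Text)
decodeGuess bytes =
  case decodeUtf8' bytes of
    Right t -> (Utf8, t)
    Left _  -> (Latin1Fallback, decodeLatin1 bytes)

-- | Read the raw bytes of a file, then decode them with 'decodeGuess'.
readTextGuess :: FilePath -> IO (GuessedEncoding, T.Text)
readTextGuess path = decodeGuess <$> B.readFile path
```

Returning the guessed encoding alongside the Text lets the caller decide how much to trust the result, for instance by logging a warning whenever the fallback was used.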

GS - Apologise to Monica · Answer

If you don't know the encoding in advance, I think using Data.ByteString and reading in binary mode is exactly the right thing to do. You should get the input data exactly as bytes including octets 0x80 to 0x9f.
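To make the point about octets 0x80 to 0x9f concrete, here is a small sketch that reads a file with Data.ByteString and lists any bytes in that range. The helper names highControlBytes and reportHighControlBytes are invented for this example. B.readFile performs no decoding, so these octets arrive as ordinary Word8 values like any others.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Bytes in the range 0x80..0x9f found in the input, in original order.
highControlBytes :: B.ByteString -> [Word8]
highControlBytes = B.unpack . B.filter (\w -> w >= 0x80 && w <= 0x9f)

-- | Read a file as raw octets (no decoding) and report the 0x80..0x9f ones.
reportHighControlBytes :: FilePath -> IO [Word8]
reportHighControlBytes path = highControlBytes <$> B.readFile path
```

Whether those octets mean anything is a question for whichever decoder is applied later. The ByteString itself only stores them.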

Data.Text is the right way to represent something with a known encoding, or rather in decoded form, but if you can't do the decoding on read then I don't think it makes sense to use it at that point.

If your code can later learn or guess the encoding appropriately that's the right time to make the switch.

Tags:

io

character-encoding

haskell

Mike Menzel

2 Answers

Jon Purdy

GS - Apologise to Monica
