ByteStrings, Text, and encoding in Haskell

I wish to read input text using the IO functionality of Data.Text. My quandary has to do with encoding discovery. That is, if I do not know the encoding of the text beforehand, how is the IO in Data.Text of any use in situations where the encoding of the text being read differs from the system locale setting? Is there an encoding discovery mechanism somewhere in Data.Text?

I know I might get a bunch of responses that say "use Data.ByteString", but wasn't Data.Text created for the purpose of getting away from the use of Data.ByteString for reading text?

Also, if I must use Data.ByteString, does anyone know what happens when octets 0x80 to 0x9f are read? Are they read in as expected like the rest of the input? They are undefined in ISO-8859-1, and Data.ByteString's IO seems to indicate that input is treated as if the source is ISO-8859-1.

Mike Menzel asked Dec 21 '13 07:12


2 Answers

You’ll want to use ByteString to read the raw bytes, and then a function from Data.Text.Encoding to decode them and handle any encoding errors, for example:

decodeUtf8' :: ByteString -> Either UnicodeException Text

There is no predefined mechanism in text for guessing the encoding, but you can try decoding with several candidate encodings in turn, or use ICU’s character set detection facilities. Unfortunately, at the time of writing that functionality was not exposed by text-icu, so you would need to write the foreign import yourself.
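As a rough illustration of "try to decode multiple times", here is a minimal sketch. The names decodeGuess and readTextGuess are hypothetical, and the choice of Latin-1 as the fallback is an assumption for the example, not a detection mechanism: Latin-1 maps every byte to a code point, so it never fails, but it may well be the wrong guess.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE

-- | Try strict UTF-8 first. If the bytes are not valid UTF-8, fall back to
-- Latin-1 (an assumed fallback; it accepts any byte sequence).
decodeGuess :: B.ByteString -> Text
decodeGuess bytes =
  case TE.decodeUtf8' bytes of
    Right txt -> txt
    Left _    -> TE.decodeLatin1 bytes

-- | Read a file as raw bytes, then decode with the guess above.
readTextGuess :: FilePath -> IO Text
readTextGuess path = decodeGuess <$> B.readFile path
```

A longer chain of candidates would follow the same pattern, with the strictest decoders tried first.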

Jon Purdy answered Nov 20 '22 00:11


If you don't know the encoding in advance, I think using Data.ByteString and reading in binary mode is exactly the right thing to do. You should get the input data exactly as bytes, including octets 0x80 to 0x9f.
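To make that concrete, here is a small sketch (readOctets and countHighControls are hypothetical helper names). Data.ByteString does no decoding, so every octet arrives as a plain Word8, and values in the 0x80 to 0x9f range are treated like any other.

```haskell
import qualified Data.ByteString as B
import           Data.Word (Word8)

-- | Read a whole file as raw octets. B.readFile opens the file in binary
-- mode, so no locale-dependent decoding or newline translation happens.
readOctets :: FilePath -> IO [Word8]
readOctets path = B.unpack <$> B.readFile path

-- | Count the octets in the 0x80..0x9F range, to show they are ordinary
-- values as far as ByteString is concerned.
countHighControls :: B.ByteString -> Int
countHighControls = B.length . B.filter (\w -> w >= 0x80 && w <= 0x9F)
```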

Data.Text is the right way to represent something with a known encoding, or rather in decoded form, but if you can't do the decoding on read then I don't think it makes sense to use it at that point.

If your code can later learn or guess the encoding, that's the right time to make the switch to Data.Text.
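One way to structure that switch, as a sketch only: keep the input as a ByteString and decode once the encoding is known. The KnownEncoding type and decodeWith function are hypothetical names, and the two encodings listed are just examples.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE
import           Data.Text.Encoding.Error (UnicodeException)

-- | Hypothetical set of encodings the program might learn about later,
-- for example from a header or a user option.
data KnownEncoding = Utf8 | Latin1
  deriving (Eq, Show)

-- | Convert the raw bytes to Text once the encoding has been learned.
-- UTF-8 decoding can fail; Latin-1 decoding accepts every byte.
decodeWith :: KnownEncoding -> B.ByteString -> Either UnicodeException Text
decodeWith Utf8   = TE.decodeUtf8'
decodeWith Latin1 = Right . TE.decodeLatin1
```

Until decodeWith is called, the rest of the program only ever sees bytes, so nothing has to assume an encoding prematurely.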

GS - Apologise to Monica answered Nov 19 '22 23:11