I wish to read input text using the IO functionality of Data.Text. My quandary has to do with encoding discovery. That is, if I do not know the encoding of the text beforehand, how is the IO in Data.Text of any use in situations where the encoding of the text being read differs from the system locale setting? Is there an encoding discovery mechanism somewhere in Data.Text?
I know I might get a bunch of responses that say "use Data.ByteString", but wasn't Data.Text created for the purpose of getting away from the use of Data.ByteString for reading text?
Also, if I must use Data.ByteString, does anyone know what happens when octets 0x80 to 0x9f are read? Are they read in as expected, like the rest of the input? They are undefined in ISO-8859-1, and Data.ByteString's IO seems to indicate that input is treated as if the source were ISO-8859-1.
You’ll want to use ByteString for reading the raw bytes, and then a function such as:
decodeUtf8' :: ByteString -> Either UnicodeException Text
from Data.Text.Encoding to actually decode the raw data and handle any encoding errors. There is no predefined mechanism in text for guessing the encoding, but you can try decoding with several candidate encodings in turn, or use ICU’s character set detection facilities. Unfortunately, at the time of writing that functionality is not available in text-icu, so you’ll need to import it yourself.
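The "try to decode multiple times" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name decodeGuess and the choice of Latin-1 as the fallback are mine, not part of the text library. Because every byte sequence is valid Latin-1, that fallback never fails, so it only makes sense as a last resort.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import           Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Hypothetical helper: try strict UTF-8 first; if the bytes are not
-- valid UTF-8, fall back to Latin-1 (an assumed fallback, which accepts
-- any byte sequence and therefore cannot fail).
decodeGuess :: B.ByteString -> T.Text
decodeGuess bytes =
  case decodeUtf8' bytes of
    Right t -> t                   -- bytes were valid UTF-8
    Left _  -> decodeLatin1 bytes  -- last-resort interpretation

main :: IO ()
main = do
  -- Read raw bytes from stdin; no locale-dependent decoding happens here.
  bytes <- B.getContents
  -- Output is encoded using the current locale of stdout.
  TIO.putStr (decodeGuess bytes)
```

A real application would probably want a longer list of candidate encodings, or a proper detector, since many legacy encodings cannot be told apart by decode failure alone.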
If you don't know the encoding in advance, I think using Data.ByteString and reading in binary mode is exactly the right thing to do. You should get the input data exactly as bytes, including octets 0x80 to 0x9f.
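As a quick check of that claim, here is a small sketch that writes octets 0x80 to 0x9f to a file and reads them back. The file name c1-octets.bin is just a hypothetical scratch path. It also shows that Data.ByteString.Char8 simply maps each byte to the Char with the same code point, which is presumably where the ISO-8859-1 impression comes from.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import           Data.Char (ord)

-- | The 32 octets that ISO-8859-1 leaves undefined.
c1Octets :: B.ByteString
c1Octets = B.pack [0x80 .. 0x9f]

main :: IO ()
main = do
  let path = "c1-octets.bin"  -- hypothetical scratch file
  B.writeFile path c1Octets
  back <- B.readFile path
  -- The bytes come back unchanged; nothing is decoded or rejected.
  print (back == c1Octets)
  -- The Char8 view maps byte n to the Char with code point n.
  print (map ord (BC.unpack back) == [0x80 .. 0x9f])
```

Both checks should print True, since ByteString IO never interprets the bytes it reads.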
Data.Text is the right way to represent text whose encoding is known, or rather text that has already been decoded, but if you can't do the decoding on read then I don't think it makes sense to use it at that point.
If your code can later learn or guess the encoding appropriately, that's the right time to make the switch.