
Character encoding issue in GHC

When I try and read a plaintext file from within my Haskell program I get:

[fromList *** Exception: /path/to/file/aaa.txt hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)

I googled and found that this problem is usually fixed by setting LANG to en_US.UTF-8. That is already how my locale is set.

Not sure if this is an issue with GHC at all.

I am on Ubuntu 11.10.

atlantis asked Mar 04 '26 02:03



1 Answer

Are you sure aaa.txt contains valid UTF-8? If it's binary data, you need to use withBinaryFile or similar. If it is text in another encoding, you should use hSetEncoding.

For instance, if your text is in Latin-1 then you would say

hSetEncoding h latin1

where "h" is your file handle. If you are reading from standard input then it's

hSetEncoding stdin latin1
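Putting this together for a file handle, here is a minimal sketch, assuming the file really is Latin-1. The helper name readLatin1File is made up for illustration, and "aaa.txt" stands in for the file from the question:

```haskell
import System.IO

-- Read a whole file, decoding its bytes as Latin-1 (ISO-8859-1).
-- readLatin1File is a hypothetical helper name, not a library function.
readLatin1File :: FilePath -> IO String
readLatin1File path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1            -- set the encoding before reading
    contents <- hGetContents h
    -- hGetContents is lazy, so force the whole string
    -- before withFile closes the handle.
    length contents `seq` return contents

main :: IO ()
main = do
  -- "aaa.txt" stands in for the file from the question.
  text <- readLatin1File "aaa.txt"
  putStr text
```

Since every byte is a valid Latin-1 character, this decode cannot fail, but it will produce mojibake if the file is actually in some other encoding.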

There is also a mkTextEncoding function which you can use if you have read the encoding from metadata, or wish to customise the handling of invalid Unicode (although this only works on some systems).
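As a rough sketch of the mkTextEncoding route: the base documentation describes iconv-style suffixes such as //IGNORE and //TRANSLIT on the encoding name, though whether they are honoured depends on your GHC version and platform, so treat that part as an assumption to check. The helper name readFileWith is hypothetical:

```haskell
import System.IO

-- readFileWith is a hypothetical helper: open a file and decode it with
-- an encoding looked up by name at run time.
readFileWith :: String -> FilePath -> IO String
readFileWith encName path = do
  enc <- mkTextEncoding encName      -- throws if the name is not recognised
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    contents <- hGetContents h
    -- Force the lazy string before withFile closes the handle.
    length contents `seq` return contents

main :: IO ()
main = do
  -- The //IGNORE suffix asks the decoder to skip invalid byte sequences
  -- (assumed to be supported on this system).
  text <- readFileWith "UTF-8//IGNORE" "aaa.txt"
  putStr text
```

Skipping invalid bytes silently loses data, so this is best kept for cases where a damaged read is better than no read at all.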

The Unicode standards say that a Unicode parser should reject invalid strings with an error, rather than trying to fix them up. This is a deliberate rejection of Postel's Law, on the grounds of reducing security holes and inconsistent interpretations.

(You might want to consider using the text library if you'll be working with a lot of text and having to handle encoding issues; it's usually a lot faster than using Strings, since it uses an unboxed array rather than a linked list, although this means that Text values and operations on them are necessarily strict. It also lets you configure how to respond to invalid Unicode more portably and flexibly.)
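With the text library, one way to do this is to read the raw bytes and decode them yourself, choosing the error policy explicitly. A minimal sketch, assuming the bytestring and text packages are installed; decodeLenient is a hypothetical helper name:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Decode raw bytes as UTF-8, substituting U+FFFD for invalid input
-- instead of throwing. decodeLenient is a hypothetical helper name.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

main :: IO ()
main = do
  bytes <- B.readFile "aaa.txt"      -- raw bytes, no decoding yet
  TIO.putStr (decodeLenient bytes)
```

If you would rather detect bad input than paper over it, decodeUtf8' from the same module returns an Either instead of substituting replacement characters.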

ehird answered Mar 05 '26 21:03



