
ByteString assumes ISO-8859-1?

The documentation for Data.ByteString.hGetContents says

As with hGet, the string representation in the file is assumed to be ISO-8859-1.

Why should it have to "assume" anything about the "string representation in the file"? The data is not necessarily strings or encoded text at all. If I wanted something to deal with encoded text I'd use Data.Text or perhaps Data.ByteString.Char8. I thought the whole point of ByteString is that the data is handled as a list of 8-bit bytes, not as text characters. What is the impact of the assumption that it is ISO-8859-1?

Omari Norman asked Nov 05 '13 15:11



1 Answer

It's a roundabout way of saying the same thing: no decoding is performed. ISO-8859-1 maps every byte to the code point with the same value, so nothing needs to be done, and hGetContents gives you the raw bytes in the range 0x00-0xFF:

$ cat utf-8.txt
ÇÈÄ
$ iconv -f iso8859-1 iso8859-1.txt                         
ÇÈÄ
$ ghci
> import qualified Data.ByteString as BS
> import System.IO
> openFile "iso8859-1.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
[199,200,196,10]
> openFile "utf-8.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
[195,135,195,136,195,132,10]
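
The same point can be sketched as a small module. This is an illustration rather than part of the original answer: the names roundTrip, utf8Bytes, asLatin1 and asUtf8 are made up for this example, and it assumes the bytestring and text packages are available. It shows that bytes written to a file come back unchanged through hGetContents, and that an encoding only enters the picture when you choose to view those bytes as characters.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), openBinaryFile)

-- Hypothetical helper: write raw bytes to a file, then read them back
-- with hGetContents. No decoding happens in either direction.
roundTrip :: FilePath -> [Word8] -> IO [Word8]
roundTrip path ws = do
  BS.writeFile path (BS.pack ws)
  h <- openBinaryFile path ReadMode
  -- hGetContents reads to EOF and closes the handle itself.
  BS.unpack <$> BS.hGetContents h

-- The bytes of utf-8.txt from the session above ("ÇÈÄ" plus a newline).
utf8Bytes :: BS.ByteString
utf8Bytes = BS.pack [195, 135, 195, 136, 195, 132, 10]

-- Latin-1 view: each byte becomes the Char with the same code point,
-- so this yields 7 characters for the 7 bytes.
asLatin1 :: BS.ByteString -> String
asLatin1 = BC.unpack

-- Explicit UTF-8 decoding: the 7 bytes become 4 characters.
-- decodeUtf8 throws on invalid input; the bytes here are valid UTF-8.
asUtf8 :: BS.ByteString -> T.Text
asUtf8 = TE.decodeUtf8
```

In GHCi, map fromEnum (asLatin1 utf8Bytes) gives back [195,135,195,136,195,132,10], while map fromEnum (T.unpack (asUtf8 utf8Bytes)) gives [199,200,196,10], the same values as the bytes of iso8859-1.txt. If the data is not text at all, neither view is needed and the Word8 values can be used directly.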
Mikhail Glushenkov answered Sep 29 '22 23:09
