
haskell - invalid code page byte sequence

readFile "file.html"
"start of the file... *** Exception: file.html: hGetContents: invalid argument (invalid code page byte sequence)

It's a UTF-8 file created with Notepad++. How can I read the file in Haskell?

Karoly Horvath asked Oct 15 '12 20:10


2 Answers

According to this site, your 6 bytes decode as follows:

EF BB BF -> ZERO WIDTH NO-BREAK SPACE (i.e. the BOM, although it's not needed in UTF-8)
C4 8D    -> LATIN SMALL LETTER C WITH CARON (what you said)
0D       -> CARRIAGE RETURN (CR)

So it's a legal UTF-8 sequence.
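To make the decoding concrete, here is a small sketch that strips the BOM and decodes the two-byte sequence by hand. The names `bom`, `stripBom` and `decodeTwoByte` are made up for illustration, and the decoder handles only two-byte sequences, not UTF-8 in general.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- The UTF-8 encoding of U+FEFF, used as a byte order mark.
bom :: [Word8]
bom = [0xEF, 0xBB, 0xBF]

-- Drop a leading BOM if there is one (hypothetical helper).
stripBom :: [Word8] -> [Word8]
stripBom bytes = fromMaybe bytes (stripPrefix bom bytes)

-- Decode one two-byte UTF-8 sequence of the form 110xxxxx 10yyyyyy.
-- Illustration only: other sequence lengths are not handled.
decodeTwoByte :: Word8 -> Word8 -> Maybe Char
decodeTwoByte b1 b2
  | b1 >= 0xC2 && b1 <= 0xDF && b2 .&. 0xC0 == 0x80 =
      -- Five payload bits from the lead byte, six from the continuation byte.
      Just (chr (((fromIntegral b1 .&. 0x1F) `shiftL` 6) .|. (fromIntegral b2 .&. 0x3F)))
  | otherwise = Nothing
```

For the bytes in the question, `stripBom` leaves `C4 8D 0D`, and `decodeTwoByte 0xC4 0x8D` gives `Just '\269'`, i.e. U+010D, the c with caron.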

However, the standard Prelude functions originally handled only ASCII. I don't know what they do now, but see the question "How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?" for some more ideas. Then use http://hackage.haskell.org/package/utf8-string instead of the Prelude functions.
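As a rough sketch of what current GHC does: text-mode handles pick up a default "locale encoding", which can be inspected and overridden through `GHC.IO.Encoding`. The helper names `readFileAsUtf8`, `dropBom` and `showDefaultEncoding` below are hypothetical. Note that plain `utf8` does not skip a BOM, so it is dropped by hand here.

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)

-- Print the encoding GHC currently uses for newly opened text handles.
showDefaultEncoding :: IO ()
showDefaultEncoding = getLocaleEncoding >>= print

-- Remove a leading U+FEFF, which is how a UTF-8 BOM decodes under utf8.
dropBom :: String -> String
dropBom ('\xFEFF' : rest) = rest
dropBom s = s

-- Read a file as UTF-8 regardless of the system locale.
-- Caveat: this changes the default for every handle opened afterwards
-- in this process, not just for this one file.
readFileAsUtf8 :: FilePath -> IO String
readFileAsUtf8 path = do
  setLocaleEncoding utf8
  contents <- readFile path
  return (dropBom contents)
```

Because the override is global to the process, it suits small scripts better than library code.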

Paul Johnson answered Nov 20 '22 16:11


By default, files are read using the encoding of the system locale, so if a file uses a different encoding, you need to set the encoding of the file handle yourself.

import System.IO

foo = do
    handle <- openFile "file.html" ReadMode
    hSetEncoding handle utf8_bom
    contents <- hGetContents handle
    doSomethingWith contents  -- placeholder for your own processing
    hClose handle

should get you started. Note that this contains no error handling, so a better way would be

import Control.Exception (bracket)
import System.IO

foo = bracket
        (openFile "file.html" ReadMode >>= \h -> hSetEncoding h utf8_bom >> return h)
        hClose
        (\h -> hGetContents h >>= doSomethingWith)

or

foo = withFile "file.html" ReadMode $
        \h -> do hSetEncoding h utf8_bom
                 contents <- hGetContents h
                 doSomethingWith contents
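One thing to watch with the handle-based versions: `hGetContents` is lazy, so the contents must be consumed before the handle is closed. Below is a minimal sketch of a reusable helper that forces the whole string first. The names `readFileUtf8Bom` and `writeSample` are hypothetical, and `writeSample` exists only to produce a test file with the six bytes from the question.

```haskell
import System.IO

-- Read a whole file as UTF-8, skipping a leading BOM if present.
readFileUtf8Bom :: FilePath -> IO String
readFileUtf8Bom path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8_bom
    contents <- hGetContents h
    -- Force the entire string while the handle is still open.
    length contents `seq` return contents

-- Write BOM, U+010D and CR as raw bytes. In binary mode each Char
-- below 256 is written as a single byte.
writeSample :: FilePath -> IO ()
writeSample path =
  withBinaryFile path WriteMode $ \h ->
    hPutStr h (map toEnum [0xEF, 0xBB, 0xBF, 0xC4, 0x8D, 0x0D])
```

Forcing with `length` is fine for files that fit in memory; for large files, do the processing inside the `withFile` callback instead.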
Daniel Fischer answered Nov 20 '22 15:11