readFile "file.html"
"start of the file... *** Exception: file.html: hGetContents: invalid argument (invalid code page byte sequence)
It's a UTF-8 file created with Notepad++. How can I read the file in Haskell?
According to this site, your 6 bytes decode as follows:
EF BB BF -> ZERO WIDTH NO-BREAK SPACE (i.e. the BOM, although it's not needed in UTF-8)
C4 8D -> LATIN SMALL LETTER C WITH CARON (what you said)
0D -> CARRIAGE RETURN (CR)
So it's a legal UTF-8 sequence.
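The decoding above can be checked from GHC itself. The following is a minimal sketch using only base; `sampleBytes`, `writeSample`, and `readWith` are hypothetical helper names, not part of any library. It writes the six bytes to a file in binary mode and reads them back as text with a chosen encoding.

```haskell
import Control.Exception (evaluate)
import System.IO

-- The six bytes from the question: BOM, U+010D, CR.
sampleBytes :: String
sampleBytes = map toEnum [0xEF, 0xBB, 0xBF, 0xC4, 0x8D, 0x0D]

-- A binary-mode handle writes each Char below 256 as one raw byte.
writeSample :: FilePath -> IO ()
writeSample path = withBinaryFile path WriteMode $ \h -> hPutStr h sampleBytes

-- Read the whole file as text, decoding with the given encoding.
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path = withFile path ReadMode $ \h -> do
  hSetEncoding h enc
  contents <- hGetContents h
  -- Force the lazy string before withFile closes the handle.
  _ <- evaluate (length contents)
  return contents
```

Reading the sample with `utf8` should return the BOM as the character U+FEFF in front of the text, while `utf8_bom` drops it.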
However, the standard Prelude functions originally just handled ASCII. I don't know what they do now, but see the question "How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?" for some more ideas. Then use http://hackage.haskell.org/package/utf8-string instead of the Prelude functions.
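On current GHC, text handles default to the encoding of the system locale, which is why `readFile` can fail on a UTF-8 file under a non-UTF-8 code page. Here is a minimal sketch, using only base, of inspecting that default and of changing it process-wide instead of pulling in an extra package. `currentDefault` and `readFileUtf8` are hypothetical names, and `setLocaleEncoding` affects every text handle opened afterwards, which may not suit a larger program.

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding)
import System.IO

-- Name of the encoding that newly opened text handles default to.
currentDefault :: IO String
currentDefault = show <$> getLocaleEncoding

-- Make UTF-8 (with an optional BOM) the process-wide default, then read.
-- Assumption: changing the global default is acceptable for the program.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  setLocaleEncoding utf8_bom
  readFile path
```

Calling `currentDefault` before and after `readFileUtf8` shows whether the default actually changed on a given system.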
By default, files are read using the encoding of the system locale, so if a file uses a different encoding, you need to set the encoding of the file handle yourself.
    import System.IO

    foo = do
        handle <- openFile "file.html" ReadMode
        hSetEncoding handle utf8_bom
        contents <- hGetContents handle
        -- hGetContents is lazy: consume contents before closing the handle
        doSomethingWith contents
        hClose handle
should get you started. Note that this contains no error handling; a better way would thus be
    import Control.Exception (bracket)
    import System.IO

    foo = bracket
        (openFile "file.html" ReadMode >>= \h -> hSetEncoding h utf8_bom >> return h)
        hClose
        (\h -> hGetContents h >>= doSomethingWith)
or
    foo = withFile "file.html" ReadMode $ \h -> do
        hSetEncoding h utf8_bom
        contents <- hGetContents h
        doSomethingWith contents
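Because `hGetContents` is lazy, a decoding failure like the one in the question only surfaces when the string is consumed. The sketch below assumes it is acceptable to hold the whole file in memory. It forces the contents inside `withFile` and returns any I/O or decoding failure as a value. `readUtf8Checked` is a hypothetical name.

```haskell
import Control.Exception (IOException, evaluate, try)
import System.IO

-- Read a UTF-8 file (BOM optional) strictly; report failures as Left.
readUtf8Checked :: FilePath -> IO (Either IOException String)
readUtf8Checked path = try $ withFile path ReadMode $ \h -> do
  hSetEncoding h utf8_bom
  contents <- hGetContents h
  -- Walk the whole string so decoding errors are raised here,
  -- while the handle is still open and try can catch them.
  _ <- evaluate (length contents)
  return contents
```

A caller can then pattern match on the result, for example printing the `IOException` on `Left` and processing the text on `Right`.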