Although I went through a number of Haskell encoding questions, I was not able to solve the following problem:
I want to read many different text files; the character encoding of the files is probably not consistent, and every readFile function I tried throws an exception on some of the files.
I tried to condense the problem; the following situation captures its core.
import Prelude hiding (writeFile, readFile)
import qualified Text.Pandoc.UTF8 as UTF (readFile, writeFile, putStr, putStrLn)
import qualified Prelude as Prel (writeFile, readFile)
import Data.ByteString.Lazy (ByteString, writeFile, readFile)
and in ghci I get the following results:
*Main> Prel.readFile "Test/A.txt"
*** Exception: Test/A.txt: hGetContents: invalid argument (invalid byte sequence) "\226\8364
*Main> Prel.readFile "Test/C.txt"
"\8230\n"
*Main> UTF.readFile "Test/A.txt"
"\8221\n"
*Main> UTF.readFile "Test/C.txt"
*** Exception: Cannot decode byte '\x85':
Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
Maybe the following information helps:
getLocaleEncoding
yields CP1252
*Main> readFile "Test/A.txt"
"\226\128\157\r\n"
*Main> readFile "Test/C.txt"
"\133\r\n"
My question is: how can I catch, deal with, or avoid these character-encoding errors? The point is that I do not know the encoding of the text files in advance; I need one readFile method that works for all of them. If that is not possible and an exception is thrown, I want the exception to be caught and my program to proceed, so that it can try another readFile function or simply skip that text file and go on to the next.
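The catching part of the question can be sketched with `try` from Control.Exception plus the pure decoders in the text package. `readFileLenient` is a name invented for this sketch, and the Latin-1 fallback is one possible policy (Latin-1 maps every byte to a character, so it never fails, though CP1252 punctuation bytes such as \x85 come out as control characters under it):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')
import Control.Exception (IOException, try)

-- Read raw bytes, then try UTF-8 first; if that fails, fall back to
-- Latin-1, which assigns a character to every byte and so cannot fail.
-- I/O errors (missing file, permissions, ...) land in the Left case
-- instead of being thrown.
readFileLenient :: FilePath -> IO (Either IOException T.Text)
readFileLenient path = do
  result <- try (BS.readFile path)
  pure $ case result of
    Left e      -> Left e
    Right bytes -> Right $
      either (const (decodeLatin1 bytes)) id (decodeUtf8' bytes)
```

Because the function reads bytes rather than text, no locale-dependent decoding happens at the handle level, which is what makes `Prel.readFile` blow up in the transcript above.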
For all the reasons the other answer mentions, this isn't easy. But all is not lost. Use charsetdetect (apparently based on a Mozilla algorithm) to detect the encoding of each bytestring. Then pass the detected encoding to text-icu or encoding for decoding. Detection won't work for the weirdest and most recondite text encodings, but it should work for the rest.
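A sketch of how the two packages might fit together. The names used here, `Codec.Text.Detect.detectEncoding` from charsetdetect and `encodingFromStringExplicit` / `decodeLazyByteString` from encoding, are assumptions about those packages' APIs and should be checked against their Hackage docs, as should whether the detector's charset names match the names the encoding package accepts:

```haskell
import qualified Data.ByteString.Lazy as BL
import Codec.Text.Detect (detectEncoding)                    -- package: charsetdetect
import Data.Encoding (decodeLazyByteString)                  -- package: encoding
import Data.Encoding (encodingFromStringExplicit)

-- Detect the charset of a file's raw bytes, then decode with the
-- detected encoding.  Nothing means detection failed or the detected
-- name is unknown to the encoding package; a caller could skip the
-- file in that case, as the question asks.
readFileDetected :: FilePath -> IO (Maybe String)
readFileDetected path = do
  bytes <- BL.readFile path
  pure $ do
    name <- detectEncoding bytes
    enc  <- encodingFromStringExplicit name
    pure (decodeLazyByteString enc bytes)
```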
What you want is impossible, for the following reason:
There are many, many 8-bit encodings in which all or most of the possible 8-bit patterns are assigned to some character. There is simply no way to find out which encoding a file uses. You absolutely need to know beforehand what kind of text is encoded: some Russian or Greek text, perhaps? Or just German, where most characters fall within the 7-bit ASCII plane and only occasionally an ä or a ß appears.
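The ambiguity is easy to demonstrate with the decoders from the text package: the very same bytes are one character under one encoding and two under another, and nothing in the bytes themselves says which reading was intended.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- Two bytes that are a single 'ä' under UTF-8 but the two
-- characters 'Ã' and '¤' under Latin-1.  Both decodings succeed,
-- so no error signals which interpretation is the right one.
ambiguous :: BS.ByteString
ambiguous = BS.pack [0xC3, 0xA4]

asUtf8 :: Either String T.Text
asUtf8 = either (Left . show) Right (decodeUtf8' ambiguous)  -- Right "ä"

asLatin1 :: T.Text
asLatin1 = decodeLatin1 ambiguous                            -- "Ã¤"
```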
For this reason, smart people invented Unicode and UTF-8, and all you need to do is to say: from today on, I will encode all my text in UTF-8.
Let's make the people who stick to 40-year-old proprietary encodings a minority, and even giants like Microsoft will be forced to give up their bad habits!