
Haskell character encoding problems with readFile

Although I have gone through a number of Haskell-encoding-problem questions, I was not able to solve the following problem:

I want to read many different text files; the character encoding of the files is probably not consistent, and every readFile function I have tried throws exceptions when reading some of the files.

I tried to condense the problem: the following situation sums up the core of it.

import Prelude hiding (writeFile, readFile)
import qualified Text.Pandoc.UTF8 as UTF (readFile, writeFile, putStr, putStrLn)
import qualified Prelude as Prel (writeFile, readFile)
import Data.ByteString.Lazy (ByteString, writeFile, readFile)

and in ghci I get the following results:

*Main> Prel.readFile "Test/A.txt"
*** Exception: Test/A.txt: hGetContents: invalid argument (invalid byte sequence) "\226\8364
*Main> Prel.readFile "Test/C.txt"
"\8230\n"

*Main> UTF.readFile "Test/A.txt"
"\8221\n"

*Main> UTF.readFile "Test/C.txt"
*** Exception: Cannot decode byte '\x85':      
Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream 

Maybe the following information helps:

  • getLocaleEncoding yields CP1252
  • The ByteString contents of the two "problematic" text files are:

*Main> readFile "Test/A.txt"
"\226\128\157\r\n"
*Main> readFile "Test/C.txt"
"\133\r\n"

My question is: how can I catch / deal with / avoid these character-encoding errors? The point is that I do not know the encoding of the text files in advance; I need one readFile method that works for all of them. If that is not possible and an exception is thrown, I want the exception to be caught so that my program can proceed, either trying another readFile function or simply skipping that text file and moving on to the next.
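To illustrate the "catch and fall back" part of the question, here is a minimal sketch using only base. The helper names (readWithEnc, readFileSafe) are made up for this example; it tries UTF-8 first and falls back to latin1, which accepts every byte sequence and therefore never fails to decode (though it may decode the "wrong" characters). Forcing the string with length makes lazy decoding errors surface inside try rather than later.

```haskell
import Control.Exception (SomeException, try)
import System.IO

-- Hypothetical helper: read a file with an explicit TextEncoding.
-- The string is forced before hClose so that decoding errors are
-- raised here, inside the caller's `try`, not lazily afterwards.
readWithEnc :: TextEncoding -> FilePath -> IO String
readWithEnc enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h
  return s

-- Try UTF-8 first; on any exception fall back to latin1, which maps
-- every byte to a character and so cannot fail. Returns Nothing only
-- if even the fallback read throws (e.g. the file does not exist).
readFileSafe :: FilePath -> IO (Maybe String)
readFileSafe path = do
  r <- try (readWithEnc utf8 path) :: IO (Either SomeException String)
  case r of
    Right s -> return (Just s)
    Left _  -> do
      r2 <- try (readWithEnc latin1 path) :: IO (Either SomeException String)
      return (either (const Nothing) Just r2)
```

Note that the latin1 fallback trades correctness for totality: every file becomes readable, but bytes from another 8-bit encoding will come out as the wrong characters.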

phynfo asked Jan 14 '16 22:01

2 Answers

For all the reasons the other answer mentions, this isn't easy. But all is not lost. Use charsetdetect – apparently based on a Mozilla algorithm – to detect the encoding of each bytestring. Then pass the detected encoding to text-icu or encoding for decoding. Detection won't work for the weirdest and most recondite text encodings, but it should work for the rest.
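A sketch of the detect-then-decode pipeline described above, combining the charsetdetect and text-icu packages. The function names used here (detectEncodingName, open, toUnicode) are those packages' APIs as I recall them; check the current Hackage documentation before relying on the exact signatures.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Codec.Text.Detect (detectEncodingName)   -- package: charsetdetect
import Data.Text (Text)
import Data.Text.ICU.Convert (open, toUnicode)  -- package: text-icu

-- Guess the encoding from the raw bytes, then hand the guessed
-- name to ICU for the actual decoding.
readFileDetect :: FilePath -> IO (Maybe Text)
readFileDetect path = do
  bytes <- B.readFile path
  case detectEncodingName (BL.fromStrict bytes) of
    Nothing   -> return Nothing          -- detection gave up
    Just name -> do
      conv <- open name Nothing          -- Nothing: ICU's default fallback handling
      return (Just (toUnicode conv bytes))
```

Since toUnicode substitutes rather than throws on undecodable bytes, this version never raises a decoding exception; a Nothing result only means the detector could not make a guess at all.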

hao answered Nov 19 '22 09:11


What you want is impossible, for the following reason:

There are many, many 8-bit encodings in which all or most possible 8-bit patterns are assigned to some character. There is simply no way to find out which encoding was used. You absolutely need to know beforehand what it is that is encoded: some Russian or Greek text, perhaps? Or just German, where most characters are in the 7-bit ASCII plane and only occasionally there is an ä or a ß?
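The ambiguity is easy to demonstrate: the same byte decodes successfully, but to different characters, under different 8-bit encodings. For example, the byte 0xE4 is 'ä' in Latin-1 but 'Д' in KOI8-R, and nothing in the bytes themselves says which reading was intended. A small sketch using base's handle API (the set of names accepted by mkTextEncoding depends on the platform's iconv, so "KOI8-R" is an assumption here):

```haskell
import System.IO
import GHC.IO.Encoding (mkTextEncoding)

-- Read a file under an explicitly named 8-bit encoding. Decoding the
-- same file as "ISO-8859-1" and as "KOI8-R" will both succeed, yet
-- yield different strings -- which is exactly why detection by
-- decoding alone is impossible.
readWith :: String -> FilePath -> IO String
readWith encName path = do
  h   <- openFile path ReadMode
  enc <- mkTextEncoding encName   -- e.g. "ISO-8859-1", "KOI8-R"
  hSetEncoding h enc
  s   <- hGetContents h
  length s `seq` hClose h         -- force before closing the handle
  return s
```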

For this reason, smart people invented Unicode and UTF-8, and all you need to do is say: from today on I will

  • write all texts in UTF-8,
  • not accept any file that is not UTF-8 encoded,
  • cancel all social relationships with people who supply allegedly UTF-8 encoded files when it turns out that the file starts with a so-called "byte order mark" (BOM).

Let's make people who stick to 40-year-old proprietary encodings a minority, and even giants like Microsoft will be forced to give up their bad habits!

Ingo answered Nov 19 '22 09:11