I'm on Windows with code page 949, and Excel and Notepad.exe will happily save files in cp949 encoding.
In Python, dealing with them isn't a pain, thanks to str.encode and str.decode.
Recently I've discovered Haskell, and it seems like there is more than one way to manipulate strings. Real World Haskell tells me to use ByteString for efficient IO, but I don't see a way to switch between the encodings I use.
I have to read files that are not in UTF-8 encoding and write them back in their original encoding. Most of them will be cp949.
Internally, my Haskell source will be in UTF-8.
It wasn't that hard in Python, with the principle of str for IO, unicode for processing, but Haskell even lacks built-in cp949 support.
So the question is: how do I do IO over files in various encodings? I have to read, convert, process, and write them.
I tried both options, and it seems the state of text conversion on Windows is abysmal.
text-icu
pros:
text seems to be modern, high-level choice for text manipulation
include and lib folders when installing text-icu with cabal install.
cons:
Lazy bytestrings
iconv
pros:
cons:
iconv (the command-line tool, or the DLL): you have to feed unbuffered input to get proper output, but Haskell's binding seems to only work with lazy bytestrings
You can use the Convert module of the text-icu package for encodings not directly supported by text.
Assuming you already got the encoded ByteString, you would do something like this:
import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.ICU.Convert as Convert

-- Decode CP949 bytes to Text using an ICU converter.
decodeCP949 :: ByteString -> IO Text
decodeCP949 bs = do
  conv <- Convert.open "cp949" Nothing
  return $ Convert.toUnicode conv bs

-- Encode Text back to CP949 bytes.
encodeCP949 :: Text -> IO ByteString
encodeCP949 t = do
  conv <- Convert.open "cp949" Nothing
  return $ Convert.fromUnicode conv t
The IO here is a bit annoying. I think this is a case where using unsafePerformIO to obtain the converter once would be alright.
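A minimal sketch of that idea, assuming the text-icu package is installed and that the converter is only used from a single thread (ICU converters are generally not safe to share across threads). The names cp949Converter and rewriteFile are hypothetical, made up for illustration; rewriteFile shows the read, convert, process, write round trip from the question.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text.ICU.Convert as Convert
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical top-level converter, opened once on first use.
-- NOINLINE keeps GHC from duplicating the unsafePerformIO call.
cp949Converter :: Convert.Converter
cp949Converter = unsafePerformIO (Convert.open "cp949" Nothing)
{-# NOINLINE cp949Converter #-}

-- Pure wrappers around the shared converter.
decodeCP949 :: ByteString -> Text
decodeCP949 = Convert.toUnicode cp949Converter

encodeCP949 :: Text -> ByteString
encodeCP949 = Convert.fromUnicode cp949Converter

-- Hypothetical helper: read a CP949 file, transform its text,
-- and write the result back out in CP949.
rewriteFile :: FilePath -> FilePath -> (Text -> Text) -> IO ()
rewriteFile src dst f = do
  bytes <- B.readFile src
  B.writeFile dst (encodeCP949 (f (decodeCP949 bytes)))
```

With this in place, something like rewriteFile "in.csv" "out.csv" Data.Text.toUpper would keep the file in its original encoding while the processing happens on Text. If the program is multi-threaded, opening a converter per call (as in the IO version above) is the safer choice.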
You can use the Codec.Text.IConv module in the iconv package:
http://hackage.haskell.org/package/iconv-0.4.1.2/docs/Codec-Text-IConv.html
The convert function will convert from one encoding to another, so you can convert a CP949 ByteString to a UTF-8 ByteString (and then to Text if you want).
And you can also reverse the process (Text -> UTF-8 ByteString -> CP949 ByteString).
Here is some example code I found on github:
https://github.com/wookay/da/blob/master/haskell/fun/test_encode.hs
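The conversion chain described above might look roughly like this. It is a sketch, assuming the iconv and text packages are installed and that the underlying iconv library recognizes the names "CP949" and "UTF-8" (GNU libiconv does; other implementations may differ). The file names and the toUpper step are placeholders.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- CP949 bytes -> UTF-8 bytes -> lazy Text.
decodeCP949 :: BL.ByteString -> TL.Text
decodeCP949 = TLE.decodeUtf8 . IConv.convert "CP949" "UTF-8"

-- Lazy Text -> UTF-8 bytes -> CP949 bytes.
encodeCP949 :: TL.Text -> BL.ByteString
encodeCP949 = IConv.convert "UTF-8" "CP949" . TLE.encodeUtf8

main :: IO ()
main = do
  -- Placeholder file names; input and output must be different files
  -- because the input is read lazily.
  input <- BL.readFile "input.txt"
  let processed = TL.toUpper (decodeCP949 input)
  BL.writeFile "output.txt" (encodeCP949 processed)
```

Note that convert works on lazy ByteStrings and raises an error on input it cannot convert; the package also offers convertFuzzy and convertStrictly if you need to control what happens with invalid or unrepresentable characters.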