I'm on Windows with code page 949, and Excel and Notepad.exe will happily save files in cp949 encoding.
In Python, dealing with them isn't a pain, thanks to str.encode and str.decode.
Recently I've discovered Haskell, and it seems like there is more than one way to manipulate strings. Real World Haskell tells me to use ByteString for efficient IO, but I don't see a way to switch between the encodings I use.
I have to read files that are not in UTF-8 encoding, and write them back in their original encoding. Most of them will be cp949.
Internally my Haskell source will be in UTF-8.
It wasn't that hard in Python, with the principle of str for IO, unicode for processing, but Haskell doesn't even seem to have built-in cp949 support.
So the question is: how do I do IO over files in various encodings? I have to read, convert, process, and write them.
I tried both options, and it seems the state of text conversion on Windows is abysmal.
text-icu
pros:
- text seems to be the modern, high-level choice for text manipulation
- include and lib folders when installing text-icu with cabal install.
cons:
- Lazy bytestrings

iconv
pros:
cons:
- iconv (the command-line, or dll) you have to feed unbuffered input to get proper output but Haskell's binding seems to only work with lazy bytestrings

You can use the Convert module of the text-icu package for encodings not directly supported by text.
Assuming you already have the encoded ByteString, you would do something like this:
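    import Data.ByteString (ByteString)
    import Data.Text (Text)
    import qualified Data.Text.ICU.Convert as Convert

    -- Decode CP949 bytes to Text; opening the converter is an IO action.
    decodeCP949 :: ByteString -> IO Text
    decodeCP949 bs = do
      conv <- Convert.open "cp949" Nothing
      return $ Convert.toUnicode conv bs

    -- Encode Text back to CP949 bytes.
    encodeCP949 :: Text -> IO ByteString
    encodeCP949 t = do
      conv <- Convert.open "cp949" Nothing
      return $ Convert.fromUnicode conv t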
The IO here is a bit annoying. I think this is a case where using unsafePerformIO to obtain the converter once would be all right.
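A minimal sketch of that idea, assuming the text-icu package is installed and that ICU recognises the converter name "cp949" as in the code above. The name cp949Converter is made up for illustration. The NOINLINE pragma is the usual precaution for top-level unsafePerformIO values, so that the converter is opened only once. Whether one shared converter is safe to use from several threads at the same time is not something this sketch checks, so treat it as single-threaded.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as Convert
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical shared converter, opened lazily on first use.
-- NOINLINE stops GHC from inlining the definition and opening it repeatedly.
cp949Converter :: Convert.Converter
cp949Converter = unsafePerformIO (Convert.open "cp949" Nothing)
{-# NOINLINE cp949Converter #-}

-- Pure wrappers: the IO needed to open the converter is hidden above.
decodeCP949 :: B.ByteString -> T.Text
decodeCP949 = Convert.toUnicode cp949Converter

encodeCP949 :: T.Text -> B.ByteString
encodeCP949 = Convert.fromUnicode cp949Converter
```

With these, the processing step can stay pure: read the file with B.readFile, apply decodeCP949, transform the Text, and write encodeCP949 of the result with B.writeFile.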
You can use the Codec.Text.IConv module in the iconv package:
http://hackage.haskell.org/package/iconv-0.4.1.2/docs/Codec-Text-IConv.html
The convert function converts from one encoding to another, so you can convert a CP949 ByteString to a UTF-8 ByteString (and then to Text if you want).
You can also reverse the process (Text -> UTF-8 ByteString -> CP949 ByteString).
Here is some example code I found on GitHub:
https://github.com/wookay/da/blob/master/haskell/fun/test_encode.hs
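The conversion chain described in this answer might look roughly like the sketch below. It assumes the iconv, bytestring and text packages, and that the underlying iconv library accepts the encoding names "CP949" and "UTF-8" (this depends on the platform's iconv build). convertFile and its arguments are hypothetical names for illustration. Note that convert works on lazy ByteStrings and raises an error on input it cannot convert.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- CP949 bytes -> UTF-8 bytes -> Text
decodeCP949 :: BL.ByteString -> TL.Text
decodeCP949 = TLE.decodeUtf8 . IConv.convert "CP949" "UTF-8"

-- Text -> UTF-8 bytes -> CP949 bytes
encodeCP949 :: TL.Text -> BL.ByteString
encodeCP949 = IConv.convert "UTF-8" "CP949" . TLE.encodeUtf8

-- Read a CP949 file, transform its text, and write the result as CP949.
-- Assumes src and dst are different paths, because the lazy read keeps
-- the source file open while the output is being written.
convertFile :: (TL.Text -> TL.Text) -> FilePath -> FilePath -> IO ()
convertFile f src dst = do
  bytes <- BL.readFile src
  BL.writeFile dst (encodeCP949 (f (decodeCP949 bytes)))
```

For example, convertFile TL.toUpper "in.txt" "out.txt" would upper-case the text of a CP949 file while keeping the output in CP949; both file names are placeholders.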