
What is the proper way to handle string encoding in Haskell?

I'm on Windows with code page 949, and Excel and Notepad.exe will happily save files in cp949 encoding.

In Python, dealing with them isn't a pain, thanks to str.encode and str.decode.

Recently I've discovered Haskell, and it seems there is more than one way to manipulate strings. Real World Haskell tells me to use ByteString for efficient IO, but I don't see a way to switch between the encodings I use.

I have to read files that are not in UTF-8 encoding and write them back in their original encoding. Most of them will be cp949.

Internally my Haskell source will be in UTF-8.

It wasn't that hard in Python, with the principle of str for IO and unicode for processing, but Haskell even lacks built-in cp949 support.

So the question is: how do I do IO over files in various encodings? I have to read, convert, process, and write them.


Edit:

I tried both options, and it seems the state of text conversion on Windows is abysmal.

text-icu

pros:

  • text seems to be the modern, high-level choice for text manipulation
  • easy to install on Windows: just grab the ICU binaries and point to the include and lib folders when installing text-icu with cabal install.

cons:

  • converters are in IO
  • can't initialize a converter multiple times (something to do with thread safety; I get a runtime error)
  • does not work with lazy ByteStrings
  • requires more than 20 MB of DLLs

iconv

pros:

  • no monads

cons:

  • a pain to install on Windows
  • some decoding failures when I tried it on larger files. Usually with iconv (the command-line tool or the DLL) you have to feed unbuffered input to get proper output, but Haskell's binding seems to work only with lazy ByteStrings

thkang asked Feb 14 '23 17:02


2 Answers

You can use the Data.Text.ICU.Convert module of the text-icu package for encodings not directly supported by text.

Assuming you already have the encoded ByteString, you would do something like this:

import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.ICU.Convert as Convert

-- Decode cp949 bytes into Text; opening the converter requires IO.
decodeCP949 :: ByteString -> IO Text
decodeCP949 bs = do
    conv <- Convert.open "cp949" Nothing
    return $ Convert.toUnicode conv bs

-- Encode Text back into cp949 bytes.
encodeCP949 :: Text -> IO ByteString
encodeCP949 t = do
    conv <- Convert.open "cp949" Nothing
    return $ Convert.fromUnicode conv t

The IO here is a bit annoying. I think this is a case where using unsafePerformIO to obtain the converter once would be alright.
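
A minimal sketch of that idea, assuming text-icu is installed and that the converter name "cp949" resolves on your ICU build, as in the code above. The names cp949Converter and convertFile are hypothetical, introduced only for illustration. ICU converter objects are generally not meant to be shared between threads, so treat this as suitable for single-threaded use only.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text.ICU.Convert as Convert
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical shared converter, opened once on first use.
-- NOINLINE keeps GHC from duplicating the unsafePerformIO call.
-- Assumption: only one thread uses this converter at a time.
cp949Converter :: Convert.Converter
cp949Converter = unsafePerformIO (Convert.open "cp949" Nothing)
{-# NOINLINE cp949Converter #-}

-- Pure wrappers around the shared converter.
decodeCP949 :: ByteString -> Text
decodeCP949 = Convert.toUnicode cp949Converter

encodeCP949 :: Text -> ByteString
encodeCP949 = Convert.fromUnicode cp949Converter

-- Hypothetical helper: read a cp949 file, transform its text,
-- and write the result back in cp949.
convertFile :: (Text -> Text) -> FilePath -> FilePath -> IO ()
convertFile f src dst = do
    raw <- B.readFile src
    B.writeFile dst (encodeCP949 (f (decodeCP949 raw)))
```

With this in place, something like convertFile Data.Text.toUpper "in.txt" "out.txt" (file names are placeholders) would cover the read, convert, process, write cycle from the question using strict ByteStrings.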

Danny Navarro answered Feb 20 '23 17:02


You can use the Codec.Text.IConv module in the iconv package:

http://hackage.haskell.org/package/iconv-0.4.1.2/docs/Codec-Text-IConv.html

The convert function will convert from one encoding to another, so you can convert a CP949 ByteString to a UTF-8 ByteString (and then to Text if you want).

You can also reverse the process (Text -> UTF-8 ByteString -> CP949 ByteString).

Here is some example code I found on GitHub:

https://github.com/wookay/da/blob/master/haskell/fun/test_encode.hs
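
The two conversion directions described in this answer could be sketched as follows, assuming the iconv and text packages are installed and that the underlying iconv library recognizes the names "CP949" and "UTF-8". The function names are hypothetical, and the code works on lazy ByteStrings because that is what Codec.Text.IConv.convert accepts.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- CP949 bytes -> UTF-8 bytes -> Text
decodeCP949 :: BL.ByteString -> TL.Text
decodeCP949 = TLE.decodeUtf8 . IConv.convert "CP949" "UTF-8"

-- Text -> UTF-8 bytes -> CP949 bytes
encodeCP949 :: TL.Text -> BL.ByteString
encodeCP949 = IConv.convert "UTF-8" "CP949" . TLE.encodeUtf8

-- Hypothetical helper: read a CP949 file, transform its text,
-- and write the result back in CP949.
convertFile :: (TL.Text -> TL.Text) -> FilePath -> FilePath -> IO ()
convertFile f src dst = do
    raw <- BL.readFile src
    BL.writeFile dst (encodeCP949 (f (decodeCP949 raw)))
```

Note that convert raises an exception on invalid input; the same module also provides convertFuzzy and convertStrictly if you need to tolerate or report conversion errors.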

ErikR answered Feb 20 '23 16:02