Convert unescaped unicode to utf8 integer

Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.

As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.

I know how to achieve this by using a tortuous (and perhaps laughable) way:

import Data.Aeson (decode)
import Data.ByteString.Lazy.Char8 (pack)
import Data.Text (Text)

-- Wrap the input in quotes so that it parses as a JSON string literal.
decode (pack "\"\\u00b5ABC\"") :: Maybe Text

I am ready to bet there exists a more direct way...

Edit

Following @Alec's comment, I provide more context. In the background, there is a JavaScript program that receives a character string and replaces each character whose code point is between \u007F and \uFFFF with its unicode representation \\uxxxx.

On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.

Asked by Stéphane Laurent, Feb 02 '17 21:02

1 Answer

Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:

import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative

-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
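
As a side note, in recent versions of base (4.9 and later, if I remember the changelog correctly) replicateM only requires Applicative, so the helper above may be unnecessary there. A small sketch comparing the two, where agreeOnMaybe and agreeOnList are made-up names for illustration:

```haskell
import Control.Monad (replicateM)

-- Same helper as above, repeated so this snippet stands alone.
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)

-- Hypothetical checks: both functions should build the same result
-- in the Maybe applicative and in the list applicative.
agreeOnMaybe :: Bool
agreeOnMaybe = replicateA 3 (Just 'x') == replicateM 3 (Just 'x')

agreeOnList :: Bool
agreeOnList = replicateA 2 "ab" == replicateM 2 "ab"
```

Keeping replicateA costs nothing, though, and it compiles on older compilers too.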

Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:

escaped :: RE Char Char
escaped = do
    string "\\u"
    digits <- replicateM 4 (psym isHexDigit)
    return . chr . fst . head . readHex $ digits

The head is safe because we've ensured that readHex is only passed hex digits, and therefore succeeds. We can almost write it like that, except that RE Char is not a Monad. With newish GHCs (8.0 and later) you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write in applicative style ourselves anyway and support all GHCs, so let's do that:

escaped :: RE Char Char
escaped
    =   chr . fst . head . readHex
    <$> (string "\\u"
     *>  replicateA 4 (psym isHexDigit)
        )

Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write

decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
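
If pulling in regex-applicative is not an option, the same idea can be sketched with plain pattern matching and nothing but base. This is only an illustration, not a drop-in replacement: decodeHexPlain and hexValue are names I made up here, and I'd expect it to agree with decodeHex on the inputs discussed in this answer rather than claim the two are equivalent in every corner case.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Hypothetical dependency-free variant of decodeHex.
-- A backslash, a 'u', and exactly four hex digits become one character;
-- everything else is passed through unchanged.
decodeHexPlain :: String -> String
decodeHexPlain ('\\' : 'u' : a : b : c : d : rest)
  | all isHexDigit [a, b, c, d] =
      chr (hexValue [a, b, c, d]) : decodeHexPlain rest
-- If the guard above fails, matching falls through to this clause.
decodeHexPlain (x : rest) = x : decodeHexPlain rest
decodeHexPlain [] = []

-- Interpret a string of hex digits as a number (assumes valid digits).
hexValue :: String -> Int
hexValue = foldl (\acc h -> 16 * acc + digitToInt h) 0
```

The regex version is shorter and easier to extend; this one just makes the "only when four hex digits follow" rule explicit.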

Let's try it in ghci:

> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"

The advantage of writing our own parser like this, instead of relying on something like decode, is that we gain control over, and confidence in, exactly which transformations are performed. For example, since we know \u will always be followed by four hex digits, we transform it only when that happens; if the original, pre-JavaScript text contained \\udefg, it appears unchanged in the final output rather than as \3567g. We also don't have to worry that the decoder is de-escaping other things we don't want it to touch, and we don't have to "extra-escape" our string before handing it off, as you do by adding the extra quotes around it. The disadvantage, of course, is that we had to engineer it ourselves, and we probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!
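
One way to claw back some of that confidence is to generate test inputs by mimicking the JavaScript side in Haskell. The sketch below is my guess at that behaviour based on the question's description, not the actual JavaScript code: escapeChar and encodeHex are hypothetical names, the bounds 0x7F and 0xFFFF are assumed to be inclusive, and the hex digits are assumed to be lowercase. Characters above 0xFFFF are passed through here, whereas JavaScript would see them as surrogate pairs.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Hypothetical approximation of the JavaScript escaping step:
-- code points in [0x7F, 0xFFFF] (assumed inclusive) become \uxxxx.
escapeChar :: Char -> String
escapeChar c
  | n >= 0x7F && n <= 0xFFFF = "\\u" ++ pad (showHex n "")
  | otherwise = [c]
  where
    n = ord c
    -- Left-pad with zeros to exactly four hex digits.
    pad s = replicate (4 - length s) '0' ++ s

-- Escape every character of a string.
encodeHex :: String -> String
encodeHex = concatMap escapeChar
```

With this, decodeHex (encodeHex s) == s is a natural property to check, for instance with QuickCheck, as long as s does not already contain a literal \u followed by four hex digits.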

Answered by Daniel Wagner, Sep 21 '22 02:09
