Convert unescaped unicode to utf8 integer

Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.

As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 both denote µ, since hexadecimal b5 is 181 in decimal). By "string" I mean String or Text.

I know how to achieve this by using a tortuous (and perhaps laughable) way:

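import Data.Aeson (decode)
import Data.ByteString.Lazy.Internal (packChars) -- packChars is exported by the Internal module
import Data.Text (Text)

-- wrap the input in JSON string quotes and let aeson decode the \u escape
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text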

I am ready to bet there exists a more direct way...

Edit

Following @Alec's comment, I provide more context. In the background, there is a JavaScript program that receives a character string and replaces each character whose code point is between \u007F and \uFFFF with its Unicode escape \\uxxxx.

On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
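
To make the setup concrete, the JavaScript-side step described above could be modeled in Haskell roughly as follows. This is only a sketch: the real JavaScript program is not shown here, so the name escapeNonAscii, the inclusive bounds, and the lowercase hex digits are all assumptions.

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- Hypothetical model of the JavaScript side (not the actual program):
-- escape code points in [0x007F, 0xFFFF] as \uxxxx, leave the rest untouched.
escapeNonAscii :: String -> String
escapeNonAscii = concatMap escape
  where
    escape c
      | n >= 0x7F && n <= 0xFFFF = printf "\\u%04x" n  -- four lowercase hex digits
      | otherwise                = [c]
      where
        n = ord c

main :: IO ()
main = putStrLn (escapeNonAscii "\181ABC")  -- prints \u00b5ABC
```

The Haskell side then has to undo exactly this step.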

asked Feb 02 '17 21:02

Stéphane Laurent

1 Answer

Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:

import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative

-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
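
As a quick illustration of what this helper does, here is a small standalone sketch using two ordinary Applicatives from base (Maybe and lists) rather than RE. The helper is the same definition as above; the main function is only there for demonstration.

```haskell
-- Same helper as above: run an action n times and collect the results in order.
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)

main :: IO ()
main = do
  print (replicateA 2 (Just 'x'))                 -- Just "xx"
  print (replicateA 2 (Nothing :: Maybe Char))    -- Nothing
  print (replicateA 2 "ab")                       -- ["aa","ab","ba","bb"]
```

On recent versions of base, replicateM from Control.Monad is itself constrained only by Applicative, so it may be usable here directly; that depends on the GHC version in use.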

Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:

escaped :: RE Char Char
escaped = do
    string "\\u"
    digits <- replicateM 4 (psym isHexDigit)
    return . chr . fst . head . readHex $ digits

The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHCs you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write it in applicative style ourselves anyway and support all GHCs, so let's do that:

escaped :: RE Char Char
escaped
    =   chr . fst . head . readHex
    <$> (string "\\u"
     *>  replicateA 4 (psym isHexDigit)
        )

Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write

decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))

Let's try it in ghci:

> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"

The advantage of writing our own parser like this instead of relying on something like decode is that we gain control and confidence over exactly which transformations are being done. For example, since we know \u will always be followed by four hex digits, we can transform it only when that happens, in case the original, pre-JavaScript text contained \\udefg and we want that to appear in the final output, rather than \3567g. We also don't have to worry that it is de-escaping other things we want left alone, and we don't have to "extra-escape" our string before we hand it off, as you do when adding the extra quotes around it. And of course, the disadvantage is that we had to engineer it ourselves, and probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!
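
For comparison, the same rule (decode \u only when exactly four hex digits follow, pass everything else through) can also be sketched with plain pattern matching and no dependencies beyond base. This is an alternative illustration, not the regex-applicative solution above; the name decodeEscapes is made up, and it has only been checked against the three ghci examples shown earlier.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- Hypothetical base-only variant of decodeHex.
decodeEscapes :: String -> String
decodeEscapes ('\\' : 'u' : rest)
  | (digits, rest') <- splitAt 4 rest
  , length digits == 4              -- need all four digits, not fewer
  , all isHexDigit digits
  , [(code, "")] <- readHex digits  -- total match instead of head
  = chr code : decodeEscapes rest'
decodeEscapes (c : cs) = c : decodeEscapes cs  -- anything else passes through
decodeEscapes []       = []

main :: IO ()
main = do
  print (decodeEscapes "\\u00b5ABC")  -- "\181ABC"
  print (decodeEscapes "\\u00bABC")   -- "\186BC"
  print (decodeEscapes "\\udefg")     -- "\\udefg"
```

When the guards fail, the first clause falls through to the second, so a malformed escape such as \udefg is copied character by character instead of being rejected.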

answered Sep 21 '22 02:09

Daniel Wagner

Tags: string, haskell, unicode, utf-8
