Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
-- packChars is exported by the Internal module, not by Data.ByteString.Lazy
import Data.ByteString.Lazy.Internal (packChars)
import Data.Text (Text)
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Following @Alec's comment, I provide more context. In the background, there is a Javascript program that receives a character string and replaces the characters in this string by their Unicode representation \\uxxxx when this Unicode representation is between \u007F and \uFFFF.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
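To make the input format concrete, here is a minimal Haskell sketch of what the Javascript side is described as doing. The names escapeChar and escapeString are hypothetical, and treating the range 0x7F to 0xFFFF as inclusive is an assumption; the real Javascript program may differ in such details.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Hypothetical stand-in for the Javascript side: escape a character as
-- \uxxxx when its code point lies in 0x7F..0xFFFF (bounds assumed inclusive).
escapeChar :: Char -> String
escapeChar c
  | n >= 0x7F && n <= 0xFFFF = "\\u" ++ pad (showHex n "")
  | otherwise                = [c]
  where
    n     = ord c
    -- left-pad the hex digits with zeros to a width of four
    pad s = replicate (4 - length s) '0' ++ s

-- Escape every character of a string.
escapeString :: String -> String
escapeString = concatMap escapeChar
```

With this sketch, escapeString "\181ABC" produces the string written "\\u00b5ABC" in Haskell source, which is the input the Haskell side has to decode.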
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
    string "\\u"
    digits <- replicateM 4 (psym isHexDigit)
    return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHCs you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write in applicative style ourselves anyway and support all GHCs, so let's do that:
escaped :: RE Char Char
escaped
    =   chr . fst . head . readHex
    <$> (string "\\u"
     *>  replicateA 4 (psym isHexDigit)
        )
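To see why the head is safe here, a small sketch of the conversion step on its own may help. The name hexToChar is hypothetical and only isolates the function that escaped applies to the four matched digits.

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- Hypothetical helper: the same conversion that escaped applies to the
-- matched digits. readHex returns a list of (value, leftover) parses; for a
-- non-empty string of hex digits that list has exactly one element, so head
-- cannot fail. On input without a leading hex digit, readHex returns [] and
-- head would crash, which is why the regex checks the digits first.
hexToChar :: String -> Char
hexToChar = chr . fst . head . readHex
```

For example, readHex "00b5" yields [(181,"")], so hexToChar "00b5" is the character '\181'.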
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this instead of relying on something like decode is that we gain control and confidence over exactly which transformations are being done. For example, since we know \u will always be followed by four hex digits, we can transform it only when that happens, in case the original, pre-Javascript text contained \\udefg and we want that to appear in the final output, rather than \3567g. We also don't have to worry that it is trying to de-escape other things that we don't want it to, and we don't have to "extra-escape" our string before we hand it off, as you do by adding the extra quotes around it. And of course, the disadvantage is that we had to engineer it ourselves, and probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!
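If pulling in regex-applicative is not an option, the same rule (decode \u only when it is followed by exactly four hex digits, pass everything else through) can also be sketched with plain recursion over String using only base. This is an alternative technique, not the regex approach above, and the name decodeEscapes is hypothetical.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- Hypothetical base-only decoder: replace each \uXXXX (backslash, 'u', four
-- hex digits) by the character with that code point; copy anything else.
decodeEscapes :: String -> String
decodeEscapes ('\\' : 'u' : rest)
  | (digits, rest') <- splitAt 4 rest
  , length digits == 4
  , all isHexDigit digits
  , [(n, "")] <- readHex digits
  = chr n : decodeEscapes rest'
-- If the guards above fail, the backslash falls through to this clause and
-- is copied unchanged, like any other character.
decodeEscapes (c : cs) = c : decodeEscapes cs
decodeEscapes []       = []
```

On the three inputs tried in ghci above it gives the same results as decodeHex: "\181ABC", "\186BC" and "\\udefg" respectively.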