 

Decode or unescape \u00f0\u009f\u0091\u008d to 👍

We all know UTF-8 is hard. I exported my messages from Facebook and the resulting JSON file escaped all non-ascii characters to unicode code points.

I am looking for an easy way to unescape these unicode code points to regular old UTF-8. I also would love to use PowerShell.

I tried

$str = "\u00f0\u009f\u0091\u008d"
[Regex]::Replace($str, "\\[Uu]([0-9A-Fa-f]{4})", `
{[char]::ToString([Convert]::ToInt32($args[0].Groups[1].Value, 16))} )

but that only gives me ð as a result, not 👍.

I also tried using Notepad++ and I found this SO post: How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++. The accepted answer also results in exactly the same as the example above: ð.

I did find a working decoding solution: the UTF8.js library decodes the text perfectly, and you can try it out online (with \u00f0\u009f\u0091\u008d as input).

Is there a way in PowerShell to decode \u00f0\u009f\u0091\u008d to receive 👍? I'd love to have real UTF-8 in my exported Facebook messages so I can actually read them.

Bonus points for helping me understand what \u00f0\u009f\u0091\u008d actually represents (besides it being some UTF-8 hex representation). Why is it the same as U+1F44D or \uD83D\uDC4D in C++?

asked Jun 12 '18 by Dennis G

1 Answer

The Unicode code point of the 👍 character is U+1F44D.

Using the variable-length UTF-8 encoding, the following 4 bytes (expressed as hex. numbers) are needed to represent this code point: F0 9F 91 8D.
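As a quick illustration of that relationship (Python used here only for demonstration; it is not part of the PowerShell answer), encoding the single code point U+1F44D with UTF-8 yields exactly those four bytes:

```python
# U+1F44D THUMBS UP SIGN: one code point, four UTF-8 bytes.
s = "\U0001F44D"                      # Python's 8-hex-digit escape for a non-BMP code point

utf8_bytes = s.encode("utf-8")
print(utf8_bytes.hex(" ").upper())    # F0 9F 91 8D

# Decoding those same four bytes recovers the single character.
print(utf8_bytes.decode("utf-8") == s)   # True
```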

While these bytes are recognizable in your string,

$str = "\u00f0\u009f\u0091\u008d"

they shouldn't be represented as \u escape codes, because they're not Unicode code units / code points; they're bytes.

With 4-hex-digit escape sequences (which represent UTF-16 code units), the proper representation requires two 16-bit code units, a so-called surrogate pair, which together represent the single non-BMP code point U+1F44D:
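How that pair is derived from the code point can be sketched with the standard UTF-16 formula (Python here is purely illustrative; the arithmetic is the same in any language, including the C++ mentioned in the question):

```python
cp = 0x1F44D                     # code point above the BMP (> 0xFFFF)
v = cp - 0x10000                 # 20-bit value to split across two code units
high = 0xD800 + (v >> 10)        # high (lead) surrogate: top 10 bits
low  = 0xDC00 + (v & 0x3FF)     # low (trail) surrogate: bottom 10 bits
print(f"\\u{high:04X}\\u{low:04X}")   # \uD83D\uDC4D
```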

$str = "\uD83D\uDC4D"

If your JSON input used such proper Unicode escapes, PowerShell would process the string correctly; e.g.:

'{ "str": "\uD83D\uDC4D" }' | ConvertFrom-Json > out.txt

If you examine file out.txt, you'll see something like:

str
---
👍

(The output was sent to a file because console windows wouldn't render the 👍 char. correctly, at least not without additional configuration; on PowerShell Core on Linux or macOS, however, terminal output works.)


Therefore, the best solution would be to correct the problem at the source and use proper Unicode escapes (or even use the characters themselves, as long as the source supports any of the standard Unicode encodings).

If you really must parse the broken representation, try the following workaround (PSv4+), building on your own [regex]::Replace() technique:

$str = "A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."

[regex]::replace($str, '(?:\\u[0-9a-f]{4})+', { param($m) 
  $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
  [text.encoding]::utf8.GetString($utf8Bytes)
})

This should yield A 👍 for Motörhead.

The above translates sequences of \u... escapes into the byte values they represent and interprets the resulting byte array as UTF-8 text.
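The same idea can be sketched in Python (an illustrative port, not from the original answer): grab each run of \uXXXX escapes, treat each 4-hex-digit value as one raw byte, and decode the resulting byte sequence as UTF-8:

```python
import re

def fix_mis_escaped(s: str) -> str:
    """Reinterpret runs of \\uXXXX escapes as UTF-8 bytes, not code units."""
    def decode_run(m: re.Match) -> str:
        # Each \uXXXX in the broken input actually holds one UTF-8 byte (0x00-0xFF).
        raw = bytes(int(h, 16) for h in re.findall(r"\\u([0-9a-fA-F]{4})", m.group(0)))
        return raw.decode("utf-8")
    return re.sub(r"(?:\\u[0-9a-fA-F]{4})+", decode_run, s)

print(fix_mis_escaped(r"A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."))
# A 👍 for Motörhead.
```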


To save the decoded string to a UTF-8 file, use ... | Set-Content -Encoding utf8 out.txt

Alternatively, in PSv5+, as Dennis himself suggests, you can make Out-File, and therefore its virtual alias, >, default to UTF-8 via PowerShell's global parameter-defaults hashtable:

$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'

Note, however, that on Windows PowerShell (as opposed to PowerShell Core) you'll get a UTF-8 file with a BOM in both cases; avoiding that requires direct use of the .NET Framework: see Using PowerShell to write a file in UTF-8 without the BOM.

answered Oct 03 '22 by mklement0