I am parsing some XML text through an API without saving the actual file and have run into an issue when the text includes characters from other languages.
When trying to convert 'é' or similar characters, I end up with the text Ã© instead. Is there a way to change the encoding of a variable in memory, given that I am not using any files?
Any help would be greatly appreciated.
It looks like the character encoding of the original text was misinterpreted when the text was converted to .NET strings.
Specifically, it looks like UTF-8-encoded text was misinterpreted as "ANSI"-encoded or, in the context of cmdlets such as Invoke-WebRequest, as a similar fixed-width single-byte encoding such as ISO-8859-1, so that each byte in the UTF-8 input became a character in its own right, even though UTF-8 encodes non-ASCII-range characters as multiple bytes.
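To make the misinterpretation concrete, here is a minimal sketch that reproduces it on purpose. It assumes the wrong encoding was ISO-8859-1 (code page 28591); the variable names are for illustration only.

```powershell
# Build 'é' (U+00E9) from its code point so the example does not depend
# on the encoding of the script file itself.
$original = [string][char]0xE9

# UTF-8 encodes this single character as two bytes: C3 A9.
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($original)
($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

# Decoding those bytes as ISO-8859-1 turns each byte into its own character.
$misread = [Text.Encoding]::GetEncoding(28591).GetString($utf8Bytes)
$misread         # two characters: U+00C3 (Ã) and U+00A9 (©)
$misread.Length  # 2
```

One character became two because the decoder treated each UTF-8 byte as a complete character.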
To correct this problem, you must re-encode the string:
1. Convert the misinterpreted string back to bytes using the mistakenly applied encoding, which recovers the original byte representation.
2. Convert these bytes back to a string using the true encoding, namely UTF-8.
# Note: Works in Windows PowerShell only - in PowerShell Core,
# [Text.Encoding]::Default is *invariably* UTF-8.
$originalBytes = [Text.Encoding]::Default.GetBytes('Ã©')
[Text.Encoding]::Utf8.GetString($originalBytes)
The above yields é.
In Windows PowerShell, [Text.Encoding]::Default is your system's "ANSI" encoding. If the text was misread as ISO-8859-1 instead, use [Text.Encoding]::GetEncoding(28591) in its place.
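As a sketch of how the ISO-8859-1 variant could be packaged for reuse, the function below names the wrong encoding explicitly rather than relying on [Text.Encoding]::Default, so it should behave the same in both PowerShell editions. The function name Repair-Utf8Mojibake and its parameters are hypothetical, and the default of 28591 is an assumption about which encoding was applied by mistake.

```powershell
# Hypothetical helper: undoes a UTF-8-read-as-single-byte-encoding mix-up.
function Repair-Utf8Mojibake {
    param(
        [Parameter(Mandatory)] [string] $Text,
        # Code page that was applied by mistake; 28591 = ISO-8859-1 (assumed).
        [int] $WrongCodePage = 28591
    )
    # Step 1: recover the original bytes via the mistakenly applied encoding.
    $bytes = [Text.Encoding]::GetEncoding($WrongCodePage).GetBytes($Text)
    # Step 2: decode those bytes with the true encoding, UTF-8.
    [Text.Encoding]::UTF8.GetString($bytes)
}

# Build the garbled input from code points to avoid script-file encoding issues.
$garbled = -join [char[]] (0xC3, 0xA9)   # 'Ã©'
$fixed = Repair-Utf8Mojibake -Text $garbled
$fixed   # é
```

The repair only works if the code page passed in matches the one that was actually applied by mistake. If it does not, the recovered bytes will not be the original UTF-8 bytes.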
Note that the entire problem would not have arisen in PowerShell Core, which consistently defaults to (BOM-less) UTF-8.
Should you find yourself in need of using the "ANSI" encoding even in PowerShell Core, see this answer.