How to ensure Python prints UTF-8 (and not UTF-16-LE) when piped in PowerShell?

I want to print text as UTF-8 when piped (to, for example, a file), so on Python 3.7.3 on Windows 10 via PowerShell, I'm doing this:

import sys

# Only reconfigure the encoding when stdout is redirected
# (i.e., not attached to an interactive terminal).
if not sys.stdout.isatty():
    sys.stdout.reconfigure(encoding='utf-8')

print("Mamma mia.")

When run as encodingtest.py > test.txt, test.txt contains the following:

00000000  FF FE 4D 00 61 00 6D 00 6D 00 61 00 20 00 6D 00  ÿþM.a.m.m.a. .m.
00000010  69 00 61 00 2E 00 0D 00 0A 00                    i.a.......

Mysteriously enough, it starts with FF FE, which is the byte-order mark for UTF-16-LE – and null bytes appear between the characters (as UTF-16 would have it)! However, when I run it via CMD rather than PowerShell, it prints UTF-8 just fine. How do I get Python to print UTF-8 even when piped via PowerShell?
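(You can confirm that the file really is UTF-16-LE by decoding the exact bytes from the hex dump above – a quick sanity check in Python, not part of any fix:)

```python
# The exact bytes from the hex dump above: a UTF-16-LE BOM (FF FE)
# followed by two-byte code units.
data = bytes.fromhex(
    "FFFE 4D00 6100 6D00 6D00 6100 2000 6D00 6900 6100 2E00 0D00 0A00"
)

# Python's 'utf-16' codec consumes the BOM and infers the byte order from it.
print(repr(data.decode("utf-16")))  # → 'Mamma mia.\r\n'
```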

I could run encodingtest.py | Out-File -Encoding UTF8 test.txt instead, but is there a way to ensure the output encoding program-side?

Asked by obskyr on Jul 22 '21

1 Answer

PowerShell fundamentally doesn't support processing raw output (a stream of bytes) from external programs:

  • It invariably decodes such output as text, using the character encoding stored in [Console]::OutputEncoding

    • See this answer for more information.
  • Once decoded, it uses its own default character encoding for file-output operations such as > (effectively an alias for the Out-File cmdlet); for >, that default is:

    • Windows PowerShell (up to v5.1): "Unicode", i.e. UTF-16LE (which is what you're seeing)
    • PowerShell (Core, v6+): BOM-less UTF-8 (now applied consistently across all cmdlets, unlike in Windows PowerShell).

In other words: Even use of just > involves a character decoding and re-encoding cycle, with no relationship between the original and the resulting encoding.
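The cycle can be simulated in Python (a sketch of the behavior described above, not PowerShell's actual implementation; the cp437 code page is an assumption standing in for a typical [Console]::OutputEncoding default):

```python
import codecs

# 1. The Python script emits UTF-8 bytes (pure ASCII here, so any
#    single-byte decoding happens to survive unchanged).
raw = "Mamma mia.\r\n".encode("utf-8")

# 2. PowerShell decodes the byte stream into a .NET string using
#    [Console]::OutputEncoding (often an OEM code page such as cp437).
text = raw.decode("cp437")

# 3. Windows PowerShell's > then re-encodes the string as UTF-16-LE
#    with a BOM, regardless of the original encoding.
file_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(file_bytes[:8])  # → b'\xff\xfeM\x00a\x00m\x00', matching the hex dump
```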


Therefore:

  • (Temporarily) set [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

  • Pipe the output from your Python script call to Out-File - or, preferably, since the input is known to consist of strings already (always true for external-program calls), to Set-Content - with -Encoding utf8.

    • Caveat: In Windows PowerShell, you'll invariably get a UTF-8 file with a BOM (see this answer for a workaround). In PowerShell (Core), you'll get one without a BOM (as you would by default), but can opt to create one with -Encoding utf8BOM.

To put it all together (saving and restoring the original [Console]::OutputEncoding not shown):

[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
encodingtest.py | Set-Content -Encoding utf8 test.txt

Modifying [Console]::OutputEncoding isn't necessary if you've switched to UTF-8 system-wide, as described in this answer, but note that this Windows 10 feature is still in beta as of this writing and has far-reaching consequences.


Alternatively, call via cmd.exe, which does pass the raw bytes through to a file with >:

cmd /c 'encodingtest.py > test.txt'

This technique (which analogously applies to Unix-like platforms via /bin/sh -c) is the general workaround for the lack of raw byte processing (see below).


Background information: lack of support for raw byte streams in PowerShell's pipeline

PowerShell's pipeline is object-based, which means that it is instances of .NET types that flow through it. This evolution of the traditional, binary-only pipeline is the key to PowerShell's power and versatility.

Everything in PowerShell is mediated via pipelines, including use of the redirection operator >, with ... > foo.txt in effect being syntactic sugar for ... | Out-File foo.txt

  • For PowerShell-native commands, which invariably output .NET objects, some form of encoding is necessary in order to write these objects to a file in a meaningful way (unless the objects are strings already, raw byte representations wouldn't make sense). Therefore, text representations based on PowerShell's for-display output formatting system are used - which, incidentally, is the reason why > with non-string input is generally unsuited to producing files for later programmatic processing.

  • For external programs, PowerShell has chosen to only ever communicate with them via text (strings), which on receiving output involves the inevitable decoding of the raw bytes received into .NET strings, as described above.

  • See this answer for more information.

This lack of support for raw byte streams is problematic: Unless you call the underlying .NET APIs directly to explicitly handle byte streams (which would be quite cumbersome), the cycle of decoding and re-encoding as text:

  • can alter the data, interfering not only with sending byte streams to files, but also with piping data between/to external programs; see this answer for an example.

  • can significantly degrade performance.
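The data-alteration problem is easy to demonstrate: any byte sequence that isn't valid in the decoding's character encoding is irreversibly replaced. A Python sketch (using errors='replace' to mimic .NET's default replacement-character fallback - an assumption about the decoding behavior, not a PowerShell API):

```python
# Arbitrary binary output, e.g. the start of a PNG or gzip stream.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0xFE, 0x80])

# Decode as UTF-8 the way a text-only pipeline would; byte sequences that
# are invalid UTF-8 become U+FFFD (the Unicode replacement character).
text = raw.decode("utf-8", errors="replace")

# Re-encoding cannot restore the original bytes.
back = text.encode("utf-8")
print(back == raw)  # → False: the non-UTF-8 bytes were lost to U+FFFD
```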

Historically, when PowerShell was a Windows-only shell, this wasn't much of a problem, because the Windows world didn't have many capable CLIs (command-line utilities) worth calling, so staying within the realm of PowerShell was usually sufficient (performance problems notwithstanding).

In an increasingly cross-platform world, however, and especially on Unix-like platforms, capable CLIs abound and are sometimes indispensable for high-performance operations.

Therefore, PowerShell should support raw byte streams at least on demand, and situationally even automatically when detecting that data is being piped between two external programs. See GitHub issue #1908 and GitHub issue #5974.

Answered by mklement0 on Oct 12 '22