
How to make Popen() understand UTF-8 properly?

Tags: python

This is my code in Python:

[...]
proc = Popen(path, stdin=stdin, stdout=PIPE, stderr=PIPE)
result = [x for x in proc.stdout.readlines()]
result = ''.join(result);

Everything works fine when the output is ASCII. When I receive UTF-8 text on stdout, the result is unpredictable, and in most cases the output is damaged. What is wrong here?

Btw, maybe this code should be optimized somehow?

yegor256 asked Oct 13 '10 19:10


2 Answers

Have you tried decoding each line first, and then joining the resulting Unicode strings together? In Python 2.4+ (at least), this can be achieved with

result = [x.decode('utf8') for x in proc.stdout.readlines()]

The important point is that your lines x are sequences of bytes that must be interpreted as representing characters. The decode() method performs this interpretation (here, the bytes are assumed to be in the UTF-8 encoding). In Python 2, x.decode('utf8') is of type unicode, which you can think of as a "string of characters", as opposed to a "string of numbers between 0 and 255" (bytes). In Python 3, the same call returns a str.
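
As a minimal sketch of the same idea on Python 3, assuming the child process really emits UTF-8: the helper name run_and_decode and the child command below are hypothetical stand-ins, not part of the original code. communicate() reads both pipes to the end, and the bytes are decoded once rather than line by line:

```python
import subprocess
import sys

# Hypothetical child command for illustration: writes UTF-8 bytes to stdout.
CHILD_CODE = r"import sys; sys.stdout.buffer.write('h\u00e9llo\n'.encode('utf-8'))"


def run_and_decode(cmd, encoding="utf-8"):
    """Run cmd, collect its raw output, and decode it once at the end."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_bytes, err_bytes = proc.communicate()  # reads both pipes until EOF
    return out_bytes.decode(encoding), err_bytes.decode(encoding)


result, errors = run_and_decode([sys.executable, "-c", CHILD_CODE])
print(ascii(result))  # ascii() keeps the print safe on non-UTF-8 terminals
```

On Python 3.6 and later, passing encoding='utf-8' to Popen should make the pipes return text directly, so the explicit decode() step can be dropped.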

Eric O Lebigot answered Sep 22 '22 23:09


I ran into the same issue when using LogPipe.

I solved it by passing the additional arguments encoding='utf-8' and errors='ignore' to os.fdopen().

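# https://codereview.stackexchange.com/questions/6567/redirecting-subprocesses-output-stdout-and-stderr-to-the-logging-module
import logging
import os
import threading

# Assumption: vlogger is defined elsewhere in the original code; any logger works here
vlogger = logging.getLogger(__name__)


class LogPipe(threading.Thread):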
    def __init__(self):
        """Set up the pipe and its UTF-8 reader,
        and start the thread
        """
        threading.Thread.__init__(self)
        self.daemon = False
        # self.level = level
        self.fdRead, self.fdWrite = os.pipe()
        # Decode the pipe as UTF-8 and silently drop bytes that are not valid UTF-8
        self.pipeReader = os.fdopen(self.fdRead, encoding='utf-8', errors='ignore')
        self.start()

    def fileno(self):
        """Return the write file descriptor of the pipe
        """
        return self.fdWrite

    def run(self):
        """Run the thread, logging everything.
        """
        for line in iter(self.pipeReader.readline, ''):
            # vlogger.log(self.level, line.strip('\n'))
            vlogger.debug(line.strip('\n'))

        self.pipeReader.close()

    def close(self):
        """Close the write end of the pipe.
        """
        os.close(self.fdWrite)
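
To see what errors='ignore' does in isolation, here is a small sketch that is not part of the class above. It writes a hypothetical byte string containing one invalid byte (0xFF) into a pipe and reads it back through os.fdopen() with the same arguments:

```python
import os

read_fd, write_fd = os.pipe()

# "café" encoded as UTF-8, then a stray 0xFF byte that is never valid UTF-8
os.write(write_fd, b"caf\xc3\xa9 \xff ok\n")
os.close(write_fd)  # closing the write end lets the reader see EOF

with os.fdopen(read_fd, encoding="utf-8", errors="ignore") as reader:
    lines = list(iter(reader.readline, ""))

print([ascii(line) for line in lines])
```

The invalid byte is dropped without any error, so damaged input can go unnoticed. errors='replace' is an alternative that substitutes U+FFFD instead, which at least leaves a visible marker in the log.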

hailinzeng answered Sep 24 '22 23:09