
Python sys.stdin throws a UnicodeDecodeError

I'm trying to write a (very) basic web crawler using cURL and Python's BeautifulSoup library (since this is much easier to understand than GNU awk and a mess of regular expressions).

Currently, I'm trying to pipe the contents of a webpage to the program with cURL (i.e., curl http://www.example.com/ | ./parse-html.py).

For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I've looked at this answer and this answer about invalid start bytes, but did not figure out how to solve the issue from them).

Specifically, I've tried to use a.encode('utf-8').split() from the first answer. The second answer simply explained the issue (that Python found an invalid start byte), though it didn't give a solution.

I've attempted redirecting the output of cURL to a file (i.e., curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, though this causes the same UnicodeDecodeError.

I've checked, and the output of locale charmap is UTF-8, which, as far as I know, means that my system encodes characters in UTF-8 (which makes me especially confused about this UnicodeDecodeError).

At the moment, the exact line causing the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I've tried rewriting this as a for-loop, though I get the same issue.

What exactly is causing the UnicodeDecodeError and how should I fix the issue?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the issue.

Charles German asked Jan 20 '16 02:01





1 Answer

The problem occurs while reading (decoding) the input, not while encoding it: the input is simply not encoded as UTF-8, but in some other encoding. In a UTF-8 shell, you can easily reproduce the problem with:

$ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte
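
The same failure can be reproduced without a shell. This is only a minimal sketch (the helper name try_utf8 is made up for illustration): encoding "2¥" as ISO 8859-1 produces the single byte 0xa5 for the yen sign, and that byte cannot start a UTF-8 sequence.

```python
def try_utf8(raw):
    """Return (text, None) on success, or (None, error) if raw is not valid UTF-8."""
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as e:
        return None, e


# '\u00a5' is the yen sign; in ISO 8859-1 it is the single byte 0xa5.
latin1_bytes = '2\u00a5\n'.encode('iso8859-1')

text, err = try_utf8(latin1_bytes)
print(err)  # the same kind of message as in the traceback above
```

Decoding the same bytes with the encoding they were written in (latin1_bytes.decode('iso8859-1')) succeeds, which is why the fix has to happen at the reading step.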

You can read the input as binary instead, using sys.stdin.buffer.read() or with open(..., 'rb') as f: f.read(). Either gives you a bytes object that you can examine in order to guess the encoding. The actual algorithm for doing that is documented in the HTML standard.
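
As a rough sketch of the binary-reading approach (the function names and the candidate list are assumptions for illustration, not the detection algorithm from the HTML standard), you could read raw bytes and try a few encodings in order:

```python
import sys


def decode_bytes(raw, candidates=('utf-8', 'iso8859-1')):
    """Decode raw bytes using the first candidate encoding that succeeds."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing
    # undecodable bytes with U+FFFD instead of raising.
    return raw.decode(candidates[0], errors='replace')


def read_stdin_text():
    # sys.stdin.buffer yields raw bytes, so no implicit UTF-8 decoding happens.
    return decode_bytes(sys.stdin.buffer.read())
```

In the script from the question, html_doc = read_stdin_text() could stand in for the failing line. Note that ISO 8859-1 accepts every byte value, so with the default candidates the result may be wrong text rather than an error if the page actually uses a third encoding.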

However, in many cases the encoding is not specified in the file itself, but in the HTTP Content-Type header. Unfortunately, your invocation of curl does not capture this header. Instead of combining curl and Python, you can use Python alone; it can already download URLs. Stealing the encoding detection algorithm from youtube-dl, we get something like:

import re
import urllib.request


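def guess_encoding(content_type, webpage_bytes):
    # 1. Prefer the charset declared in the HTTP Content-Type header.
    #    The header may be missing (None), so fall back to an empty string.
    m = re.match(
        r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
        content_type or '')
    if m:
        encoding = m.group(1)
    else:
        # 2. Otherwise look for a <meta ... charset=...> tag near the top.
        m = re.search(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                      webpage_bytes[:1024])
        if m:
            encoding = m.group(1).decode('ascii')
        elif webpage_bytes.startswith(b'\xff\xfe'):
            # 3. A UTF-16 little-endian byte order mark.
            encoding = 'utf-16'
        else:
            # 4. Nothing found: assume UTF-8.
            encoding = 'utf-8'

    return encoding


def download_html(url):
    with urllib.request.urlopen(url) as urlh:
        # Read raw bytes first, then decode with the guessed encoding.
        content = urlh.read()
        encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)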

print(download_html('https://phihag.de/2016/iso8859.php'))
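
If you would rather not maintain the header regular expression yourself, the standard library can parse the charset parameter for you. The following is a sketch rather than a drop-in replacement (charset_from_content_type is a name chosen here for illustration); it relies on email.message.Message.get_content_charset(), which returns the charset in lower case, or None when there is none:

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lower-cased charset of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


print(charset_from_content_type('text/html; charset=ISO-8859-1'))
```

The response object returned by urllib.request.urlopen exposes its headers as a message of this kind, so urlh.headers.get_content_charset() should give the same answer directly. The meta tag and byte order mark checks are still needed when the header carries no charset.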

Some third-party libraries (that is, not part of the standard library) support this out of the box, most notably requests.
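
A minimal sketch with requests, assuming the package is installed (pip install requests); the function name fetch_text and the timeout value are arbitrary choices for this example:

```python
def fetch_text(url):
    """Download url and return its body as a decoded str."""
    # Imported inside the function because requests is a third-party package.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.text decodes the body using response.encoding, which
    # requests derives from the Content-Type header; when the header
    # gives no usable encoding, requests guesses from the body instead.
    return response.text
```

If the guessed encoding looks wrong, you can inspect response.encoding and response.apparent_encoding, or set response.encoding yourself before reading response.text.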

I also recommend that you read up on the basics of what encodings are.

phihag answered Sep 27 '22 19:09
