Python sys.stdin throws a UnicodeDecodeError

Tags:

python-3.x

encoding

utf-8

sys

I'm trying to write a (very) basic web crawler using cURL and Python's BeautifulSoup library (since this is much easier to understand than GNU awk and a mess of regular expressions).

Currently, I'm trying to pipe the contents of a webpage to the program with cURL (i.e., curl http://www.example.com/ | ./parse-html.py).

For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I've looked at this answer and this answer about invalid start bytes, but did not figure out how to solve the issue from them).

Specifically, I've tried to use a.encode('utf-8').split() from the first answer. The second answer simply explained the issue (that Python found an invalid start byte), though it didn't give a solution.

I've attempted redirecting the output of cURL to a file (i.e., curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, though this causes the same UnicodeDecodeError.

I've checked, and the output of locale charmap is UTF-8, which, as far as I know, means that my system is encoding characters in UTF-8 (which makes me especially confused about this UnicodeDecodeError).

At the moment, the exact line causing the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I've tried rewriting this as a for-loop, though I get the same issue.

What exactly is causing the UnicodeDecodeError and how should I fix the issue?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the issue.

999

asked Jan 20 '16 02:01

Charles German

1 Answer

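The problem occurs during reading, not encoding: the input is simply not encoded as UTF-8, but in some other encoding. In text mode, sys.stdin decodes the incoming bytes using the locale's encoding (UTF-8 on your system), so any byte sequence that is not valid UTF-8 raises the error as soon as it is read. In a UTF-8 shell, you can easily reproduce the problem with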

$ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte
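
The same failure can be reproduced without a shell. The following is a minimal sketch, not part of the original answer; the helper name try_decode is made up for illustration. It encodes the same two characters as ISO 8859-1 and then tries to decode the resulting bytes as UTF-8.

```python
def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error) if decoding fails."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# '\u00a5' is the yen sign; in ISO 8859-1 it is the single byte 0xa5.
raw = '2\u00a5\n'.encode('iso8859-1')

text, error = try_decode(raw, 'utf-8')
print(error)  # the same kind of error as in the traceback above

text, error = try_decode(raw, 'iso8859-1')
print(repr(text))  # decoding with the matching encoding succeeds
```

The byte 0xa5 can only appear as a continuation byte in UTF-8, which is why the decoder reports an invalid start byte.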

You can read the file (sys.stdin.buffer.read(), or with open(..., 'rb') as f: f.read()) as binary (you'll get a bytes object), examine it, and guess the encoding. The actual algorithm to do that is documented in the HTML standard.
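
As a rough illustration of reading binary input and guessing, here is a small sketch. It is not the algorithm from the HTML standard; it only tries a list of candidate encodings in order. The function names and the candidate list are assumptions made for this example.

```python
import sys


def decode_with_fallback(raw, encodings=('utf-8', 'iso8859-1')):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the input')


def read_stdin_text():
    """Read all of stdin as bytes, then decode it with the fallback list."""
    raw = sys.stdin.buffer.read()  # bytes, no implicit decoding
    return decode_with_fallback(raw)
```

Calling read_stdin_text() in parse-html.py would replace the failing sys.stdin.readlines() call. Note that ISO 8859-1 accepts every byte sequence, so it only makes sense as the last candidate, and the result may still be the wrong guess for pages in other encodings.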

However, in many cases the encoding is not specified in the file itself, but via the HTTP Content-Type header. Unfortunately, your invocation of curl does not capture this header. Instead of combining curl and Python, you can use Python alone, since it can already download URLs. Stealing the encoding detection algorithm from youtube-dl, we get something like:

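import re
import urllib.request


def guess_encoding(content_type, webpage_bytes):
    # 1. Prefer the charset parameter of the HTTP Content-Type header.
    #    The header may be absent (getheader() returns None), hence the "or ''".
    m = re.match(
        r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
        content_type or '')
    if m:
        encoding = m.group(1)
    else:
        # 2. Otherwise look for a <meta ... charset=...> declaration
        #    within the first 1024 bytes of the document.
        m = re.search(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                      webpage_bytes[:1024])
        if m:
            encoding = m.group(1).decode('ascii')
        elif webpage_bytes.startswith(b'\xff\xfe'):
            # 3. A UTF-16 (little-endian) byte order mark.
            encoding = 'utf-16'
        else:
            # 4. Fall back to UTF-8.
            encoding = 'utf-8'

    return encoding


def download_html(url):
    with urllib.request.urlopen(url) as urlh:
        content = urlh.read()  # raw bytes, not yet decoded
        encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)

print(download_html('https://phihag.de/2016/iso8859.php'))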
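
The charset parameter of a Content-Type value can also be extracted with the standard library's email.message module instead of a regular expression. This is only a sketch of an alternative, not what youtube-dl does; the function name charset_from_content_type is made up for this example.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lower-cased charset parameter, or None if there is none."""
    if not content_type:
        return None  # header missing or empty
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/html'))
```

When this returns None, the encoding still has to be guessed from the document bytes, for example from a meta tag or a byte order mark.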

There are also some libraries (though not in the standard library) which support this out of the box, namely requests.

I also recommend that you read up on the basics of what encodings are.

197

answered Sep 27 '22 19:09

phihag
