I'm trying to open a webpage using urllib.request.urlopen() and then search it with regular expressions, but that gives the following error:
TypeError: can't use a string pattern on a bytes-like object
I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, I assume I should read the encoding from the header info, or from the encoding declared in the HTML if there is one, and then decode the bytes with that?
urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of different protocols.
The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.
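To make the Content-Type inspection concrete, here is a minimal sketch. It builds a header object offline with email.message.Message so that it runs without a network connection; the header value is made up for illustration. The headers attribute of a real urlopen() response is an http.client.HTTPMessage, which is a subclass of that class and offers the same methods.

```python
from email.message import Message

# Stand-in for response.headers; the header value below is an
# illustrative assumption, not taken from a real server.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

media_type = headers.get_content_type()        # media type without parameters
charset = headers.get_content_charset()        # lowercased, or None if absent
is_text = headers.get_content_maintype() == "text"

print(media_type, charset, is_text)
```

If the server sends no charset parameter, get_content_charset() returns None, so the caller has to choose a fallback encoding.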
We use the urllib.request module in Python to access and open URLs, which most often use the HTTP protocol. The interface is simple enough for beginners to learn: the urlopen function can fetch URLs using a variety of different protocols.
The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns an ASCII string in percent-encoded key=value format. That string should be encoded to bytes before being used as the data parameter.
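As a short sketch of that step: the parameter names and the example.com URL below are placeholders chosen for illustration, and no request is actually sent.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical form fields, for illustration only.
params = {"q": "python regex", "page": 2}

query = urlencode(params)        # str: "q=python+regex&page=2"
data = query.encode("ascii")     # bytes, as required for the data parameter

# Building the Request object does not touch the network.
# Supplying data switches the default method from GET to POST.
req = urllib.request.Request("http://example.com/search", data=data)
print(query, req.get_method())
```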
For me, the following works (Python 3):
import urllib.request
resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())
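One caveat with this approach: get_content_charset() returns None when the server does not declare a charset, and decode(None) raises a TypeError. A slightly more defensive sketch follows. The helper names decode_body and fetch_text and the UTF-8 fallback are my own assumptions, not part of urllib, and the sample HTML is invented so the demonstration runs offline.

```python
import re
import urllib.request


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes; use the fallback if no charset was declared."""
    return raw.decode(charset or fallback, errors="replace")


def fetch_text(url):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as resource:
        charset = resource.headers.get_content_charset()  # may be None
        return decode_body(resource.read(), charset)


# Offline demonstration with invented bytes instead of a live request.
sample = "<title>caf\u00e9</title>".encode("utf-8")

# Option 1: decode first, then search with a normal str pattern.
text = decode_body(sample)
title = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: skip decoding and search the raw bytes with a bytes pattern.
raw_title = re.search(rb"<title>(.*?)</title>", sample).group(1)

print(title, raw_title)
```

The bytes pattern in option 2 avoids the TypeError from the question without decoding at all, but the matches it returns are bytes, so it suits ASCII-only patterns best.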