
How to handle response encoding from urllib.request.urlopen() to avoid TypeError: can't use a string pattern on a bytes-like object

I'm trying to open a webpage using urllib.request.urlopen() and then search it with regular expressions, but that gives the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, what should I be doing? I assume I should read the encoding from the response headers, or from the encoding declared in the HTML if there is one, and then decode with that?
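To make the failure concrete, here is a minimal sketch that reproduces the error without any network access. The byte string `raw` is a made-up stand-in for what `resource.read()` would return, and UTF-8 is assumed as its encoding; neither comes from the original question.

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(...).read()
raw = b"<html><title>Example page</title></html>"

# A str pattern cannot be applied to a bytes object
try:
    re.search(r"<title>(.*?)</title>", raw)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode the bytes to str first (UTF-8 is an assumption here)
text = raw.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: keep the data as bytes and use a bytes pattern instead
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)

print(error_message)
print(title_from_text)
print(title_from_bytes)
```

Either option avoids the TypeError. Decoding is usually the more convenient one when the page is text, since the match results are then ordinary strings.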

kryptobs2000 asked Feb 13 '11 02:02



People also ask

What does urllib.request.urlopen() do?

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen() function, which can fetch URLs using a variety of different protocols.

What does urllib.request.urlopen() return?

The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.
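As a rough illustration of inspecting the Content-Type header, the sketch below builds a header object by hand so it runs offline. For HTTP responses, `resource.headers` is an `http.client.HTTPMessage`, which is a subclass of `email.message.Message`, so the same two methods apply there. The header values are invented for the example.

```python
from email.message import Message

# Hypothetical header value; a real one would come from resource.headers
html_headers = Message()
html_headers["Content-Type"] = "text/html; charset=ISO-8859-1"

content_type = html_headers.get_content_type()    # media type only
charset = html_headers.get_content_charset()      # charset, lower-cased

# A binary resource typically declares no charset at all
image_headers = Message()
image_headers["Content-Type"] = "image/png"
image_charset = image_headers.get_content_charset()

print(content_type, charset, image_charset)
```

Note that the charset can be missing, in which case `get_content_charset()` returns `None` and the caller has to choose a fallback.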

Which protocol does urllib.request use, and what is it used for?

The urllib.request module in Python is used to access and open URLs, which most often use the HTTP protocol. Its interface is simple enough for beginners to learn: the urlopen() function can fetch URLs using a variety of different protocols.

What does urllib.parse.urlencode() return?

The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in application/x-www-form-urlencoded format. It should be encoded to bytes before being used as the data parameter.
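A short sketch of that step, assuming a made-up parameter set and the placeholder address `http://example.com/search`. The request object is only constructed here, never sent.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields for illustration
params = {"q": "python 3", "page": 2}

# urlencode() returns a str, with spaces encoded as '+'
query = urllib.parse.urlencode(params)

# The data parameter must be bytes, so encode the ASCII string
data = query.encode("ascii")

# Building the Request does not perform any network access
request = urllib.request.Request("http://example.com/search", data=data)

print(query)
print(request.get_method())
```

Supplying `data` switches the request method from GET to POST, which is why `get_method()` reports POST here.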


1 Answer

For me, the following works (Python 3):

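import urllib.request

resource = urllib.request.urlopen(an_url)
# get_content_charset() returns None when the server declares no charset, so fall back to UTF-8
content = resource.read().decode(resource.headers.get_content_charset() or "utf-8")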
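One possible way to package this for reuse is sketched below. The helper names `decode_body` and `fetch_text`, the UTF-8 default, and the choice of `errors="replace"` are all assumptions for illustration, not part of the answer above. The decoding step is kept separate from the download so it can be exercised offline with invented headers and body.

```python
import re
import urllib.request
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the charset declared in headers, or a default."""
    charset = headers.get_content_charset() or default
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(charset, errors="replace")


def fetch_text(url):
    """Fetch a URL and return its body as str (performs network access)."""
    with urllib.request.urlopen(url) as resource:
        return decode_body(resource.read(), resource.headers)


# Offline check with made-up headers and body
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
text = decode_body("<title>Café</title>".encode("utf-8"), headers)

# The decoded str can now be searched with an ordinary str pattern
match = re.search(r"<title>(.*?)</title>", text)
print(match.group(1))
```

With a real page, `fetch_text(an_url)` would return a string that `re` can search directly. A server that declares a charset name Python does not recognise would still raise `LookupError`, which this sketch does not handle.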
Ivan Klass answered Sep 22 '22 06:09
