
Python TypeError on regex [duplicate]

So, I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

But then python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

asked Mar 03 '11 17:03 by kamikaze_pilot


2 Answers

TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^ Add the b there; it turns the pattern into a bytes object
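To try the fix without hitting the network, here is a minimal sketch. The `sample` bytes are a made-up stand-in for what `m.read()` returns. The pattern is written as a raw bytes literal (`rb'...'`) so that `\s` reaches the regex engine untouched, and the `|` is dropped from the first character class because inside `[...]` it matches a literal pipe rather than meaning "or". Both changes are my own assumptions about what was intended, not part of the original code.

```python
import re

# Hypothetical stand-in for the bytes returned by m.read()
sample = b'<html><a href="http://example.com/a">A</a> <a href=\'/b\' class="x">B</a></html>'

# Bytes pattern: it can be applied to bytes input and returns bytes results
linkregex = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

links = linkregex.findall(sample)
print(links)
```

Note that the results are bytes as well, so they still need a `.decode()` before being used as text.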

(PS:

>>> from disclaimer include dont_use_regexp_on_html
"Use BeautifulSoup or lxml instead."

)
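For a rough idea of what the parser route looks like, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup or lxml, only so that it runs without installing anything. The class name `LinkCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample markup; feed() wants str, so the bytes are decoded first
sample = b'<p><a class="x" href="http://example.com/a">A</a><A HREF=\'/b\'>B</A></p>'

collector = LinkCollector()
collector.feed(sample.decode("utf-8"))
collector.close()
print(collector.links)
```

Unlike the regex, this does not care about attribute order, tag case, or which quote style the page uses.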

answered Oct 14 '22 18:10 by Lennart Regebro


If you are running Python 2.6, there is no "request" submodule in "urllib", so the third line becomes:

m = urllib.urlopen(url)  
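If the same script has to run on both versions, one common idiom is to try the Python 3 import first and fall back to the Python 2 one. This is only a sketch and not something the question asked for; the actual fetch is left commented out because it needs network access.

```python
try:
    # Python 3: urlopen lives in the urllib.request submodule
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen is a top-level function of urllib
    from urllib import urlopen

# The call then looks the same on both versions (needs network access):
# m = urlopen('http://google.com')
```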

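And in Python 3, 'msg' is a bytes object and not a string, which is what findall() expects when the pattern is a string. This makes the error go away:

links = linkregex.findall(str(msg))

But str(msg) returns the repr of the bytes (a "b'...'" string with escaped newlines and non-ASCII bytes), not the decoded text, so it is better to decode using the correct encoding. For instance, if "latin1" is the encoding then: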

links = linkregex.findall(msg.decode("latin1")) 
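To make the difference between str(msg) and msg.decode(...) concrete, here is a small sketch that runs without a network connection. The `msg` value is a made-up latin1 payload standing in for `m.read()`, and the pattern is a slightly tidied version of the one in the question (raw string, no `|` in the character class), which is my assumption about the intent.

```python
import re

# String pattern, close to the one in the question (raw string so \s
# is passed to the regex engine as-is)
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

# Hypothetical stand-in for m.read(): latin1-encoded bytes with a
# non-ASCII character in the URL
msg = '<a href="/caf\xe9">caf\xe9</a>\n'.encode("latin1")

# str() on bytes gives the repr, with escapes spelled out as text
as_repr = str(msg)

# decode() gives the actual text
as_text = msg.decode("latin1")

print(linkregex.findall(as_repr))  # link contains a literal backslash escape
print(linkregex.findall(as_text))  # link contains the real character
```

In a real fetch you usually do not have to guess the encoding: `m.headers.get_content_charset()` returns the charset from the Content-Type header when the server sent one, and `None` otherwise.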
answered Oct 14 '22 17:10 by Morten Kristensen