
Using unicode (Hebrew characters) with regular expression

I wrote a script that finds expressions in a web page:

# -*- coding: utf-8 -*-
import sre, urllib2, sys, BaseHTTPServer
address = sys.argv[1]
website_handle = urllib2.urlopen(address)
website_text = website_handle.read()
matches = sre.findall(u"עברית", website_text)
for item in matches:
    print item

This script works if I use a "regular" regular expression (without Hebrew characters) and doesn't match anything if I use them. What am I doing wrong?

edit example: url = https://en.wikipedia.org/wiki/Category:Countries

Sanich asked Apr 17 '26 18:04



1 Answer

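You need to make sure the text being searched is a unicode object as well. read() returns a raw byte string, while the pattern u"עברית" is unicode, so the Hebrew characters never match the undecoded bytes.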

Use the unicode function with the page's encoding ("utf-8" here) as the second argument:

website_text = unicode(website_text, "utf-8")

In Python 2, the pattern and the searched text should both be unicode objects for the match to work.
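
As a rough illustration of the decode-then-match idea, here is a minimal sketch. The helper name find_in_page and the sample_bytes stand-in for the downloaded page are hypothetical, not part of the original script. It uses re instead of sre and assumes the page is UTF-8 encoded. Calling .decode("utf-8") on the byte string does the same job as unicode(website_text, "utf-8").

```python
# -*- coding: utf-8 -*-
import re


def find_in_page(raw_bytes, pattern, encoding="utf-8"):
    """Decode raw page bytes to text, then search the decoded text.

    Hypothetical helper for illustration; `encoding` is an assumption
    about the page and defaults to UTF-8.
    """
    text = raw_bytes.decode(encoding)
    return re.findall(pattern, text)


# Stand-in for website_handle.read(): UTF-8 bytes containing Hebrew text.
sample_bytes = u"<p>עברית and more עברית</p>".encode("utf-8")

matches = find_in_page(sample_bytes, u"עברית")
for item in matches:
    print(item)
```

If the page uses a different encoding, pass that encoding to find_in_page instead of relying on the UTF-8 default.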

Wiktor Stribiżew answered Apr 20 '26 08:04



