Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I'm correct, then the regexes are removing the öäå, right, I mean the only thing that should be left of the hex char is x after the regex, but there are no x instead of öäå on my page, so this little theory is maybe not correct? Anyway, if it's right or wrong, how do you solve this? When I later print the extracted information to my webpage i use self.response.out.write() in google app engine (don't know if that help in solving the problem)

EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8. EDIT2: You can use ISO-8859-10 for Swedish, but according to google chrome the encoding is Unicode(utf-8) on this specific site

like image 637
richie Avatar asked Apr 15 '12 19:04

richie


1 Answers

Always work in unicode and only convert to an encoded representation when necessary.

For this particular situation, you also need to use the re.U flag so \w matches unicode letters:

#coding: utf-8

import re

location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location # prints öäå
like image 72
agf Avatar answered Sep 30 '22 21:09

agf