How do I get a regular expression to recognize non-ASCII characters as letters?

Question

I'm extracting information from a webpage in Swedish. The page uses characters such as ö, ä, and å.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think that the problem is that I do a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I'm correct, the regexes are removing the öäå, right? After the regex, the only thing left of the hex character should be the x, but there is no x in place of öäå on my page, so maybe this theory is wrong. Either way, how do I solve this? When I later print the extracted information to my webpage, I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is UTF-8, and the encoding on my site is also UTF-8.

EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.

agf · Accepted Answer

Always work in Unicode, and only convert to an encoded representation when necessary.
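A minimal sketch of that principle, assuming Python 3 (where bytes and str are separate types) rather than the Python 2 used elsewhere in this answer. The byte string is a made-up stand-in for what a UTF-8 page might deliver:

```python
# Hypothetical raw bytes, as they might arrive from a UTF-8 encoded page.
raw = "Malmö, Västerås".encode("utf-8")

# Decode once at the input boundary; everything after this is text.
text = raw.decode("utf-8")

# Work on the text (here: a trivial transformation).
processed = text.upper()

# Encode again only when bytes are required, e.g. when writing a response.
output = processed.encode("utf-8")

print(processed)
```

The non-ASCII letters take two bytes each in UTF-8, so the encoded form is longer than the text; keeping the decode and encode steps at the edges means the processing in between never sees partial characters.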

For this particular situation, you also need to pass the re.U flag so that \w matches Unicode letters:

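# coding: utf-8
# Python 2: the coding declaration above lets the source contain UTF-8 literals.

import re

# Decode the UTF-8 byte string into a unicode object before processing.
location = "öäå".decode('utf-8')
# re.U makes \w match Unicode letters, so öäå survive the substitution.
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location # prints öäå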
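For comparison, here is a sketch of the same substitution under Python 3, which is an assumption on my part since the answer above targets Python 2. There, str patterns are Unicode-aware by default, so re.U is not needed, and re.ASCII can be used to reproduce the symptom from the question. The sample string is made up for illustration:

```python
import re

# Hypothetical sample input containing Swedish letters and punctuation.
location = "Malmö, Västerås!"

# In Python 3, str patterns are Unicode-aware by default, so \w matches ö, ä, å.
unicode_result = re.sub(r'[^\w]+', '', location)

# re.ASCII restricts \w to [a-zA-Z0-9_], which strips the Swedish letters.
ascii_result = re.sub(r'[^\w]+', '', location, flags=re.ASCII)

print(unicode_result)  # MalmöVästerås
print(ascii_result)    # MalmVsters
```

The second result shows what ASCII-only matching does to the text: the letters are silently dropped rather than replaced by leftover escape characters, which is consistent with the missing öäå described in the question.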

Tags:

python

regex

character-encoding

ascii

utf-8

richie
