
Beautiful Soup and Unicode Problems

I'm using BeautifulSoup to parse some web pages.

Occasionally I run into a "unicode hell" problem like the following:

Looking at the source of this article on TheAtlantic.com [ http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/ ]

We see this in the og:description meta property:

<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />

When BeautifulSoup parses it, I see this:

>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

If I try encoding it to UTF-8, like this SO comment suggests: https://stackoverflow.com/a/10996267/442650

>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'

Just when I thought I had all my unicode issues under control, I still don't quite understand what's going on, so I'm going to lay out a few questions:

1- Why would BeautifulSoup convert the &nbsp; to \xa0 [a latin charset space character]? The charset and headers on this page are UTF-8, and I thought BeautifulSoup pulls that data for the encoding. Why wasn't it replaced with a <space>?

2- Is there a common way to normalize whitespace for conversion?

3- When I encoded to UTF-8, where did \xa0 become the sequence \xc2\xa0?

I can pipe everything through unicodedata.normalize('NFKD',string) to help get me where I want to be -- but I'd love to understand what's wrong and avoid problems like this in the future.

Jonathan Vanasco asked Oct 22 '13 03:10


1 Answer

You aren't encountering a problem. Everything is behaving as intended.

&nbsp; indicates a non-breaking space character. This isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: that where that space occurs, a text rendering engine shouldn't put a line break.

The Unicode code point for non-breaking space is U+00A0, which is written in a Unicode string in Python as \xa0.
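To see this outside of BeautifulSoup, here is a minimal sketch that decodes the entity with the standard library's html.unescape. It assumes Python 3 (the question uses Python 2, where the string would print with a u prefix), and the sample text is just a fragment of the og:description value.

```python
import html

# Decode the HTML entity the same way an HTML parser would.
text = html.unescape("teaches&nbsp;Classical")

# "teaches" is 7 characters long, so index 7 is the decoded entity.
nbsp = text[7]

print(repr(text))        # 'teaches\xa0Classical'
print(hex(ord(nbsp)))    # 0xa0
print(nbsp == "\u00a0")  # True: \xa0 and \u00a0 are the same code point
print(nbsp == " ")       # False: it is not an ordinary space (U+0020)
```

The result is a decoded Unicode string, so the page's UTF-8 charset plays no part in how the character is written in the repr; \xa0 is simply how Python spells code point U+00A0.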

The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or written in a Python string representation, \xc2\xa0. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit. That means that it can be represented by the two-byte sequence (in binary) 110xxxxx 10xxxxxx where the x's are the bits of the binary representation of the code point, padded with leading zeros to 11 bits. In the case of A0, that is 10100000, or 00010 100000 when padded and split into groups of five and six bits. Encoded in UTF-8, that gives 11000010 10100000, or C2 A0.
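The bit layout can be checked by hand. The sketch below assumes Python 3; encode_two_byte_utf8 is a hypothetical helper written for illustration that only covers the two-byte range (U+0080 to U+07FF), and its result is compared with the built-in encoder.

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])


manual = encode_two_byte_utf8(0xA0)
builtin = "\u00a0".encode("utf-8")

print(format(0xA0, "08b"))                         # 10100000
print(" ".join(format(b, "08b") for b in manual))  # 11000010 10100000
print(manual.hex())                                # c2a0
print(manual == builtin)                           # True
```

So the \xc2 in the encoded output is not an extra character; it is the lead byte that carries the upper bits of the single code point U+00A0.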

Many people use &nbsp; in HTML to get spaces which aren't collapsed by the usual HTML whitespace collapsing rules (in HTML, any run of consecutive spaces, tabs, and newlines is interpreted as a single space unless one of the CSS white-space rules is applied), but that's not really what it is intended for. It is supposed to be used for things like names, such as "Mr. Miyagi", where you don't want a line break between the "Mr." and the "Miyagi". I'm not sure why it was used in this particular case; it seems out of place here, but that's a problem with your source, not with the code that interprets it.

Now, if you don't care about layout, so you don't mind whether text layout algorithms choose that spot as a place to wrap, and would like to interpret this merely as a regular space, normalizing using NFKD is a perfectly reasonable answer (or NFKC if you prefer pre-composed accents to decomposed accents). The NFKC and NFKD normalizations replace compatibility characters, which carry essentially the same semantic value as some plainer character or sequence in most contexts, with that plainer equivalent. For instance, ligatures are expanded out (ﬃ -> ffi), archaic long s characters are converted into s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and a non-breaking space is converted into a normal space. For some characters, NFKC or NFKD normalization may lose information that is important in some contexts: ℌ and ℍ will both normalize to H, but in mathematical texts they can be used to refer to different things.
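The mappings mentioned above can be checked with the standard unicodedata module. This is a minimal sketch assuming Python 3; the samples dictionary and the description fragment are made up for illustration, and the characters are written as escapes so the code points are unambiguous.

```python
import unicodedata

# Compatibility characters and the plain text NFKC is expected to give.
samples = {
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\u017f": "s",    # LATIN SMALL LETTER LONG S
    "\u2163": "IV",   # ROMAN NUMERAL FOUR
    "\u00a0": " ",    # NO-BREAK SPACE
    "\u210c": "H",    # BLACK-LETTER CAPITAL H
    "\u210d": "H",    # DOUBLE-STRUCK CAPITAL H
}

results = {ch: unicodedata.normalize("NFKC", ch) for ch in samples}
for ch, normalized in results.items():
    print("U+%04X -> %r" % (ord(ch), normalized))

# Applied to a fragment of the description from the question:
description = "The professor who teaches\xa0Classical Chinese"
cleaned = unicodedata.normalize("NFKC", description)
print(repr(cleaned))  # 'The professor who teaches Classical Chinese'
```

NFKD gives the same results for these particular characters; the two forms differ only in whether accented letters end up composed or decomposed. If the no-break space is the only thing you want to change, description.replace(u'\xa0', u' ') avoids touching anything else.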

Brian Campbell answered Sep 27 '22 20:09