How can I check a Python unicode string to see that it *actually* is proper Unicode?

Question

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently its encoding is badly broken: it decodes without complaint, but when I try to save it in Postgres I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf

The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check if this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...

mikelikespie · Accepted Answer

There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, this bug is even in OS X's iconv (but not the glibc one).

Here's what's happening:

Python 2.x does not recognize UTF-8-encoded surrogates [1] as invalid, which is what your byte sequence is: 0xED 0xBD 0xBF decodes to U+DF7F, a lone low surrogate.
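To see why those three bytes are a surrogate, here is a minimal sketch that unpacks a 3-byte UTF-8 sequence by hand. The helper name decode_three_byte_utf8 is made up for illustration, and it deliberately skips every validity check a real decoder would perform:

```python
def decode_three_byte_utf8(raw):
    """Unpack a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) by hand.

    Illustrative only: no validation of the lead or continuation bytes.
    """
    b0, b1, b2 = bytearray(raw)  # bytearray yields ints on Python 2 and 3
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte_utf8(b'\xed\xbd\xbf')
is_surrogate = 0xD800 <= code_point <= 0xDFFF
print('%#x surrogate=%s' % (code_point, is_surrogate))
```

The payload bits come out as 0xDF7F, which falls inside the range U+D800..U+DFFF that RFC 3629 excludes from UTF-8.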

This should be all that's needed:

foo.decode('utf8').encode('utf8')

But thanks to that bug, which they're not fixing in 2.x, the round trip doesn't catch surrogates.
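Since the round trip can't be trusted on 2.x, one possible workaround is to scan the decoded string for surrogate code points yourself before it reaches the database. This is only a sketch: find_surrogates and is_safe_for_utf8 are hypothetical helper names, and source stands in for the decoded page. Note that on a narrow Python 2 build, valid characters above U+FFFF are stored as surrogate pairs, so this check would flag those as well.

```python
def find_surrogates(text):
    """Return the indexes of code points in the surrogate range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_safe_for_utf8(text):
    """True if text contains no surrogate code points (hypothetical helper)."""
    return not find_surrogates(text)


source = u'ok \udf7f'  # stand-in for the decoded page; contains U+DF7F
if not is_safe_for_utf8(source):
    print('surrogates at indexes %r' % find_surrogates(source))
```

If the check fails you can decide what to do before the INSERT, for example reject the page or drop the offending characters, instead of finding out from the DatabaseError.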

Try this in Python 2.x and then in 3.x:

b'\xed\xbd\xbf'.decode('utf8')

It will throw an error (correctly) in the latter. See [2] and [3] for more info.
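The same experiment can be wrapped in a small function so the result is visible without a traceback. This is a sketch with a made-up helper name, strict_utf8_decode_fails; it only reports whether the strict decoder rejects the input:

```python
def strict_utf8_decode_fails(raw):
    """Return True if the interpreter's strict UTF-8 decoder rejects raw."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False


# Python 3 rejects the encoded surrogate; per the bug described above,
# Python 2 accepts it, so this prints False there.
print(strict_utf8_decode_fails(b'\xed\xbd\xbf'))

# A well-formed sequence (U+20AC, the euro sign) is accepted either way.
print(strict_utf8_decode_fails(b'\xe2\x82\xac'))
```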

[1] https://www.rfc-editor.org/rfc/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209

Tags: python, postgresql, unicode

Stavros Korokithakis
