So I have this page:
http://hub.iis.sinica.edu.tw/cytoHubba/
Apparently it's all kinds of messed up: it gets decoded properly, but when I try to save it in Postgres I get:
DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf
The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check if this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...
There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, the same bug is present in OS X's iconv (but not in the glibc one).
Here's what's happening:
Python 2.x does not reject UTF-8-encoded surrogates [1] as invalid, which is what your byte sequence is: 0xED 0xBD 0xBF encodes U+DF7F, a lone low surrogate.
A decode/encode round trip should be all that's needed to validate the data:
foo.decode('utf8').encode('utf8')
But thanks to that bug, which they're not fixing in 2.x, the round trip doesn't catch surrogates.
Try this in Python 2.x and then in 3.x:
b'\xed\xbd\xbf'.decode('utf8')
It will raise an error (correctly) in the latter, while 2.x decodes it silently. The fix is not being backported to the 2.x branch either. See [2] and [3] for more info.
[1] https://www.rfc-editor.org/rfc/rfc3629#section-4
[2] http://bugs.python.org/issue9133
[3] http://bugs.python.org/issue8271#msg102209
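To check for the problem before the data reaches the database, one option is to scan the raw bytes for encoded surrogates yourself. The following is a minimal sketch, not a complete UTF-8 validator: it assumes the input already passes Python's own UTF-8 decoder, and the names `has_encoded_surrogates` and `strip_encoded_surrogates` are made up for illustration. In otherwise well-formed UTF-8, a lead byte 0xED followed by a byte in 0xA0-0xBF can only be an encoded code point in U+D800..U+DFFF.

```python
import re

# A UTF-8-encoded surrogate (U+D800..U+DFFF) is always 0xED, then a byte in
# 0xA0-0xBF, then a continuation byte in 0x80-0xBF.
_ENCODED_SURROGATE = re.compile(b'\xed[\xa0-\xbf][\x80-\xbf]')


def has_encoded_surrogates(data):
    """Return True if the byte string contains a UTF-8-encoded surrogate."""
    return _ENCODED_SURROGATE.search(data) is not None


def strip_encoded_surrogates(data):
    """Return the byte string with every encoded surrogate removed."""
    return _ENCODED_SURROGATE.sub(b'', data)


page = b'abc\xed\xbd\xbfdef'  # contains the sequence from the error message
if has_encoded_surrogates(page):
    page = strip_encoded_surrogates(page)
```

Because this works on bytes rather than on decoded text, it should behave the same on 2.x and 3.x. Whether to strip the offending bytes or reject the whole page is a judgment call that depends on what you are storing.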