
fluphenazine read as \xef\xac\x82uphenazine

Tags:

python

unicode

When I write

>>> st = "Piperazine (perphenazine, ﬂuphenazine)"

>>> st

'Piperazine (perphenazine, \xef\xac\x82uphenazine)'

What is happening? Why doesn't it do this for every fl? How do I avoid it?

It looks like \xef\xac\x82 is not, in fact, fl. Is there any way to 'translate' this character into fl (as the author intended it), without just excluding it via something like

 unicode(st, errors='ignore').encode('ascii') 
user0 asked Jul 22 '15 03:07



1 Answer

This is what is called a "ligature".

In printing, the f and l were not typeset as two separate letters with the normal spacing between them; they were merged into a single glyph. Other ligatures include "th", "oe", and "st".

That's what you're getting in your input: the "fl" ligature character (U+FB02), UTF-8 encoded as the three-byte sequence \xef\xac\x82. I would take minor issue with your assertion that it's "not, in fact, fl" - it really is, but your input is UTF-8 and not ASCII :-). I'm guessing you pasted from a Word document or an ebook or something that's designed for presentation instead of data fidelity (or perhaps, from the content, it was a LaTeX-generated PDF?).
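If you want to check this for yourself, here is a small sketch (written for Python 3; the variable names are just for illustration) that decodes the bytes from the question and asks the standard unicodedata module what the first character is:

```python
import unicodedata

# The bytes shown in the question's repr, followed by the rest of the word.
raw = b"\xef\xac\x82uphenazine"

# Decode the UTF-8 bytes into a text string.
text = raw.decode("utf-8")

# The first three bytes collapse into a single character.
first = text[0]
code_point = "U+%04X" % ord(first)
name = unicodedata.name(first)

print(code_point, name)
```

So the three bytes are one character, and its official name says it is the fl ligature.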

If you want to handle this particular case, you could replace that byte sequence with the ASCII letters "fl". If you want to handle all such cases, you will have to use the Unicode Consortium's "UNIDATA" file at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . In that file, there is a column for the "decomposition" of a character. The f-l ligature has the identifier "LATIN SMALL LIGATURE FL". There is, by the way, a Python module for this data file at https://docs.python.org/2/library/unicodedata.html . You want the "decomposition" function:

>>> import unicodedata
>>> foo = u"\ufb02uphenazine"  # starts with U+FB02, the "fl" ligature
>>> unicodedata.decomposition(foo[0])
'<compat> 0066 006C'

0066 006C is, of course, ASCII 'f' and 'l'.
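Rather than parsing the decomposition string by hand, you can probably let unicodedata.normalize apply the compatibility decompositions for you. A minimal sketch, assuming the NFKC form is acceptable for your data (expand_compat is a made-up helper name, and NFKC also rewrites other compatibility characters such as full-width letters and superscripts, which you may or may not want):

```python
import unicodedata

def expand_compat(text):
    """Apply compatibility decomposition, then recompose (NFKC).

    Ligatures such as U+FB02 are replaced by their plain letters.
    """
    return unicodedata.normalize("NFKC", text)

# The string from the question, with the ligature written as an escape
# so it is visible in the source.
st = u"Piperazine (perphenazine, \ufb02uphenazine)"
fixed = expand_compat(st)
print(fixed)
```

In Python 2 you would first decode the byte string, e.g. st.decode('utf-8'), before normalizing.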

Be aware that if you're trying to downcast UTF-8 data to ASCII, you're eventually going to have a bad day. There are only 128 ASCII characters, and Unicode (which UTF-8 encodes) has room for over a million code points. There are many code points in Unicode that cannot be readily represented as ASCII in a nonconvoluted way - who wants to have some text end up saying "<TREBLE CLEF> <SNOWMAN> <AIRPLANE> <YELLOW SMILEY FACE>"?
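To make the loss concrete, here is a rough sketch of a "best effort" ASCII downcast (to_ascii_lossy is a hypothetical helper, not a library function). It works for the ligature, but anything without an ASCII-compatible decomposition simply disappears:

```python
import unicodedata

def to_ascii_lossy(text):
    # NFKD splits ligatures and accented letters into base characters
    # plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently drops whatever is still not ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The ligature expands to two plain letters.
print(to_ascii_lossy(u"\ufb02uphenazine"))

# The accent is stripped and the snowman (U+2603) vanishes entirely.
print(repr(to_ascii_lossy(u"caf\u00e9 \u2603")))
```

Whether silently dropping characters like this is acceptable depends entirely on what the text is used for afterwards.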

Borealid answered Oct 25 '22 10:10
