
Convert hexadecimal character (ligature) to utf-8 character

I have some text content that was converted from a PDF file. The text contains some unwanted characters, and I want to convert them to UTF-8 characters.

For instance, 'Artificial Immune System' comes out as 'Artiﬁcial Immune System'. The 'ﬁ' is converted to a single character. I used gdex to learn the ASCII value of the character, but I don't know how to replace it with the real value throughout the content.

Barbaros26 asked Feb 07 '12 11:02



1 Answer

I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).

In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:

>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'

So normalizing your strings with NFKD should help you along. If you find that this splits too much, then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:

>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
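
To give a rough idea of what "splits too much" can mean, here is a Python 3 sketch (the REPL sessions above use Python 2 syntax). The `split_ligatures` helper and the sample string are made up for this illustration. The table covers the seven Latin ligatures at U+FB00 to U+FB06, and mapping U+FB05 (long s + t) to a plain "st" is an assumption about what you would want.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06).
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB05: "st",  # LATIN SMALL LIGATURE LONG S T; plain "st" is an assumption
    0xFB06: "st",
}


def split_ligatures(text):
    """Replace only the ligatures listed in LIGATURES; leave all else alone."""
    return text.translate(LIGATURES)


# Hypothetical sample: one "fi" ligature plus a superscript two (U+00B2).
sample = "Arti\uFB01cial Immune System, 10\u00B2 cells"

print(split_ligatures(sample))                # Artificial Immune System, 10² cells
print(unicodedata.normalize("NFKD", sample))  # Artificial Immune System, 102 cells
```

NFKD also rewrites the superscript, so "10²" silently becomes "102". The mapping table touches only the characters you list.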

Refer to the Wikipedia article to get a list of ligatures in Unicode.
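
If you would rather check which ligatures actually occur in your text before building the table, something along these lines might help. This is only a sketch: `find_ligatures` is a hypothetical helper, and it simply looks for the word LIGATURE in each character's Unicode name, so it will also report characters such as "œ" (LATIN SMALL LIGATURE OE) that you may well want to keep.

```python
import unicodedata
from collections import Counter


def find_ligatures(text):
    """Count the characters whose Unicode name contains the word LIGATURE."""
    counts = Counter()
    for ch in text:
        # unicodedata.name() returns the default "" for unnamed characters.
        if "LIGATURE" in unicodedata.name(ch, ""):
            counts[ch] += 1
    return counts


# Hypothetical sample text containing "fi" twice and "ff" once as ligatures.
found = find_ligatures("Arti\uFB01cial e\uFB00ect, \uFB01nal")
for ch, n in found.items():
    print("U+%04X %s: %d" % (ord(ch), unicodedata.name(ch), n))
```

The code points it prints are the keys you would put into the mapping table.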

Martin Geisler answered Sep 20 '22 13:09
