I have some text that was converted from a PDF file. It contains some unwanted characters, and I want to convert them to regular UTF-8 characters.
For instance, 'Artificial Immune System' is converted to 'Artiﬁcial Immune System': the 'fi' comes out as a single character. I used gdex to find the code point of the character, but I don't know how to replace it with the real letters throughout the content.
I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better-looking) glyph. So instead of rendering "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).
In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:
>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
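The session above is from Python 2. As a rough sketch of the same call in Python 3 (where every str is already Unicode), the snippet below normalizes a sample string. The sample text is made up for illustration; it also contains an accented letter and a superscript to show that NFKD rewrites more than just ligatures:

```python
import unicodedata

# Hypothetical sample text: a ligature (U+FB01), an accented letter (U+00E9)
# and a superscript two (U+00B2).
text = "Arti\uFB01cial Immune System, caf\u00E9, x\u00B2"

normalized = unicodedata.normalize("NFKD", text)

# The ligature becomes "f" + "i", but "é" is also split into "e" followed by
# a combining acute accent (U+0301), and the superscript becomes a plain "2".
print(ascii(normalized))
```

Whether those extra changes matter depends on the text being cleaned.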
So normalizing your strings with NFKD should help you along. If you find that this splits too much (NFKD also decomposes accented letters and replaces other compatibility characters such as superscripts), then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:
>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
Refer to the Wikipedia article to get a list of ligatures in Unicode.
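To apply the mapping-table approach to a whole document, one possible sketch in Python 3 is shown below. It assumes you only care about the Latin ligatures at U+FB00 through U+FB06, and it builds the table by normalizing each of those characters individually rather than typing the replacements by hand. The names LIGATURES and split_ligatures are hypothetical, chosen for this example:

```python
import unicodedata

# Assumption: only the Latin ligatures U+FB00..U+FB06 should be split.
# Each code point is mapped to its NFKD decomposition, e.g. U+FB01 -> "fi".
LIGATURES = {
    cp: unicodedata.normalize("NFKD", chr(cp))
    for cp in range(0xFB00, 0xFB07)
}

def split_ligatures(text):
    """Replace the ligatures listed in LIGATURES; leave everything else alone."""
    return text.translate(LIGATURES)

cleaned = split_ligatures("Arti\uFB01cial caf\u00E9 o\uFB03ce")
print(cleaned)
```

Unlike normalizing the entire string, this leaves accented letters and other characters untouched. The same function can be called on the full contents of a file once it has been read into a string.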