 

How to decode the ligature for "fi" in Java (and others)

We have a system that parses PDF files and pulls out the text inside for indexing and such. One problem we have been having is that Illustrator sets words that contain "fi" to use the ligature for fi (single glyph).

For example this line...

"bench and rich vitrified ceramic tile."

Shows up like this in my Java debugger

"ete bench and rich vitri\u001Fed ceramic tile."

It appears that \u001F is the character code Adobe PDF files use for the ligature "fi". I could obviously swap out occurrences of \u001F for "fi", but does anybody know of a robust way to handle this and cases like it?

benstpierre asked Oct 09 '22 00:10


1 Answer

The byte sequence used as the operand of the 'show text' operators in PDF (TJ, Tj, etc.) should be converted to text using the encoding of the active font in the graphics state and the ToUnicode CMap associated with that font. Some fonts include a ToUnicode CMap that maps the code 0x1F (or whatever code is used for the glyph) to the characters 'f' and 'i'. Other fonts use an encoding with a /Differences array that maps the code 0x1F to the glyph name /fi. These structures must be processed in order to get correct results.
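To make the lookup order concrete, here is a minimal sketch in plain Java. It assumes the font's ToUnicode CMap and /Differences array have already been parsed into maps; the class name, method names, and the tiny glyph-name table are all hypothetical and only illustrate the idea. In practice a PDF library would do this parsing for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from a real PDF library.
class LigatureDecodeSketch {

    // Small, illustrative glyph-name -> text table (a real one would be far larger).
    static Map<String, String> glyphToText() {
        Map<String, String> m = new HashMap<>();
        m.put("fi", "fi");
        m.put("fl", "fl");
        m.put("ff", "ff");
        m.put("ffi", "ffi");
        m.put("ffl", "ffl");
        return m;
    }

    // Decodes single-byte character codes from a 'show text' operand.
    // toUnicode:   code -> text, as parsed from the font's ToUnicode CMap (may be empty)
    // differences: code -> glyph name, as parsed from the /Differences array (may be empty)
    static String decode(byte[] codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> differences) {
        Map<String, String> glyphs = glyphToText();
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;                  // treat the byte as unsigned
            String text = toUnicode.get(code);    // 1. prefer the ToUnicode mapping
            if (text == null) {
                String name = differences.get(code);
                if (name != null) {
                    text = glyphs.get(name);      // 2. fall back to the glyph name
                }
            }
            if (text == null) {
                // 3. Assumption for this sketch: pass the code through unchanged.
                //    A real decoder would consult the font's base encoding here.
                text = String.valueOf((char) code);
            }
            sb.append(text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The code 0x1F stands in for the "fi" glyph, as in the question.
        byte[] shown = "vitri\u001Fed".getBytes(StandardCharsets.ISO_8859_1);

        Map<Integer, String> differences = new HashMap<>();
        differences.put(0x1F, "fi");              // as if parsed from /Differences [31 /fi]

        System.out.println(decode(shown, new HashMap<>(), differences)); // prints "vitrified"
    }
}
```

Because the mapping belongs to the font, the same code can mean a different glyph in another font or another file, which is why a blanket replacement of \u001F is fragile.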

iPDFdev answered Oct 15 '22 10:10