How to decide the Ligature for "FI" in Java (and others)

Question

We have a system that parses PDF files and pulls out the text inside for indexing and such. One problem we have been having is that Illustrator sets words that contain "fi" to use the ligature for fi (single glyph).

For example this line...

"bench and rich vitrified ceramic tile."

Shows up like this in my Java debugger

"ete bench and rich vitri\u001Fed ceramic tile."

It appears that \u001F is the character code Adobe PDF files use for the ligature "fi". I could obviously swap out occurrences of \u001F for "fi", but does anybody know of a robust way to handle this and cases like it?

iPDFdev · Accepted Answer

The byte sequence used as the operand of the 'show text' operators in PDF (TJ, Tj, etc.) must be converted to text using the encoding of the active font in the graphics state and the ToUnicode CMap associated with that font. Some fonts include a ToUnicode CMap that maps the code 0x001F (or whatever code is used for the glyph) to the characters 'f' and 'i'. Other fonts use an encoding with a /Differences array that maps the code 0x1F to the glyph name /fi. In other words, 0x1F is not a fixed "fi" code; its meaning depends on the font, so these structures must be processed in order to get correct results.
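As a rough illustration of the /Differences case, here is a minimal Java sketch. It assumes the code-to-glyph-name pairs have already been read from the font's encoding by whatever PDF library you use; the class `LigatureDecoder`, its constructor argument, and the small glyph-name table are hypothetical names made up for this example, not part of any PDF library. The glyph names are mapped to the Unicode ligature code points (U+FB00 to U+FB04), and `java.text.Normalizer` with NFKC then expands them to plain letters. Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms, etc.), which may or may not be what you want for indexing.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class LigatureDecoder {
    // Assumed subset of glyph-name -> Unicode mappings for the common Latin ligatures.
    // A real implementation would use a complete glyph list.
    private static final Map<String, String> GLYPH_TO_UNICODE = Map.of(
            "ff", "\uFB00",
            "fi", "\uFB01",
            "fl", "\uFB02",
            "ffi", "\uFB03",
            "ffl", "\uFB04");

    // Hypothetical stand-in for one font's /Differences array: character code -> glyph name.
    private final Map<Integer, String> differences;

    LigatureDecoder(Map<Integer, String> differences) {
        this.differences = new HashMap<>(differences);
    }

    // Replaces each code listed in the differences map with its Unicode ligature,
    // then applies NFKC so the ligature becomes ordinary letters ("fi", "fl", ...).
    String decode(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char code = raw.charAt(i);
            String glyphName = differences.get((int) code);
            String mapped = (glyphName == null) ? null : GLYPH_TO_UNICODE.get(glyphName);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.append(code); // unknown code or glyph name: keep the character as-is
            }
        }
        return Normalizer.normalize(sb, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumes this particular font maps code 0x1F to the glyph /fi.
        LigatureDecoder decoder = new LigatureDecoder(Map.of(0x1F, "fi"));
        System.out.println(decoder.decode("vitri\u001Fed ceramic tile."));
    }
}
```

The mapping has to be built per font, since another font in the same file may assign 0x1F to a different glyph. When a font carries a ToUnicode CMap instead, the same lookup idea applies, except the map goes directly from the code to the Unicode string.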

Tags: java, character-encoding, pdf, ligature

Asked by benstpierre
