Detect missing / corrupt Unicode mapping in PDF

Question

While extracting text from some PDFs, PDFBox returns gibberish. The cause is a missing or corrupt Unicode mapping, and I can see the warnings below on the console. I want to detect this condition in code so that I can flag these PDFs as corrupt.

I'm looking for a solution that is better than parsing logs.

Thanks for your help!

Sample Console Logs:

WARNING: No Unicode mapping for CID+32 (32) in font F6
WARNING: Failed to find a character mapping for 32 in TimesNewRoman,Bold

The post below discusses the same issue but does not cover how to detect and handle it in code: Issue with reading some unicode characters out of a PDF using PDFBox

Tilman Hausherr · Accepted Answer

A fourth possibility (in addition to the three given in Aaron Digulla's answer) is to override showGlyph() in a class that extends PDFTextStripper:

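@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
{
    super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
    // unicode is null or empty when no Unicode value is available for this character code
    if (unicode == null || unicode.isEmpty())
    {
        // flag the document here, e.g. count the glyph or record font.getName() and code
    }
}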
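
One way to fill in the "do stuff" part is to tally the unmapped glyphs and decide afterwards whether the document should be flagged. The following is only a minimal sketch: MissingMappingTally and all of its methods are hypothetical names that are not part of PDFBox, and the idea of flagging by ratio is an assumption rather than anything the answer prescribes. The class uses only the Java standard library, so it can be tried without a PDF.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of PDFBox): counts glyphs that arrive without a Unicode value. */
final class MissingMappingTally {
    // Number of unmapped glyphs per font name, sorted by name for stable reporting.
    private final Map<String, Integer> missingByFont = new TreeMap<>();
    private int total;
    private int missing;

    /** Call once per glyph with the font name and the unicode value the callback received. */
    void record(String fontName, String unicode) {
        total++;
        if (unicode == null || unicode.isEmpty()) {
            missing++;
            String key = (fontName == null) ? "(unnamed font)" : fontName;
            missingByFont.merge(key, 1, Integer::sum);
        }
    }

    int totalCount() {
        return total;
    }

    int missingCount() {
        return missing;
    }

    /** Share of glyphs without a mapping; 0.0 when nothing has been recorded yet. */
    double missingRatio() {
        return total == 0 ? 0.0 : (double) missing / total;
    }

    /** True when the share of unmapped glyphs is strictly above the caller's threshold. */
    boolean exceeds(double maxRatio) {
        return missingRatio() > maxRatio;
    }

    /** Read-only view of the per-font counts. */
    Map<String, Integer> missingByFont() {
        return Collections.unmodifiableMap(missingByFont);
    }
}
```

Assuming the showGlyph() signature shown in the answer, the override could keep one such object in a field, call tally.record(font.getName(), unicode) for every glyph, and check tally.exceeds(...) once getText() has returned. The threshold is a judgment call: a strict caller might flag a document on the first unmapped glyph, while a lenient one might tolerate a small ratio.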

Tags: java, pdf, unicode, fonts, pdfbox