
Detect missing / corrupt Unicode mapping in PDF

While extracting text from some PDFs, PDFBox returns gibberish because of a missing or corrupt Unicode mapping. I can see the following warnings on the console. I want to detect this condition in code so that I can flag these PDFs as corrupt.

I'm looking for a solution that is better than parsing logs.

Thanks for your help!

Sample Console Logs:

WARNING: No Unicode mapping for CID+32 (32) in font F6
WARNING: Failed to find a character mapping for 32 in TimesNewRoman,Bold

The post below discusses the same issue but does not explain how to detect and handle it in code: Issue with reading some unicode characters out of a PDF using PDFBox

Magpies3 asked Mar 05 '23 08:03



1 Answer

A fourth possibility (in addition to the three given in Aaron Digulla's answer) is to extend the PDFTextStripper class and override showGlyph():

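@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
{
    // Let PDFTextStripper process the glyph as usual.
    super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
    // A null or empty string here means no Unicode mapping was found for this character code.
    if (unicode == null || unicode.isEmpty())
    {
        // do stuff, e.g. record the font and code so the PDF can be flagged
    }
}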
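One way to fill in the "do stuff" part is to count the unmapped glyphs and decide afterwards whether the PDF should be flagged. The helper below is only a sketch: MissingUnicodeTracker, its methods, and the threshold are hypothetical names that are not part of PDFBox. It assumes, as the answer does, that a missing mapping shows up as a null or empty unicode string.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of PDFBox): tallies glyphs that had no Unicode mapping.
final class MissingUnicodeTracker
{
    private int totalGlyphs;
    private int missingGlyphs;
    // Font name -> number of unmapped glyphs seen in that font.
    private final Map<String, Integer> missingByFont = new LinkedHashMap<>();

    // Call once per glyph with the font name and the unicode string for that glyph.
    void record(String fontName, String unicode)
    {
        totalGlyphs++;
        if (unicode == null || unicode.isEmpty())
        {
            missingGlyphs++;
            String key = fontName == null ? "(unnamed)" : fontName;
            missingByFont.merge(key, 1, Integer::sum);
        }
    }

    int getMissingCount()
    {
        return missingGlyphs;
    }

    Map<String, Integer> getMissingByFont()
    {
        return Collections.unmodifiableMap(missingByFont);
    }

    // Fraction of recorded glyphs without a mapping; 0.0 if nothing was recorded.
    double missingRatio()
    {
        return totalGlyphs == 0 ? 0.0 : (double) missingGlyphs / totalGlyphs;
    }

    // True when the unmapped fraction is strictly above the caller-chosen threshold.
    boolean isLikelyCorrupt(double threshold)
    {
        return missingRatio() > threshold;
    }
}
```

Inside the overridden showGlyph(), the call would be something like tracker.record(font.getName(), unicode). After getText() returns, isLikelyCorrupt() with a threshold of your choosing decides whether to flag the document. A ratio is used rather than a simple "any missing glyph" check because a few unmapped glyphs may be harmless; the right threshold depends on your documents.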
Tilman Hausherr answered Mar 13 '23 06:03
