When extracting text from some PDFs, PDFBox returns gibberish because of a missing or corrupt Unicode mapping. I see the following warnings on the console. I want to detect this in code so that I can flag these PDFs as corrupt.
I'm looking for a solution that is better than parsing logs.
Thanks for your help!
Sample Console Logs:
WARNING: No Unicode mapping for CID+32 (32) in font F6
WARNING: Failed to find a character mapping for 32 in TimesNewRoman,Bold
The post below discusses the same issue, but it does not cover how to detect the problem in code and handle it: Issue with reading some unicode characters out of a PDF using PDFBox
A fourth possibility (next to the three given in Aaron Digulla's answer) is to override showGlyph() when extending the PDFTextStripper class:
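@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
{
    super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
    // unicode is null or empty when the font provides no Unicode mapping for this code
    if (unicode == null || unicode.isEmpty())
    {
        // do stuff, e.g. record the font and code so the PDF can be flagged
    }
}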
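One way to fill in the "do stuff" part is to count the unmapped glyphs and decide after extraction whether the document should be flagged. The following is only a minimal sketch: UnmappedGlyphTracker and its methods are hypothetical names that are not part of PDFBox, and the threshold is an assumption you would have to tune for your documents. It uses only the JDK, so it can be tried without PDFBox on the classpath.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of PDFBox): counts glyphs that have no Unicode mapping.
final class UnmappedGlyphTracker {
    private final Map<String, Integer> unmappedByFont = new LinkedHashMap<>();
    private int total;
    private int unmapped;

    // Call once per glyph, e.g. from the overridden showGlyph().
    void record(String fontName, String unicode) {
        total++;
        if (unicode == null || unicode.isEmpty()) {
            unmapped++;
            // String.valueOf turns a null font name into the text "null"
            unmappedByFont.merge(String.valueOf(fontName), 1, Integer::sum);
        }
    }

    int unmappedCount() {
        return unmapped;
    }

    // Fraction of recorded glyphs without a mapping; 0.0 if nothing was recorded.
    double unmappedRatio() {
        return total == 0 ? 0.0 : (double) unmapped / total;
    }

    // maxRatio is an assumed, application-specific threshold.
    boolean isLikelyCorrupt(double maxRatio) {
        return unmappedRatio() > maxRatio;
    }

    // Read-only view of the unmapped glyph count per font name.
    Map<String, Integer> unmappedByFont() {
        return Collections.unmodifiableMap(unmappedByFont);
    }
}
```

Inside showGlyph() you would then call something like tracker.record(font.getName(), unicode), and after getText() returns, check tracker.isLikelyCorrupt(...) to decide whether to flag the PDF. Using a ratio rather than a single missing glyph is a design choice: a few unmapped codes may be harmless, while a high share usually means the extracted text is unusable.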