I've being researching on how to extract images from a big (> 300MB) PDF file. I'm using pdfbox but for some particular reason that I can't figure out, some pages are not correctly extracted.
I'm using the PDFToImage class of pdfbox as base for my code.
So, do you know another library that may help me to do this? I know that iText may be used, but I read that it can't be used for commercial products.
I've installed the packages xpdf and xpdf-utils, and the utility called pdfimages is working perfect. But I need to solve this problem from Java and it should be portable.
Use Adobe Acrobat Professional. To extract information from a PDF in Acrobat DC, choose Tools > Export PDF and select an option. To extract text, export the PDF to a Word format or rich text format, and choose from several advanced options that include: Retain Flowing Text.
Is PDFBox thread safe? No! Only one thread may access a single document at a time. You can have multiple threads each accessing their own PDDocument object.
I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage
will output an image for every page, while pdfimages extracts all embedded images (e.g. a text document has 0 images).
Take a look at org.apache.pdfbox.tools.ExtractImages
(source code) to see if it does what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With