I have already extracted the text from a PDF, but now I want to extract the images as well. The first problem is that the images sit between blocks of text on each page. What I want to know is how to extract the images in order, even when a page has a two-column layout, and how to determine where each image is placed relative to the text.
Here is the code I have tried.
Image Extraction:
ExtractImages.java:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";

public void extractImages(String filename)
        throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filename);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    MyImageRenderListener listener = new MyImageRenderListener(RESULT);
    // Page numbers are 1-based; the listener is called once per image on each page.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        parser.processContent(i, listener);
    }
    reader.close();
}
MyImageRenderListener:
public MyImageRenderListener(String path) {
    this.path = path;
}

public void renderImage(ImageRenderInfo renderInfo) {
    try {
        PdfImageObject image = renderInfo.getImage();
        if (image == null) return;
        // Name the file after the image's object number and its detected file type.
        String filename = String.format(path, renderInfo.getRef().getNumber(), image.getFileType());
        // try-with-resources closes the stream even if write() throws.
        try (FileOutputStream os = new FileOutputStream(filename)) {
            os.write(image.getImageAsBytes());
        }
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
The code processes the contents of the PDF, checks for images, and then writes each image to an image file (.png, .jpg, etc.).
The problem is that it does not extract the images in order. I want the images in order so that I know which image comes first on a page and which comes last. How do I do that? Also, is it possible to extract the images without writing them to a file? My goal is to display each image in my Android application without turning it into a file. If that is not possible, I will stick to deleting the images when the user is done with them.
My purpose is to EXTRACT (NOT VIEW) text and images from a PDF file and display them in order in an Android application.
High level approach:
This is something I have been researching at iText, and it is certainly not a trivial task.
The easiest solution is of course to have a tagged PDF document. Tagged documents contain information about which visual elements belong together and in what way. Or, to put it simply, you don't have to concern yourself with building up lines and paragraphs; that has already been done and marked.
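For documents that are not tagged, one possible sketch is to record a position for every text chunk and image while parsing, and then sort the collected items into reading order yourself. The code below is only an illustration of that sorting step and does not call iText at all: PageItem, ReadingOrder and columnSplitX are hypothetical names, and it assumes a simple two-column page where a single x coordinate separates the columns. In iText 5 the coordinates could probably be taken from ImageRenderInfo.getImageCTM() for images and from the baseline of TextRenderInfo for text, but that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record of one extracted element (a text chunk or an image) and where it sits. */
final class PageItem {
    final int page;     // 1-based page number
    final float x;      // left edge in PDF user space
    final float y;      // vertical position in PDF user space (origin is bottom-left, so larger y is higher)
    final String label; // e.g. a text snippet or an image identifier

    PageItem(int page, float x, float y, String label) {
        this.page = page;
        this.x = x;
        this.y = y;
        this.label = label;
    }
}

final class ReadingOrder {
    private ReadingOrder() {}

    /**
     * Returns a sorted copy: by page, then column (left before right), then top to bottom.
     * columnSplitX is an assumed x coordinate that separates the two columns.
     */
    static List<PageItem> sort(List<PageItem> items, float columnSplitX) {
        List<PageItem> sorted = new ArrayList<>(items);
        sorted.sort(Comparator
                .comparingInt((PageItem item) -> item.page)
                .thenComparingInt(item -> item.x < columnSplitX ? 0 : 1)
                .thenComparingDouble(item -> -item.y)); // higher y comes first
        return sorted;
    }

    public static void main(String[] args) {
        List<PageItem> items = new ArrayList<>();
        items.add(new PageItem(1, 320f, 700f, "image A"));
        items.add(new PageItem(1, 50f, 720f, "text 1"));
        items.add(new PageItem(1, 50f, 400f, "image B"));
        items.add(new PageItem(1, 320f, 300f, "text 2"));
        for (PageItem item : sort(items, 300f)) {
            System.out.println(item.label);
        }
    }
}
```

With the sample data in main, the items come out as text 1, image B, image A, text 2: the left column is read top to bottom before the right column. A fixed split coordinate is a deliberate simplification; pages with full-width figures or uneven columns would need a smarter column detection step.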