
How to Extract Images and Text in Order from PDF file using iText on Android

I can already extract text from a PDF, but now I also want to extract the images. The first problem is that the images sit between blocks of text on each page. How can I extract the images in order, even when a page has a two-column layout, and how can I determine where each image is placed relative to the text?

Here is the code I have tried.

Image Extraction:

ExtractImages.java:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
public void extractImages(String filename)
    throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filename);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    MyImageRenderListener listener = new MyImageRenderListener(RESULT);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        parser.processContent(i, listener);
    }
}

MyImageRenderListener:
public MyImageRenderListener(String path) {
    this.path = path;
}

public void renderImage(ImageRenderInfo renderInfo) {
    try {
        PdfImageObject image = renderInfo.getImage();
        if (image == null) return;
        // Name the file after the image's object number and its file type.
        String filename = String.format(path, renderInfo.getRef().getNumber(), image.getFileType());
        // try-with-resources closes the stream even if write() throws.
        try (FileOutputStream os = new FileOutputStream(filename)) {
            os.write(image.getImageAsBytes());
        }
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}

The code processes the contents of the PDF, checks for images, and then writes each image to an image file (.png, .jpg, etc.).

The problem is that it does not extract the images in order. I want them in order so that I know which image comes first on a page and which comes last. How do I do that? Also, is it possible to extract the images without writing them to a file? My goal is to display each image in my Android application without turning it into a file. If that is not possible, I will stick to deleting the image files when the user is done with them.

My purpose is to EXTRACT (NOT VIEW) text and images from a PDF file and display them in order in an Android application.

Christian Eric Paran asked Nov 25 '12 01:11



1 Answer

High level approach:

  1. extract all text from the document, without caring about reading-order
  2. determine language of the text based on a distribution of characters, bigrams and trigrams
  3. once the language is known, you know whether to use LTR (left to right) or RTL reading order
  4. using information such as the bounding box of each character, the language, and the font, heuristically build lines of text (a good initial metric might be "join two characters if they are roughly on the same y-position and the gap between their x-positions falls within the average + std_dev range")
  5. once you have built lines, build paragraphs (similar heuristics as before)
  6. now that you have paragraphs and the language of the text, you can print out the paragraphs in the correct order
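
The line-building heuristic from step 4 could look roughly like the sketch below. It is plain Java with no iText calls and rests on several assumptions: `LineBuilder`, `Glyph`, `buildLines` and the `yTolerance` parameter are hypothetical names made up for this illustration, the text is assumed to be LTR, and the average and standard deviation are taken over all horizontal gaps on the page. In a real program the glyph positions would come from whatever your extraction step reports.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: none of these names are part of iText.
final class LineBuilder {

    /** Hypothetical stand-in for one extracted character and its position. */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the bounding box
        final double y;      // baseline; PDF y grows upward
        final double width;  // width of the bounding box

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Returns line fragments, rows top to bottom, each row left to right. */
    static List<String> buildLines(List<Glyph> glyphs, double yTolerance) {
        // 1. Cluster glyphs into rows of roughly equal y (highest y first).
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());
        List<List<Glyph>> rows = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last == null || last.get(0).y - g.y > yTolerance) {
                last = new ArrayList<>();
                rows.add(last);
            }
            last.add(g);
        }

        // 2. Sort each row by x and collect every horizontal gap.
        List<Double> gaps = new ArrayList<>();
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            for (int i = 1; i < row.size(); i++) {
                gaps.add(gap(row.get(i - 1), row.get(i)));
            }
        }
        double threshold = meanPlusStdDev(gaps);

        // 3. Join neighbours whose gap is within the threshold;
        //    a larger gap (e.g. a column gutter) starts a new fragment.
        List<String> lines = new ArrayList<>();
        for (List<Glyph> row : rows) {
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0 && gap(row.get(i - 1), row.get(i)) > threshold) {
                    lines.add(current.toString());
                    current = new StringBuilder();
                }
                current.append(row.get(i).ch);
            }
            lines.add(current.toString());
        }
        return lines;
    }

    /** Horizontal distance between the right edge of a and the left edge of b. */
    private static double gap(Glyph a, Glyph b) {
        return b.x - (a.x + a.width);
    }

    /** Average plus (population) standard deviation; infinite when there are no gaps. */
    private static double meanPlusStdDev(List<Double> values) {
        if (values.isEmpty()) return Double.POSITIVE_INFINITY;
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.size();
        double variance = 0;
        for (double v : values) variance += (v - mean) * (v - mean);
        variance /= values.size();
        return mean + Math.sqrt(variance);
    }
}
```

On a two-column page, a row whose glyphs are separated by a wide gutter should come out as two fragments, one per column. Ordering those fragments column by column and merging them into paragraphs (steps 5 and 6) is deliberately left out of this sketch.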

This is something I have been researching at iText, and it is certainly not a trivial task.

The easiest solution is of course to have a tagged PDF document. Tagged documents contain information about which visual elements belong together and how. Put simply, you don't have to build up lines and paragraphs yourself; that has already been done and marked.

Joris Schellekens answered Oct 30 '22 08:10
