PDF text extraction using iText

Tags:

itext

We are doing research in information extraction, and we would like to use iText.

We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.

rogelie asked Jan 11 '12 14:01


2 Answers

As Theodore said, you can extract text from a PDF and, as Chris pointed out, only

as long as it is actually text (not outlines or bitmaps)

The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.

You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279

You can parse the PDF to create a plain text file. Here is a code example:

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

Note the license:

This example only works with the AGPL version of iText.

The other examples on the site show how to leave out parts of the text or how to extract only parts of the PDF.
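Since the question asks for the text line by line, here is a minimal sketch of one way to get there. It assumes that the string returned by getResultantText() contains a newline wherever the strategy detected a line break; how closely that matches the visual lines depends on the PDF. The class and method names (PageLinesSketch, splitIntoLines) are made up for this illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: works on plain strings only.
final class PageLinesSketch {

    /** Splits the extracted text of one page into trimmed, non-empty lines. */
    static List<String> splitIntoLines(String pageText) {
        List<String> lines = new ArrayList<>();
        // Accept both Unix and Windows line endings.
        for (String line : pageText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }
}
```

Inside the page loop of parsePdf you could then call PageLinesSketch.splitIntoLines(strategy.getResultantText()) and handle each line separately instead of printing the whole page at once.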

Hope it helps.

ruud van reede answered Nov 06 '22 01:11


iText allows you to do that, but there is no guarantee about the granularity of the text blocks. That depends on the software that was used to produce your PDF documents.

It's quite possible that each word, or even each letter, has its own text block. The blocks do not need to appear in reading order either, so for reliable results you may have to reorder them based on their coordinates. You may also have to work out whether a space needs to be inserted between two adjacent text blocks.
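The reordering and space insertion described above can be sketched in plain Java. This is only an illustration under stated assumptions: the Chunk class, the two threshold constants and the toLines method are hypothetical and not part of iText, and the coordinates are assumed to be in PDF user space, where y grows upward. iText's parser package also contains a LocationTextExtractionStrategy that orders text by position; the sketch does not use it and only shows the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, independent of iText.
final class ChunkReorderSketch {

    /** A text block with the position of its left edge, its baseline and its width. */
    static final class Chunk {
        final String text;
        final float x;      // left edge
        final float y;      // baseline (larger y is higher on the page)
        final float width;

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Assumed thresholds; real documents may need values tuned to the font size.
    static final float LINE_TOLERANCE = 2f; // max baseline difference within one line
    static final float SPACE_GAP = 1f;      // horizontal gap that counts as a space

    /** Groups chunks into lines (top to bottom) and joins each line left to right. */
    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> Float.compare(b.y, a.y)); // top of the page first

        // Chunks whose baseline is close to the first chunk of a line join that line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= LINE_TOLERANCE) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            StringBuilder sb = new StringBuilder();
            Chunk prev = null;
            for (Chunk c : line) {
                // Insert a space only when there is a visible gap after the previous chunk.
                if (prev != null && c.x - (prev.x + prev.width) > SPACE_GAP) {
                    sb.append(' ');
                }
                sb.append(c.text);
                prev = c;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("World", 40f, 700f, 25f),
                new Chunk("Hel", 10f, 700.5f, 12f),
                new Chunk("lo", 22f, 700f, 8f),
                new Chunk("Second", 10f, 680f, 30f));
        for (String line : toLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

Grouping by baseline before sorting by x keeps small vertical jitter within a line from scrambling the left-to-right order. A fixed gap threshold is the simplest choice; comparing the gap to the width of a space in the current font would likely be more robust.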

Theodore Bundie answered Nov 06 '22 00:11