PDF text extraction using iText

Question

We are doing research in information extraction, and we would like to use iText.

We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.

ruud van reede · Accepted Answer

As Theodore said, you can extract text from a PDF, but, as Chris pointed out, only

as long as it is actually text (not outlines or bitmaps)

The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.

You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279

You can parse the PDF to create a plain text file. Here is a code example:

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

Note the license:

This example only works with the AGPL version of iText.

The other examples show how to leave out parts of the text or how to extract only parts of the PDF.

Hope it helps.

Theodore Bundie · Answer

iText allows you to do that, but there is no guarantee about the granularity of the text blocks; that depends on the software that produced your documents.

It is quite possible that each word, or even each letter, has its own text block. The blocks need not be in reading order either; for reliable results you may have to reorder them based on their coordinates. You may also have to work out whether to insert spaces between text blocks.
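
To make the reordering idea concrete, here is a minimal sketch in plain Java, with no iText dependency. The Block class, the toLines method and the two thresholds are hypothetical names invented for this illustration; in practice the coordinates would come from whatever your extraction strategy reports for each block. iText 5 also ships a LocationTextExtractionStrategy that orders text by position, which may already be enough for line-by-line output; the code below is not that class, just an illustration of the principle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineRebuilder {

    /** Hypothetical text block: text plus left edge, right edge and baseline (y grows upward, as in PDF). */
    public static final class Block {
        final String text;
        final float left, right, baseline;

        public Block(String text, float left, float right, float baseline) {
            this.text = text;
            this.left = left;
            this.right = right;
            this.baseline = baseline;
        }
    }

    /**
     * Groups blocks into lines and joins them left to right.
     * lineTolerance: maximum baseline difference for two blocks to share a line (assumed threshold).
     * spaceGap: horizontal gap above which a space is inserted (assumed threshold).
     */
    public static List<String> toLines(List<Block> blocks, float lineTolerance, float spaceGap) {
        List<Block> sorted = new ArrayList<>(blocks);
        // Top of the page first: a larger baseline value means higher on the page.
        sorted.sort((p, q) -> Float.compare(q.baseline, p.baseline));

        // Start a new line whenever the baseline moves away from the line's first block.
        List<List<Block>> lines = new ArrayList<>();
        for (Block block : sorted) {
            List<Block> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null
                    || Math.abs(current.get(0).baseline - block.baseline) > lineTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(block);
        }

        List<String> result = new ArrayList<>();
        for (List<Block> line : lines) {
            // Within a line, order blocks by their left edge.
            line.sort((p, q) -> Float.compare(p.left, q.left));
            StringBuilder sb = new StringBuilder();
            Block previous = null;
            for (Block block : line) {
                // Insert a space only when the gap to the previous block is wide enough.
                if (previous != null && block.left - previous.right > spaceGap) {
                    sb.append(' ');
                }
                sb.append(block.text);
                previous = block;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Blocks deliberately out of order, with one word split into two blocks.
        List<Block> blocks = Arrays.asList(
                new Block("world", 60f, 90f, 700f),
                new Block("Hel", 10f, 25f, 700.2f),
                new Block("Second", 10f, 50f, 680f),
                new Block("lo", 25.1f, 35f, 700f));
        for (String line : toLines(blocks, 1f, 2f)) {
            System.out.println(line);
        }
    }
}
```

Both thresholds are in the same units as the coordinates and usually need tuning per document; a fixed space gap, for example, will misjudge text set in very large or very small fonts.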

Tags:

itext

rogelie
