We are doing research in information extraction, and we would like to use iText.
We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.
As Theodore said, you can extract text from a PDF, and as Chris pointed out, this works as long as it is actually text (not outlines or bitmaps).
The best thing to do is to buy Bruno Lowagie's book "iText in Action". In the second edition, chapter 15 covers extracting text.
You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279
For instance, you can parse a PDF to create a plain text file. Here is a code example:
/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */
package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        // Pages are numbered starting at 1.
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}
Note the license: this example only works with the AGPL version of iText.
If you look at the other examples, they show how to leave out parts of the text or how to extract only parts of the PDF.
Hope it helps.
iText allows you to do that, but there is no guarantee about the granularity of the text blocks. That depends on the software that produced your documents.
It is quite possible that each word, or even each letter, has its own text block. The blocks do not need to appear in reading order either, so for reliable results you may have to reorder them based on their coordinates. You may also have to work out whether a space needs to be inserted between adjacent text blocks.
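To make the reordering idea concrete, here is a minimal sketch in plain Java with no iText dependency. It assumes you have already collected each text block's position and width yourself. The Chunk class, the toLines method and both threshold parameters are hypothetical names made up for this illustration, not part of iText, and the sample values are invented. It groups blocks into lines by baseline, orders each line left to right, and inserts a space wherever the horizontal gap is larger than a threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class LineAssembler {

    /** Hypothetical text block: left edge x, baseline y, width, and its text. */
    public static final class Chunk {
        final float x;
        final float y;
        final float width;
        final String text;

        public Chunk(float x, float y, float width, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Rebuilds lines of text from unordered chunks.
     * @param yTolerance max baseline difference for two chunks to share a line
     * @param spaceGap   horizontal gap above which a space is inserted
     */
    public static List<String> toLines(List<Chunk> chunks, float yTolerance, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so the top of the page comes first.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        // Group chunks whose baseline is close to the first chunk of the current line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= yTolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            // Within a line, order chunks left to right.
            line.sort((a, b) -> Float.compare(a.x, b.x));
            StringBuilder sb = new StringBuilder();
            Chunk prev = null;
            for (Chunk c : line) {
                // Insert a space only when there is a visible gap after the previous chunk.
                if (prev != null && c.x - (prev.x + prev.width) > spaceGap) {
                    sb.append(' ');
                }
                sb.append(c.text);
                prev = c;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: chunks arrive out of order and split mid-word.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(50f, 680f, 30f, "Second"));
        chunks.add(new Chunk(72f, 700f, 20f, "World"));
        chunks.add(new Chunk(50f, 700.5f, 10f, "Hel"));
        chunks.add(new Chunk(60f, 700f, 8f, "lo"));
        chunks.add(new Chunk(84f, 680f, 20f, "line"));
        for (String line : toLines(chunks, 2f, 1f)) {
            System.out.println(line);
        }
    }
}
```

The two thresholds are the fragile part. Reasonable values depend on font size and on how the PDF was produced, so expect to tune them per document set. The sketch also assumes horizontal, left-to-right text; rotated or multi-column layouts would need extra handling.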