Apache Tika and character limit when parsing documents

Could anybody please help me sort this out?

When you use the Tika facade, the character limit can be raised like this:

   Tika tika = new Tika();
   tika.setMaxStringLength(10*1024*1024);

But if you don't use the Tika facade and call the parser yourself, like this:

ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();

ParseContext ps = new ParseContext();
for (InputStream is : getInputStreams()) {
    parser.parse(is, textHandler, metadata, ps);
    is.close();
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
}

In this case there seems to be no way to set the limit, because you never interact with the WriteOutContentHandler directly. By the way, its writeLimit is set to -1 by default, which means no limit, yet the resulting text is still restricted to 100000 characters.

/**
 * The maximum number of characters to write to the character stream.
 * Set to -1 for no limit.
 */
private final int writeLimit;

/**
 * Number of characters written so far.
 */
private int writeCount = 0;

private WriteOutContentHandler(Writer writer, int writeLimit) {
    this.writer = writer;
    this.writeLimit = writeLimit;
}

/**
 * Creates a content handler that writes character events to
 * the given writer.
 *
 * @param writer writer
 */
public WriteOutContentHandler(Writer writer) {
    this(writer, -1);
}

asked by lisak, May 26 '11 20:05


1 Answer

You must have overlooked that BodyContentHandler has a constructor that takes a write limit:

// The argument is the write limit in characters; pass -1 to disable the limit.
ContentHandler textHandler = new BodyContentHandler(10 * 1024 * 1024);
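
To make the effect of the write limit concrete, here is a minimal stand-alone sketch of the counting logic that the writeLimit and writeCount fields in the question suggest. It is not Tika's code: the class LimitedTextSink and its capture helper are hypothetical names invented for illustration, it uses only the JDK, and it records the overflow in a flag as a simplification.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical, simplified model of a write-limited text sink (not a Tika class).
class LimitedTextSink {
    private final Writer writer;
    private final int writeLimit;      // maximum characters to write, -1 for no limit
    private int writeCount = 0;        // characters written so far
    private boolean limitReached = false;

    LimitedTextSink(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    // Writes the chunk, truncating it if it would exceed the limit.
    void characters(char[] ch, int start, int length) throws IOException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            writer.write(ch, start, length);
            writeCount += length;
        } else {
            // Only part of the chunk fits: write what is left and stop counting.
            int remaining = writeLimit - writeCount;
            writer.write(ch, start, remaining);
            writeCount = writeLimit;
            limitReached = true;
        }
    }

    boolean isLimitReached() {
        return limitReached;
    }

    // Convenience helper: feeds the chunks through a sink and returns the captured text.
    static String capture(int writeLimit, String... chunks) {
        StringWriter out = new StringWriter();
        LimitedTextSink sink = new LimitedTextSink(out, writeLimit);
        try {
            for (String chunk : chunks) {
                sink.characters(chunk.toCharArray(), 0, chunk.length());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With a limit of 5, feeding "abc" and then "defg" keeps only the first five characters; with -1 everything is kept. The same idea explains the question: a handler created with a default limit of 100000 silently bounds the extracted text unless a different limit is passed to its constructor.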

answered by lisak, Oct 11 '22 13:10