How to read large files using Tika?

Question

I'm parsing large PDF and Word documents using Tika, but I get the following error message:

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How can I increase the limit?

Gagravarr · Accepted Answer

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs. The no-argument constructor caps the output at 100,000 characters, which is the limit named in your error message.

Your code would then look something like this (inspired by the example):

// A write limit of -1 disables the default 100,000-character cap
BodyContentHandler handler = new BodyContentHandler(-1);

AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// try-with-resources closes the stream even if parsing fails
try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
    parser.parse(stream, handler, metadata);
    return handler.toString();
}
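
To make the meaning of the write limit more concrete, here is a rough sketch of how a limit-enforcing handler could behave. It uses only the SAX classes bundled with the JDK. The class name LimitedTextHandler and its internals are hypothetical and are not Tika's actual implementation. The sketch only mirrors the behaviour described above: -1 means no limit, and when a limit is hit, the text up to the limit stays available.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; NOT Tika's own code. */
public class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // -1 means "no limit"

    public LimitedTextHandler(int writeLimit) {
        if (writeLimit < -1) {
            throw new IllegalArgumentException("writeLimit must be -1 or non-negative");
        }
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // With no limit, or while the chunk still fits, just collect the text.
        if (writeLimit == -1 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Otherwise keep only what fits, then signal that the limit was reached.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("Write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) throws SAXException {
        char[] chunk = "0123456789".toCharArray();

        // Unlimited handler: accepts every chunk.
        LimitedTextHandler unlimited = new LimitedTextHandler(-1);
        for (int i = 0; i < 3; i++) {
            unlimited.characters(chunk, 0, chunk.length);
        }
        System.out.println("unlimited length: " + unlimited.toString().length());

        // Limited handler: stops partway through the third chunk.
        LimitedTextHandler limited = new LimitedTextHandler(25);
        try {
            for (int i = 0; i < 3; i++) {
                limited.characters(chunk, 0, chunk.length);
            }
        } catch (SAXException e) {
            // The truncated text is still readable after the exception.
            System.out.println("limited length: " + limited.toString().length());
        }
    }
}
```

If you do want to keep a finite limit in your real Tika code, the same pattern should apply: catch the exception thrown by the parse call and still read the truncated text from the handler, since the error message says the text up to the limit remains available.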

Tags:

apache-tika
