Apache Tika maxStringLength reached

I have thousands of PDF documents, each 11-15 MB. When I parse one, my program fails with an error saying the document contains more than 100,000 characters.

Error output:

Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

How can I increase the limit to 10-15 MB?

I found a solution that uses the Tika facade class, but I could not find a way to integrate it with my code.

  Tika tika = new Tika(); 
  tika.setMaxStringLength(10*1024*1024);

Here is my code:

  BodyContentHandler handler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
  FileInputStream inputstream = new FileInputStream(location);
  ParseContext pcontext = new ParseContext();
  PDFParser pdfparser = new PDFParser(); 
  pdfparser.parse(inputstream, handler, metadata, pcontext);

Output:

  // Print the extracted text held by the handler, not the ParseContext object
  System.out.println("Content of the PDF: " + handler.toString());
Alican Balik asked Feb 21 '16 22:02


1 Answer

Use

BodyContentHandler handler = new BodyContentHandler(-1);

to disable the limit. From the Javadoc:

The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
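
To make the quoted behaviour concrete, here is a minimal sketch of a write-limited SAX text collector built only on the JDK's `org.xml.sax` classes. It is a toy model, not Tika's source: the class name `LimitedTextHandler` is hypothetical, it throws a plain `SAXException` rather than Tika's `WriteLimitReachedException`, and treating every negative limit as "no limit" is an assumption made for safety (the Javadoc only documents -1).

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Toy model of a write-limited text collector (hypothetical; not Tika's implementation). */
public class LimitedTextHandler extends DefaultHandler {
    // Maximum number of characters to keep; a negative value (e.g. -1) disables the limit.
    private final int writeLimit;
    private final StringBuilder text = new StringBuilder();

    public LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit < 0 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then signal that the limit was hit.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        char[] chunk = "0123456789".toCharArray();

        // With -1 the buffer grows without bound.
        LimitedTextHandler unlimited = new LimitedTextHandler(-1);
        try {
            for (int i = 0; i < 3; i++) {
                unlimited.characters(chunk, 0, chunk.length);
            }
        } catch (SAXException e) {
            throw new IllegalStateException("unexpected with the limit disabled", e);
        }
        System.out.println(unlimited.toString().length()); // prints 30

        // With a limit of 25 the third chunk overflows and an exception is thrown.
        LimitedTextHandler limited = new LimitedTextHandler(25);
        try {
            for (int i = 0; i < 3; i++) {
                limited.characters(chunk, 0, chunk.length);
            }
        } catch (SAXException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(limited.toString().length()); // prints 25
    }
}
```

Applied to the code in the question, the only change needed is to construct the handler as `new BodyContentHandler(-1)`; the extracted text is then available from `handler.toString()` after `parse` returns. Keep in mind that with the limit disabled the whole document text is held in memory at once.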

wero answered Sep 29 '22 13:09