How to parse large text file with Apache Tika 1.5?

Problem: For my test, I want to extract the text from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt", with Apache Tika.

My solution: I first tried TikaInputStream, since it provides buffering, and then BufferedInputStream, but neither solved my problem. Here is my test class:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {
    public void readMyFile(String fname) throws IOException, SAXException,
            TikaException {
        System.out.println("Working...");

        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));

        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();

        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);

        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}

Problem: Upon invoking the parse method of the parser I am getting:

Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuffer.append(StringBuffer.java:322)
    at java.io.StringWriter.write(StringWriter.java:94)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
    at com.tastyminerals.cli.Printer.main(Printer.java:46) 

I tried raising the JVM heap settings to -Xms512M -Xmx1024M, but that didn't help, and I don't want to use larger values.

Questions: What is wrong with my code? How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?

minerals, asked Sep 17 '25 15:09


2 Answers

You can pass -1 to BodyContentHandler to disable its write limit:

BodyContentHandler bodyHandler = new BodyContentHandler(-1);

Sachin, answered Sep 20 '25 05:09


Pass BodyContentHandler a Writer or OutputStream instead of int

As Gagravarr mentioned, the BodyContentHandler you've used builds an internal string buffer of the file's content. Because Tika then tries to hold the entire content in memory at once, this approach will hit an OutOfMemoryError on large files.
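To illustrate the difference between buffering and streaming, here is a minimal sketch that uses only the JDK's SAX classes. StreamingTextHandler is a hypothetical class written for this answer, not Tika's actual implementation; it only shows the general idea of forwarding each chunk of text to a Writer as it arrives instead of appending it to a growing in-memory buffer.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): passes every text chunk straight
// to the supplied Writer, so the handler itself never holds the whole document.
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;
    private long charsSeen = 0;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length); // forward the chunk immediately
            charsSeen += length;
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    long charsSeen() {
        return charsSeen;
    }
}

class StreamingTextHandlerDemo {
    public static void main(String[] args) throws Exception {
        // A StringWriter stands in for a file-backed Writer to keep the demo
        // self-contained; with a real file, memory use stays flat.
        StringWriter sink = new StringWriter();
        StreamingTextHandler handler = new StreamingTextHandler(sink);

        char[] chunk = "en Main_Page 42 1000\n".toCharArray();
        handler.characters(chunk, 0, chunk.length);

        System.out.println("chars forwarded: " + handler.charsSeen());
    }
}
```

With a handler like this, memory use depends on the size of each chunk rather than on the size of the file, which is the property the Writer-based constructor described below relies on.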

If your goal is to write out the Tika parse results to another file for later processing, you can construct BodyContentHandler with a Writer (or OutputStream directly) instead of passing an int:

Path outputFile = Path.of("output.txt"); // use Paths.get() on Java versions before 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);

And then call Tika parse:

Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);

AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();

parser.parse(inputStream, content, meta, context);

By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.

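Note that Tika will not close your input or output streams for you, so close both once parsing finishes. Until the PrintWriter is flushed or closed, the last buffered output may not have reached the file.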
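The buffering and closing behaviour can be checked without Tika. The sketch below is a JDK-only illustration in which WriterFlushDemo and bytesBeforeAndAfterClose are hypothetical names and a ByteArrayOutputStream stands in for the output file. It writes a short string through a PrintWriter and records how many bytes have reached the underlying stream before and after try-with-resources closes the writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

class WriterFlushDemo {
    // Returns {bytes in the sink before close, bytes in the sink after close}.
    static int[] bytesBeforeAndAfterClose(String text) {
        // Stand-in for Files.newOutputStream(outputFile)
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int[] sizes = new int[2];

        try (PrintWriter writer = new PrintWriter(sink)) {
            writer.write(text);
            sizes[0] = sink.size(); // short text is still held in the writer's buffer
        } // leaving the block closes the writer, which flushes the buffer

        sizes[1] = sink.size();
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = bytesBeforeAndAfterClose("extracted text");
        System.out.println("before close: " + sizes[0] + ", after close: " + sizes[1]);
    }
}
```

Applied to the snippets above, this suggests declaring both the PrintWriter and the TikaInputStream in a try-with-resources statement around the parse call, so that both are closed even if parsing throws.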

Swizzard, answered Sep 20 '25 06:09