Problem: For my test, I want to extract text data from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt", with Apache Tika.
My solution:
I tried to use TikaInputStream since it provides buffering, and then I tried a BufferedInputStream, but that didn't solve my problem. Here is my test class:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {

    public void readMyFile(String fname) throws IOException, SAXException,
            TikaException {
        System.out.println("Working...");

        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));
        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();
        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);

        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Problem:
Upon invoking the parse method of the parser, I am getting:
Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.StringWriter.write(StringWriter.java:94)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
at com.tastyminerals.cli.Printer.main(Printer.java:46)
I tried increasing the JVM heap with -Xms512M -Xmx1024M, but that didn't work, and I don't want to use any larger values.
Questions: What is wrong with my code? How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?
You can construct the handler like this to avoid the size limit:
BodyContentHandler bodyHandler = new BodyContentHandler(-1);
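Wired into the readMyFile method from the question, only the handler line changes (a minimal sketch; the rest of the class stays the same):

    ContentHandler content = new BodyContentHandler(-1); // -1 disables the internal write limit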
As Gagravarr mentioned, the BodyContentHandler you've used builds up an internal string buffer of the file's content. Because Tika tries to hold the entire content in memory at once, this approach will hit an OutOfMemoryError for large files.
If your goal is to write the Tika parse results out to another file for later processing, you can construct BodyContentHandler with a Writer (or an OutputStream directly) instead of passing an int:
Path outputFile = Path.of("output.txt"); // Paths.get() if not using Java 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);
And then call Tika parse:
Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);
AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();
parser.parse(inputStream, content, meta, context);
By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.
Note that Tika will not automatically close your input or output streams for you.
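Putting the pieces together, here is a minimal self-contained sketch (assuming Java 11+ for Path.of; the class name StreamingPrinter and the input.txt/output.txt file names are just placeholders) that uses try-with-resources to close both streams, since Tika won't do it for you:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class StreamingPrinter {
    public static void main(String[] args) throws Exception {
        Path inputFile = Path.of("input.txt");   // placeholder input path
        Path outputFile = Path.of("output.txt"); // placeholder output path

        // try-with-resources closes both the input stream and the writer
        try (TikaInputStream inputStream = TikaInputStream.get(inputFile);
             PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile))) {
            // The handler streams extracted text straight to the writer
            // instead of buffering it all in memory
            BodyContentHandler content = new BodyContentHandler(printWriter);
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(inputStream, content, new Metadata(), new ParseContext());
        }
    }
}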