I'm reading a large TSV file (~40 GB) and trying to prune it by reading it line by line and writing only certain lines to a new file. However, I keep getting the following exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
at java.lang.StringBuffer.append(StringBuffer.java:323)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at java.io.BufferedReader.readLine(BufferedReader.java:379)
Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what could be causing the large memory usage here. I tried increasing the heap size, but it didn't make any difference (machine with 4 GB RAM). I also tried flushing the output file every X lines, but it didn't help either. I'm thinking maybe I need to call the GC explicitly, but that doesn't sound right.
Any thoughts? Thanks a lot. BTW - I know I should call trim() only once, store it, and then use it.
Set<String> set = new HashSet<String>();
set.add("A-B");
...
...

static public void main(String[] args) throws Exception
{
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
    PrintStream output = new PrintStream(outputFile, "UTF-8");
    String line = reader.readLine();
    while (line != null) {
        String[] fields = line.split("\t");
        if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
            output.println(fields[0].trim() + "-" + fields[1].trim());
        line = reader.readLine();
    }
    output.close();
}
OutOfMemoryError: Java heap space. An easy way to deal with this error is to increase the maximum heap size with the JVM option -Xmx (e.g. "-Xmx512M"), which often makes the error go away immediately.
Java objects reside in an area called the heap. The heap is created when the JVM starts up and may grow or shrink while the application runs. When the heap becomes full, garbage is collected: objects that are no longer referenced are cleared, making space for new objects.
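If you go that route, it helps to confirm how much heap the JVM actually received. A minimal sketch (the class name and the 512 MB value are only illustrative):

public class HeapCheck {
    public static void main(String[] args) {
        // Prints the maximum heap the JVM will attempt to use, in megabytes.
        // Launch with, for example:  java -Xmx512m HeapCheck
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}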
Most likely, what's going on is that the file does not have line terminators, so the reader just keeps growing its StringBuffer without bound until it runs out of memory.
The solution would be to read a fixed number of characters at a time, using the reader's read method, and then look for newlines (or other parsing tokens) within the smaller buffers.
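A rough sketch of what that could look like, reusing the question's filtering logic (the file names, the MAX_LINE cap, and the process helper are placeholders I've added, not part of the original code):

import java.io.*;
import java.util.*;

public class Pruner {
    // Cap how long a single "line" may get; characters beyond this are discarded
    // instead of growing the buffer without bound.
    static final int MAX_LINE = 1000000;

    public static void main(String[] args) throws Exception {
        Set<String> set = new HashSet<String>();
        set.add("A-B"); // placeholder key, as in the question

        Reader reader = new InputStreamReader(new FileInputStream("input.tsv"), "UTF-8");
        PrintStream output = new PrintStream("output.tsv", "UTF-8");

        char[] buf = new char[8192];
        StringBuilder line = new StringBuilder();
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c == '\n') {
                    process(line.toString(), set, output);
                    line.setLength(0);
                } else if (c != '\r' && line.length() < MAX_LINE) {
                    line.append(c);
                }
            }
        }
        process(line.toString(), set, output); // last line may lack a terminator
        reader.close();
        output.close();
    }

    static void process(String line, Set<String> set, PrintStream output) {
        String[] fields = line.split("\t");
        if (fields.length < 2) return;
        String key = fields[0].trim() + "-" + fields[1].trim();
        if (set.contains(key)) output.println(key);
    }
}

Because the StringBuilder is reset after every newline and capped at MAX_LINE characters, a file with missing or extremely distant line terminators can no longer exhaust the heap the way BufferedReader.readLine's internal buffer does.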