I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.
I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;
        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}
Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:

String[] references @ 8 bytes ea = 0.5 GB
String[] objects @ 32 bytes ea = 2 GB
char[] objects @ 32 bytes ea = 2 GB
String references @ 8 bytes ea = 10 GB
Strings @ 44 bytes ea = 53 GB
chars @ 2 bytes ea = 15 GB

Total: 83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with a > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with

pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens

just to be safe, and kick back while I watch top.
Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.
To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.
So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.
-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
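For reference, that run corresponds to an invocation along these lines (file name and heap sizes as in the question above):

pv giant-file.txt | java -Xmx96G -Xms96G -XX:+UseConcMarkSweepGC LoadTokens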
Idea 1
Start by considering this:

while ((line = stdin.readLine()) != null) {

It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:

String[] sentence = line.split("\\s+");

You should determine whether the strings returned by split keep the same backing char[].
If they do (and assuming your lines are often shorter than 80 characters) you should use:

line = new String(line);

This creates a copy of the string with a "right-sized" backing character array.
If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
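A minimal sketch of how that would look in the read loop from the question, assuming a pre-7u6 JDK (such as the 1.6.0_29 above) where substring() and split() share the parent string's char[]:

while ((line = stdin.readLine()) != null) {
    // Re-copy the line so its backing char[] is exactly line.length() chars,
    // not the >= 80-char buffer readLine may have handed back.
    line = new String(line);
    // On a pre-7u6 JDK the split results are substrings sharing that
    // right-sized char[], so each token only costs a String header.
    String[] sentence = line.split("\\s+");
    if (sentence.length > 0) {
        sentences.add(sentence);
        numTokens += sentence.length;
    }
}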
Idea 2
Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
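A minimal sketch of that test, assuming the sentence count (about 66 million here, e.g. from a prior wc -l giant-file.txt) is known up front:

// Plain String[][] instead of ArrayList<String[]>, sized from a prior wc -l.
String[][] sentences = new String[66000000][];
int sentenceCount = 0;
String line;
while ((line = stdin.readLine()) != null) {
    sentences[sentenceCount++] = line.split("\\s+");
}
System.out.println("Read " + sentenceCount + " sentences.");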
Idea 3
How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" approach from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in it is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
You should use the following tricks:
Help the JVM to collect identical tokens into a single String reference by interning each token before you store it, e.g. sentence[i] = sentence[i].intern(); (intern() is a method of String, so it has to be applied per token, not to the String[] array). See String.intern for details. As far as I know, it should also have the effect Jon Skeet spoke about: it cuts the char array into small pieces.
Use experimental HotSpot options to compact String and char[] implementations and related ones:
-XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
With this much memory, you should also configure your system and the JVM to use large pages; a combined command line is sketched below.
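Putting those flags together with the invocation from the question might look like this (a sketch only: -XX:+UseLargePages additionally assumes the OS has huge pages reserved, e.g. via the vm.nr_hugepages sysctl on RHEL 6, and the compressed-string options are experimental and not present in every JDK 6 build):

pv giant-file.txt | java -Xmx96G -Xms96G \
    -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat \
    -XX:+UseLargePages LoadTokens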
It is really difficult to gain more than about 5% from GC tuning alone. You should first reduce your application's memory consumption through profiling.
By the way, I wonder whether you really need to keep the full content of the corpus in memory. I do not know what your code does next with all the sentences, but you could consider an alternative such as the Lucene indexing tool for counting words or extracting other information from your text.
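For example, if the downstream processing were just a word count, a plain streaming pass keeps only the vocabulary in memory. This is a sketch of that general point rather than of Lucene itself, and the class name CountTokens is just for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

/** Count word frequencies from stdin without keeping the corpus in memory. */
public class CountTokens {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = stdin.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                if (token.isEmpty()) continue;
                Long c = counts.get(token);
                if (c == null) {
                    counts.put(new String(token), 1L); // copy: key gets its own small char[]
                } else {
                    counts.put(token, c + 1L);         // existing key is kept
                }
            }
        }
        System.out.println(counts.size() + " distinct words.");
    }
}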