
Poor performance with large Java lists

I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.

I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;

        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}

Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:

  • 66e6 String[] references @ 8 bytes ea = 0.5 GB
  • 66e6 String[] objects @ 32 bytes ea = 2 GB
  • 66e6 char[] objects @ 32 bytes ea = 2 GB
  • 1.3e9 String references @ 8 bytes ea = 10 GB
  • 1.3e9 Strings @ 44 bytes ea = 53 GB
  • 8e9 chars @ 2 bytes ea = 15 GB

83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens just to be safe, and kick back while I watch top.

Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.

To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.

So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.

asked Mar 06 '12 by Jay Hacker




3 Answers

-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
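In other words, keeping the command line from the question and only swapping in the collector flag:

    pv giant-file.txt | java -Xmx96G -Xms96G -XX:+UseConcMarkSweepGC LoadTokens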

answered by Jay Hacker


Idea 1

Start by considering this:

while ((line = stdin.readLine()) != null) {

It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:

String[] sentence = line.split("\\s+");

You should determine whether the strings returned by split keep the same backing char[].

If they do (and assuming your lines are often shorter than 80 characters) you should use:

line = new String(line);

This will create a copy of the string with a "right-sized" backing char array.

If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
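For the case where they do share, a minimal sketch of the adjusted loop (this assumes the older String.substring/split behaviour, where the resulting tokens reuse the original line's backing char[]):

    while ((line = stdin.readLine()) != null) {
        // Re-copy the line: the new String's char[] is sized to the line itself,
        // not to the >= 80-char buffer readLine may have allocated.
        line = new String(line);
        // Assumption: the tokens returned by split share that right-sized char[].
        String[] sentence = line.split("\\s+");
        if (sentence.length > 0) {
            sentences.add(sentence);
            numTokens += sentence.length;
        }
    }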

Idea 2

Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
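For example (a sketch reusing the reader from the question; SENTENCE_COUNT is a stand-in for the line count you'd get from wc -l giant-file.txt):

    // SENTENCE_COUNT is assumed to be known up front, e.g. from `wc -l giant-file.txt`.
    String[][] sentences = new String[SENTENCE_COUNT][];
    int n = 0;
    String line;
    while ((line = stdin.readLine()) != null) {
        sentences[n++] = line.split("\\s+");
    }
    System.out.println("Read " + n + " sentences.");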

Idea 3

How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in it is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
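A rough sketch of that idea, using a HashMap as a manual pool (a plain HashSet doesn't let you fetch the instance it already holds for an equal key, so mapping each word to itself is the usual workaround; POOL and canonicalize are names invented here):

    private static final Map<String, String> POOL = new HashMap<String, String>();

    /** Returns one canonical instance per distinct word, backed by its own right-sized char[]. */
    private static String canonicalize(String word) {
        String copy = new String(word);   // drop any shared backing char[]
        String existing = POOL.get(copy);
        if (existing == null) {
            POOL.put(copy, copy);
            return copy;
        }
        return existing;
    }

    // In the read loop:
    //     for (int i = 0; i < sentence.length; i++) {
    //         sentence[i] = canonicalize(sentence[i]);
    //     }

This needs java.util.HashMap and java.util.Map imports; how much it saves depends on how many distinct words the corpus actually has.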

answered by Jon Skeet


You should use the following tricks:

  • Help the JVM to collapse identical tokens into a single String reference by interning each token before you store it, e.g. sentence[i] = sentence[i].intern() (see String.intern for details; a sketch follows this list). As far as I know, it should also have the effect Jon Skeet described: cutting the long backing char arrays into small per-token pieces.

  • Use experimental HotSpot options to compact String and char[] implementations and related ones:

    -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
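
For the first of those points, a minimal sketch of interning tokens as they are read (note that on the Java 6 VM used in the question, interned strings live in the permanent generation, so its size may also need attention):

    String[] sentence = line.split("\\s+");
    for (int i = 0; i < sentence.length; i++) {
        sentence[i] = sentence[i].intern();  // one canonical String per distinct token
    }
    sentences.add(sentence);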
    

With this much memory, you should also configure your system and the JVM to use large pages.

It is really difficult to gain more than about 5% from GC tuning alone. You should first reduce your application's memory consumption, guided by profiling.

By the way, I wonder if you really need to hold the full corpus in memory. I don't know what your code does next with all the sentences, but you might consider an alternative such as the Lucene indexing library for counting words or extracting other information from your text.

answered by Yves Martin