I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.
I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;
        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}
Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:

String[] references @ 8 bytes ea = 0.5 GB
String[] objects @ 32 bytes ea = 2 GB
char[] objects @ 32 bytes ea = 2 GB
String references @ 8 bytes ea = 10 GB
Strings @ 44 bytes ea = 53 GB
chars @ 2 bytes ea = 15 GB

Total: 83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with a > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with

pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens

just to be safe, and kick back while I watch top.
Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.
To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.
So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.
-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
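For reference, that run corresponds to an invocation along these lines (file name and heap sizes as in the question above):

pv giant-file.txt | java -Xmx96G -Xms96G -XX:+UseConcMarkSweepGC LoadTokens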
Idea 1
Start by considering this:

while ((line = stdin.readLine()) != null) {

It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:

String[] sentence = line.split("\\s+");

You should determine whether the strings returned by split keep the same backing char[].
If they do (and assuming your lines are often shorter than 80 characters) you should use:

line = new String(line);

This creates a copy of the string with a "right-sized" backing character array.
If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
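A minimal sketch of how that would look in the read loop from the question, assuming a pre-7u6 JDK (such as the 1.6.0_29 above) where substring() and split() share the parent string's char[]:

while ((line = stdin.readLine()) != null) {
    // Re-copy the line so its backing char[] is exactly line.length() chars,
    // not the >= 80-char buffer readLine may have handed back.
    line = new String(line);
    // On a pre-7u6 JDK the split results are substrings sharing that
    // right-sized char[], so each token only costs a String header.
    String[] sentence = line.split("\\s+");
    if (sentence.length > 0) {
        sentences.add(sentence);
        numTokens += sentence.length;
    }
}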
Idea 2
Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
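A minimal sketch of that test, assuming the sentence count (about 66 million here, e.g. from a prior wc -l giant-file.txt) is known up front:

// Plain String[][] instead of ArrayList<String[]>, sized from a prior wc -l.
String[][] sentences = new String[66000000][];
int sentenceCount = 0;
String line;
while ((line = stdin.readLine()) != null) {
    sentences[sentenceCount++] = line.split("\\s+");
}
System.out.println("Read " + sentenceCount + " sentences.");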
Idea 3
How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" approach from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in it is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
You should use the following tricks:
Help the JVM to collect identical tokens into a single String reference by interning each token before you store it, e.g. sentence[i] = sentence[i].intern(); (intern() is a method of String, so it has to be applied per token, not to the String[] array). See String.intern for details. As far as I know, it should also have the effect Jon Skeet spoke about: it cuts the char array into small pieces.
Use experimental HotSpot options to compact String and char[] implementations and related ones:
-XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
With this much memory, you should also configure your system and the JVM to use large pages; a combined command line is sketched below.
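Putting those flags together with the invocation from the question might look like this (a sketch only: -XX:+UseLargePages additionally assumes the OS has huge pages reserved, e.g. via the vm.nr_hugepages sysctl on RHEL 6, and the compressed-string options are experimental and not present in every JDK 6 build):

pv giant-file.txt | java -Xmx96G -Xms96G \
    -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat \
    -XX:+UseLargePages LoadTokens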
It is really difficult to gain more than about 5% from GC tuning alone. You should first reduce your application's memory consumption through profiling.
By the way, I wonder whether you really need to keep the full content of the corpus in memory. I do not know what your code does next with all the sentences, but you could consider an alternative such as the Lucene indexing tool for counting words or extracting other information from your text.
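For example, if the downstream processing were just a word count, a plain streaming pass keeps only the vocabulary in memory. This is a sketch of that general point rather than of Lucene itself, and the class name CountTokens is just for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

/** Count word frequencies from stdin without keeping the corpus in memory. */
public class CountTokens {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = stdin.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                if (token.isEmpty()) continue;
                Long c = counts.get(token);
                if (c == null) {
                    counts.put(new String(token), 1L); // copy: key gets its own small char[]
                } else {
                    counts.put(token, c + 1L);         // existing key is kept
                }
            }
        }
        System.out.println(counts.size() + " distinct words.");
    }
}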