Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting millions of int/string pairs using Java

Tags:

java

sorting

I have 50,000,000 (integer, string) pairs in a text file. The integers are times in milliseconds, so are 13 digits long (e.g. 1337698339089).

The entries in the text file are like this:

1337698339089|blaasdasd
1337698339089|asdasdas
1337698338089|kasda

There can be identical entries.

I want to sort the entries on the integers (in ascending order) preserving any duplicate integers and preserving the (integer, string) pairs. The approach I have taken is leading to memory errors, and so I'm looking for alternative approaches.

My approach is something like this (using some pseudo-code):

// declare TreeMap to do the sorting
TreeMap<Double, String> sorted = new TreeMap<Double, String>();

// loop through entries in text file, and put each in the treemap:
for each entry (integer, string) in the text file:

   Random rand = new Random();
   double inc = 0.0;

   while (sorted.get(integer + inc) != null) {
       inc = rand.nextDouble();
   }

   sorted.put(integer + inc, string);

I am using random numbers here to ensure that duplicate integers can be entered in the treemap (by incrementing them by a double between 0 and 1).

// to print the sorted entries:
for (Double d : sorted.KeySet()) {
    System.out.println(Math.round(d) + "|" + sorted.get(d));
}

This approach works but breaks down for 50,000,000 entries (I think because the treemap is becoming too large; or possibly because the while loop is running for too long).

I would like to know what approach more experienced programmers would take.

Many thanks!

like image 772
Andrew Avatar asked May 22 '12 15:05

Andrew


2 Answers

You should be able to do this with a list, if you have enough memory. I would create a separate class for the entry:

class Foo : Comparable<Foo> {
    private final long time;
    private final String text;

    // Constructor etc
}

In terms of memory, you need to be able to store 50 million instances, and references to them. On a 32-bit JVM, that would be:

  • 8 bytes of overhead per object (IIRC)
  • 8 bytes for the time
  • 4 bytes for the text field
  • ~54 bytes for the string (8 byte overhead + three int fields IIRC + char[] array reference + ~32 bytes for a 10 character array)
  • 4 bytes for the reference in the array or ArrayList

So that's about 80 bytes per instance - say 100 to round up. To store 50,000,000 of those would take 5,000,000,000 bytes, aka 5GB, which is more than I believe a 32-bit JVM will cope with.

So to do all this in memory, you'll need a 64-bit machine and 64-bit JVM, and then the overhead potentially increases somewhat due to larger references etc. Feasible, but not terribly pleasant.

A large part of this is due to the strings, however. If you really wanted to be efficient, you could create a giant char array, then store offsets into it within Foo. Read into the array as you read the text data, and then use it to write out the data after sorting. More complex, and ugly, but considerably more memory-efficient.

Alternatively, you could do this not all in memory - I'm sure if you search around you'll find lots of information about sorting via the file system.

like image 56
Jon Skeet Avatar answered Oct 27 '22 01:10

Jon Skeet


I might consider using a database (like H2; which is convenient since you can pull it right into your Java project) and setup the index the way you want it. Databases have already solved the problem of dealing with a lot of data and organizing it. Then you can do a SQL query to get the results in order and write them back out.

The result set will stream the data out to you in chunks; not try to load everything into a single list.

While H2 does support in memory; I would configure it to use a disk in this case unless you have a lot of RAM and 64bit Java.

like image 36
danpaq Avatar answered Oct 27 '22 01:10

danpaq