Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing 20 GB input file to an ArrayList

Tags:

java

I need to sort a 20 GB file ( which consists of random numbers) in the ascending order, But I am not understanding what technique should I use. I tried to use ArrayList in my Java Program, but it runs out of Memory. Increasing the heap size didn't work too, I guess 20 GB is too big. Can anybody guide me, how should I proceed ?

like image 395
Vikramjit Avatar asked Jan 31 '14 10:01

Vikramjit


2 Answers

You shall use an external sorting algorithm, do not try to fit this in memory.

http://en.wikipedia.org/wiki/External_sorting

If you think it is too complex, try this:

  1. include H2 database in your project
  2. create a new on-disk database (will be created automatically on first connection)
  3. create some simple table where the numbers will be stored
  4. read data number-by-number and insert it into the database (do not forget to commit each 1000 numbers or so)
  5. select numbers with ORDER BY clause :)
  6. use JDBC resultSet to fetch results on-the-fly and write them to an output file

H2 database is simple, works very well with Java and can be embedded in your JAR (does not need any kind of installation or setup).

like image 82
voho Avatar answered Oct 19 '22 06:10

voho


You don't need any special tools for this really. This is a textbook case for external merge sort, wherein you read in parts of the large file at a time (say 100M), sort them, and write the sorted results to an external file. Read in another part, sort it, spit it back out, until there's nothing left to sort. Then you need to read in the sorted chunks, a smaller piece at a time (say 10M) and sort those in memory. The tricky point is to merge those sorted bits together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. Also, here's an implementation in Java that does this kind of external merge sorting.

like image 38
Martin Dinov Avatar answered Oct 19 '22 07:10

Martin Dinov