
In Java, is there a way to randomize a file too large to fit into memory?

What I would like to do is shuffle the rows (read from a CSV file), then write the first 10,000 shuffled rows to one CSV file and the remainder to a separate CSV file. With a smaller file I can do something like:

java.util.Collections.shuffle(...)
for (int i=0; i < 10000; i++) printcsv(...)
for (int i=10000; i < data.length; i++) printcsv(...)

However, with very large files I now get an OutOfMemoryError.

deltanovember asked Oct 24 '11 12:10



1 Answer

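You could:

  • Use more memory (for example, by raising the JVM heap limit with -Xmx), or

  • Shuffle not the actual CSV rows but only a collection of row numbers. Then read the input file line by line (buffered, of course) and write each line to one of the two output files, depending on whether its row number is among the first 10,000 in the shuffled collection.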
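The second option could be sketched roughly as follows. This is only an illustration under some assumptions, not a definitive implementation: the class name `CsvSampler` and the method `split` are made up for this example, the file is assumed to have no header row, every CSV record is assumed to sit on exactly one line (no quoted fields containing line breaks), and the line count is assumed to fit in an `int`. It reads the file twice: once to count the rows, once to route them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper class, named for this example only.
class CsvSampler {

    /** Writes sampleSize randomly chosen lines of input to sample and all other lines to rest. */
    static void split(Path input, Path sample, Path rest, int sampleSize, Random rnd)
            throws IOException {
        // Pass 1: count the lines without keeping any of them in memory.
        int total = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) total++;
        }

        // Shuffle row numbers only (4 bytes per row), not the rows themselves.
        int[] rows = new int[total];
        for (int i = 0; i < total; i++) rows[i] = i;
        int k = Math.min(sampleSize, total);
        BitSet chosen = new BitSet(total);
        // Partial Fisher-Yates shuffle: only the first k positions need to be randomized.
        for (int i = 0; i < k; i++) {
            int j = i + rnd.nextInt(total - i);
            int tmp = rows[i];
            rows[i] = rows[j];
            rows[j] = tmp;
            chosen.set(rows[i]); // this row number goes to the sample file
        }

        // Pass 2: stream the file again and route each line to one of the outputs.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter outSample = Files.newBufferedWriter(sample, StandardCharsets.UTF_8);
             BufferedWriter outRest = Files.newBufferedWriter(rest, StandardCharsets.UTF_8)) {
            String line;
            for (int row = 0; (line = in.readLine()) != null; row++) {
                BufferedWriter out = chosen.get(row) ? outSample : outRest;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

A call such as `CsvSampler.split(in, sampleOut, restOut, 10000, new Random())` would then produce the two files. Note that with this approach the rows inside each output file keep their original relative order; only the choice of which rows land in the sample is random. If the sample itself must also be in random order, 10,000 rows should be small enough to load and shuffle in memory afterwards.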

michael667 answered Oct 23 '22 12:10
