
Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.

Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.

Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?

Jake asked Sep 17 '08 21:09

3 Answers

So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?

I'm a big fan of memory-mapped I/O, a.k.a. direct byte buffers. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes between disk and memory auto-magically and very quickly.)

I suggest this approach because a) it works for me, and b) it lets you focus on your algorithm while the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best better than we lowly programmers do. ;)

How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
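As a minimal sketch of that idea (the file name and the record format, packed 8-byte doubles, are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MbbExample {
    // Sum every 8-byte double in the file via a single memory mapping.
    // A single mapping is limited to 2 GB; see the windowed variant below
    // in the thread for larger files.
    static double sumDoubles(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            double sum = 0;
            while (mbb.remaining() >= Double.BYTES) {
                sum += mbb.getDouble();  // the OS pages bytes in on demand
            }
            return sum;
        }
    }
}
```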

BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine and OS will take you to 1TB or 128TB of mappable data.

If you are thinking about performance, then know of Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details (NIO Performance Tips) and other Java performance related things.

Stu Thompson answered Oct 23 '22 20:10

You might want to have a look at the entries in the Wide Finder Project (do a Google search for "wide finder" java).

The Wide Finder task involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.

Matt Quail answered Oct 23 '22 20:10


You could convert to binary, but then you have at least one extra copy of the data if you need to keep the original around.
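A one-time conversion pass might look like this sketch, assuming a format of one number per line (the class and file names are made up for illustration):

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ToBinary {
    // Convert ASCII numbers (one per line) to packed 8-byte doubles.
    // This typically shrinks the data and removes parsing from later passes.
    static void convert(Path ascii, Path binary) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(ascii);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(binary)))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    out.writeDouble(Double.parseDouble(line.trim()));
                }
            }
        }
    }
}
```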

It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again, subsequent passes will be faster.
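One simple form of index, sketched here, is a list of byte offsets of line starts built in a single sequential pass; afterwards you can seek straight to any line. The class and method names are invented for the example:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // One sequential pass: record the byte offset of each line start.
    static List<Long> buildIndex(Path file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        long pos = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') offsets.add(pos);
            }
        }
        // Drop the phantom offset after a trailing newline.
        if (pos > 0 && offsets.get(offsets.size() - 1) == pos) {
            offsets.remove(offsets.size() - 1);
        }
        return offsets;
    }

    // Jump straight to line n using the index.
    static String readLine(Path file, List<Long> index, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(index.get(n));
            return raf.readLine();
        }
    }
}
```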

To answer your questions in order:

Should I load everything into memory all at once?

Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, do some kind of buffered read through them one by one, storing whatever you need along the way.
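Sketched out, assuming one number per line and a running sum as the "whatever you need along the way" (both assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SequentialPass {
    // One buffered pass over an ASCII file, keeping only a running sum.
    // Memory use stays constant no matter how big the file is.
    static double sum(Path file) throws IOException {
        double total = 0;
        try (BufferedReader r = Files.newBufferedReader(file)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!line.isEmpty()) {
                    total += Double.parseDouble(line.trim());
                }
            }
        }
        return total;
    }
}
```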

If not, is opening what's a good way of loading the data partially?

BufferedReader and friends are simplest, although you could look deeper into FileChannel and memory-mapped I/O to go through windows of the data at a time.
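The windowed approach might look like this sketch, which maps a fixed-size chunk at a time so the file can be arbitrarily large (the window size and the newline-counting "work" are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WindowedMap {
    static final int WINDOW = 64 * 1024 * 1024; // 64 MB per mapping; tune as needed

    // Walk a file of any size one mapped window at a time,
    // counting newline bytes as a stand-in for real processing.
    static long countNewlines(Path file) throws IOException {
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer win = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (win.hasRemaining()) {
                    if (win.get() == '\n') count++;
                }
            }
        }
        return count;
    }
}
```

If your records can straddle a window boundary, you would need to overlap the windows or carry the partial record over to the next iteration.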

What are some Java-relevant efficiency tips?

That really depends on what you're doing with the data itself!

John Gardner answered Oct 23 '22 20:10