
Handling large datasets in Java/Clojure: littleBig data

I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.

Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so is still orders of magnitude less than what I should be able to load.


I understand that the JVM places an artificial limit (which can be changed) on the size of the heap, in part through options you can specify as the JVM starts. This leads me to my first questions:

  • Can I change the maximum heap space the JVM gets on startup if I am using Swank-Clojure (via Leiningen)?

  • If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?

But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:

  • Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g. n lines at a time? If so, how?

  • Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.

  • Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?

Right now my plan, assuming there are answers to the above (I'll keep searching!) or insights that lead to equivalent solutions, is to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks, say, "next chunk"; then I'd save any changes made to a file, load the next "chunk" of data, and display it.

Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).


Most of all, though: is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g. click on or near a point to add a "marker" to that point; the marker is drawn as a vertical rule over that point).

I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.

This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.


EDIT Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on it. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.

asked Aug 04 '10 by Isaac


2 Answers

Can I change the maximum heap space the JVM gets on startup if I am using Swank-Clojure (via Leiningen)?

You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.

So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.

I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
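For illustration, a minimal project.clj sketch, assuming a Leiningen version that honors the :jvm-opts key (older versions read the JAVA_OPTS environment variable instead); the project name and dependency version are made up:

    ;; project.clj sketch -- :jvm-opts is honored by newer Leiningen versions;
    ;; with older versions, export JAVA_OPTS="-Xms256m -Xmx1024m" before
    ;; running `lein swank` instead.
    (defproject graphing-app "0.1.0"            ; hypothetical project name
      :dependencies [[org.clojure/clojure "1.2.0"]]
      :jvm-opts ["-Xms256m" "-Xmx1024m"])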

If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?

Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.

Can I "sample" from the file; e.g. read only every z lines?

java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
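A minimal Clojure sketch of both ideas, assuming a plain text file with one record per line (function names and arguments are made up for illustration):

    (require '[clojure.java.io :as io])
    (import '(java.io RandomAccessFile))

    ;; Simplest sampling: keep every zth line of a lazy line sequence.
    (defn sample-every-zth [path z]
      (with-open [rdr (io/reader path)]
        (doall (take-nth z (line-seq rdr)))))

    ;; Byte-indexed access: jump to an arbitrary offset, skip the
    ;; (possibly partial) line at that point, and return the next full line.
    (defn line-near-offset [path offset]
      (with-open [raf (RandomAccessFile. path "r")]
        (.seek raf (long offset))
        (.readLine raf)
        (.readLine raf)))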

Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g. n lines at a time? If so, how?

line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.

Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
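A sketch of the line-seq approach, processing n lines at a time; process-chunk! stands in for whatever parsing or graphing you do per chunk:

    (require '[clojure.java.io :as io])

    ;; Stream a large text file in chunks of n lines; only one chunk
    ;; needs to be realized in memory at a time.
    (defn process-in-chunks [path n process-chunk!]
      (with-open [rdr (io/reader path)]
        (doseq [chunk (partition-all n (line-seq rdr))]
          (process-chunk! chunk))))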

Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time?

Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.

To make the most out of the memory you have, keep the data as primitive as possible.
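For example, parsing straight into a primitive double array rather than a sequence of boxed Doubles (a sketch that assumes one number per line):

    (require '[clojure.java.io :as io])

    ;; Read a one-number-per-line file into a primitive double[] so each
    ;; datapoint costs 8 bytes instead of a full boxed Double object.
    (defn read-doubles [path]
      (with-open [rdr (io/reader path)]
        (into-array Double/TYPE
                    (map #(Double/parseDouble %) (line-seq rdr)))))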

For some actual numbers, let's assume you want to graph the contents of a music CD:

  • A CD has two channels, each with 44,100 samples per second
    • 60 min. of music is then ~300 million data points
  • Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
  • Represented as primitive int array (4 bytes per datapoint): 1.2GB
  • Represented as Integer array (32 bytes per datapoint): 10GB

Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).

Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.

If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly way too little to show any level of detail, you would want some form of subsampling or summary data there.
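One common summary, sketched below under the assumption that the samples are already in a primitive double array: keep the min and max of the samples covered by each pixel column, so the overview preserves peaks even though it is heavily downsampled.

    ;; Summarize a double[] into one [min max] pair per pixel column of the
    ;; overview timeline. `width` is the overview width in pixels, e.g. 1900.
    (defn min-max-per-pixel [^doubles data width]
      (let [n         (alength data)
            per-pixel (max 1 (quot n width))]
        (vec
         (for [px (range width)
               :let [start (* px per-pixel)
                     end   (min n (+ start per-pixel))]
               :when (< start n)]
           (loop [i start
                  lo Double/POSITIVE_INFINITY
                  hi Double/NEGATIVE_INFINITY]
             (if (< i end)
               (let [v (aget data i)]
                 (recur (inc i) (min lo v) (max hi v)))
               [lo hi]))))))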

So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.

Update:

On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.

The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)

answered by j-g-faustus


Tossing out a couple ideas from left field...

You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/

Or perhaps memory-mapped I/O.
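A rough sketch of the memory-mapped idea with java.nio, assuming a binary file of raw 8-byte doubles; the OS pages the data in on demand, so the file never has to fit on the Java heap:

    (import '(java.io RandomAccessFile)
            '(java.nio.channels FileChannel$MapMode))

    ;; Memory-map a file of raw 8-byte doubles. Reads go through the OS
    ;; page cache, so the whole file never has to live on the Java heap.
    ;; Note: ByteBuffer defaults to big-endian; call .order first if your
    ;; data is little-endian. The file handle stays open with the buffer.
    (defn map-doubles [path]
      (let [channel (.getChannel (RandomAccessFile. path "r"))
            buffer  (.map channel FileChannel$MapMode/READ_ONLY 0 (.size channel))]
        (.asDoubleBuffer buffer)))

    ;; Example: fetch the millionth value without loading the whole file.
    ;; (.get (map-doubles "data.bin") 1000000)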

answered by p-lo