
Loading and analyzing massive amounts of data

Tags: java, groovy

So for some research work, I need to analyze a ton of raw movement data (currently almost a gig of data, and growing) and spit out quantitative information and plots.

I wrote most of it using Groovy (with JFreeChart for charting) and when performance became an issue, I rewrote the core parts in Java.

The problem is that the analysis and plotting take about a minute, whereas loading all of the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to plots and see the output.

I have a couple ideas on fixing this:

  1. Load all of the data into a SQLite database (a rough JDBC sketch follows this list).
    Pros: It'll be fast, and I'll be able to run SQL to get aggregate data if I need to.

    Cons: I have to write all that code. Also, for some of the plots I need access to every data point, so with a couple hundred thousand files to load, some parts may still be slow.

  2. Use Java RMI to return the object. All the data gets loaded into one root object, which, when serialized, is about 200 MB. I'm not sure how long it would take to transfer a 200 MB object over RMI (to a client on the same machine).

    I'd have to run the server and load all the data, but that's not a big deal.

    Major pro: this should take the least amount of time to write.

  3. Run a server that loads the data and executes a Groovy script on command within the server VM. Overall, this seems like the best idea (for implementation time vs. performance, as well as other long-term benefits); a minimal embedding sketch also follows this list.
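
For option 1, here is roughly what the bulk load into SQLite might look like. This is only a sketch under assumptions: MovementPoint and its fields are made-up stand-ins for the real parsed data, the schema is invented, and it assumes the sqlite-jdbc driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch of option 1: bulk-load parsed movement samples into SQLite.
// MovementPoint is a hypothetical stand-in for the real parsed data.
public class SqliteLoader {

    // Hypothetical parsed sample; substitute the real data structure.
    public static class MovementPoint {
        public long timestamp;
        public double x, y;
    }

    public static void load(List<MovementPoint> points) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:movement.db")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS movement (ts INTEGER, x REAL, y REAL)");
            conn.setAutoCommit(false);   // single transaction: far faster bulk inserts
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO movement (ts, x, y) VALUES (?, ?, ?)")) {
                for (MovementPoint p : points) {
                    ps.setLong(1, p.timestamp);
                    ps.setDouble(2, p.x);
                    ps.setDouble(3, p.y);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```

And for option 3, embedding Groovy in a long-running JVM is only a few lines. Again, a minimal sketch: the data holder, the script file, and the missing command loop/error handling are all placeholders.

```java
import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import java.io.File;

// Sketch of option 3: keep the parsed data resident in a long-running JVM and
// re-run an analysis/plotting script against it on demand, so only the script
// (not the expensive load) runs each time.
public class AnalysisServer {
    private final Object data;   // loaded once at startup, stays in memory

    public AnalysisServer(Object data) {
        this.data = data;
    }

    // Called whenever the plotting script changes.
    public Object runScript(File script) throws Exception {
        Binding binding = new Binding();
        binding.setVariable("data", data);   // the script sees the loaded data as 'data'
        GroovyShell shell = new GroovyShell(binding);
        return shell.evaluate(script);       // e.g. new File("plots.groovy")
    }
}
```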

What I'd like to know is: have other people tackled this problem?

Post-analysis (3/29/2011): A couple months after writing this question, I ended up having to learn R to run some statistics. Using R was far, far easier and faster for data analysis and aggregation than what I was doing.

Eventually, I ended up using Java to run preliminary aggregation, and then ran everything else in R. R also made it much easier to produce beautiful charts than JFreeChart did.

Asked Nov 04 '09 by Reverend Gonzo


2 Answers

Databases are very scalable if you are going to have massive amounts of data. In MS SQL we currently group/sum/filter about 30 GB of data in 4 minutes (somewhere around 17 million records, I think).

If the data is not going to grow very much, then I'd try out approach #2. You can make a simple test application that creates a 200-400 MB object with random data and test the performance of transferring it before deciding whether you want to go that route.
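
For example, a quick-and-dirty test along those lines might look like the sketch below. The array sizes are invented to land near 200 MB, and RMI over loopback will add its own overhead on top of plain serialization, so treat the result as a lower bound.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Random;

// Rough feasibility test for approach #2: build ~200 MB of random data and
// time plain Java serialization. Run with a generous heap (e.g. -Xmx1g) so
// buffer growth doesn't skew the result.
public class SerializationTest {
    public static void main(String[] args) throws Exception {
        Random rnd = new Random();
        double[][] data = new double[400][];   // 400 * 65536 doubles * 8 bytes ≈ 200 MB
        for (int i = 0; i < data.length; i++) {
            data[i] = new double[65536];
            for (int j = 0; j < data[i].length; j++) {
                data[i][j] = rnd.nextDouble();
            }
        }

        long start = System.currentTimeMillis();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(data);
        out.close();
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Serialized " + (bytes.size() / (1024 * 1024))
                + " MB in " + elapsed + " ms");
    }
}
```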

Answered Sep 19 '22 by Ztranger


Before you make a decision, it's probably worth understanding what is going on with your JVM as well as your physical system resources.

There are several factors that could be at play here:

  • JVM heap size
  • garbage collection algorithms
  • how much physical memory you have
  • how you load the data - is it from a file that is fragmented all over the disk?
  • do you even need to load all of the data at once - can it be done in batches?
  • if you are doing it in batches, you can vary the batch size and see what happens
  • if your system has multiple cores, perhaps you could look at using more than one thread at a time to process/load the data (see the sketch after this list)
  • if you are already using multiple cores and disk I/O is the bottleneck, perhaps you could try loading from different disks at the same time
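
To illustrate the batching and multi-core points, here is a rough sketch of parsing the input files on a thread pool; parseFile() is just a placeholder for the real per-file parsing code, and if disk I/O rather than CPU is the bottleneck, extra threads won't buy you much.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the "batches + multiple cores" idea: parse files on a fixed-size
// thread pool instead of one at a time. parseFile() is a placeholder.
public class ParallelLoader {

    public static List<Object> loadAll(List<File> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Object>> pending = new ArrayList<Future<Object>>();
        for (final File file : files) {
            pending.add(pool.submit(new Callable<Object>() {
                public Object call() throws Exception {
                    return parseFile(file);   // real per-file parsing goes here
                }
            }));
        }
        List<Object> results = new ArrayList<Object>();
        for (Future<Object> f : pending) {
            results.add(f.get());             // waits for each file to finish parsing
        }
        pool.shutdown();
        return results;
    }

    private static Object parseFile(File file) {
        // Placeholder: read and parse one raw movement file here.
        return file.getName();
    }
}
```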

You should also look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you aren't familiar with the settings for the VM.

Answered Sep 20 '22 by anger