 

How to manipulate *huge* amounts of data

I'm having the following problem: I need to store a huge amount of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what the best way to do that is (combination of programming language + OS + whatever else you think is important).

The structure of the information is a 4D array (N×N×N×N) of double-precision floats (8 bytes each). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so it's no solution at all!
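A minimal sketch of one alternative, assuming Python with NumPy (the filename is a placeholder): a single memory-mapped file lets the OS page pieces of the array in and out of RAM on demand, instead of managing many separate slice files by hand.

```python
import numpy as np

N = 256  # N**4 * 8 bytes = 32 GiB on disk

# One file holds the whole 4D array; the OS pages pieces
# in and out of RAM on demand.
a = np.memmap("array.dat", dtype=np.float64, mode="w+",
              shape=(N, N, N, N))

a[0, 1, 2, 3] = 42.0   # ordinary indexing, backed by the file
plane = a[0, 1]        # a 2D slice, read lazily
a.flush()              # push pending writes to disk
```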

I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application that takes advantage of it (I'm not a professional programmer, so any book/reference would help me a lot).

An alternative solution I'm considering is buying a dedicated server with lots of RAM, but I don't know for sure whether that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.

What would you do if you were in this situation? I'm open to any idea.

Thanks in advance!


EDIT: Sorry for not providing enough information; I'll try to be more specific.

I'm storing a discretized 4D mathematical function. The operations I would like to perform include transposition of the array (changing b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
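Both operations map onto standard array primitives. A minimal sketch with NumPy (small N for illustration; the einsum index pattern is only one possible reading of "array multiplication"):

```python
import numpy as np

N = 8  # small for illustration; the real case is N = 256
a = np.random.rand(N, N, N, N)

# b[i,j,k,l] = a[j,i,k,l]: swap the first two axes.
# swapaxes returns a view, so no full copy is made.
b = a.swapaxes(0, 1)

# Elementwise product of two equal-shape 4D arrays:
c = a * b

# A contraction over two shared indices, one possible meaning
# of "array multiplication"; adjust the indices to your needs:
d = np.einsum("ijkl,klmn->ijmn", a, b)
```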

As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained, it won't be necessary to perform more operations on the data.


EDIT (2):

I would also like to be able to store more information in the future, so the solution should be scalable. The current 32 GB goal is because I want the array to have N = 256 points, but it would be better if I could use N = 512 (which means 512 GB to store it!).
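The scaling works out as N⁴ × 8 bytes, which is exactly where both figures come from:

```python
# Storage for an N x N x N x N array of 8-byte doubles:
for N in (256, 512):
    print(f"N = {N}: {N**4 * 8 / 2**30:.0f} GiB")
# N = 256: 32 GiB
# N = 512: 512 GiB
```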

— Alejandro Cámara, asked Apr 13 '10

2 Answers

Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.

— Brendan Long, answered Sep 27 '22

Any decent answer will depend on how you need to access the data. Random access? Sequential access?

32 GB is not really that huge.

How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.

What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
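As a sketch of that chunking idea (assuming the memory-mapped layout from earlier; "array.dat" must already exist, and the per-block operation is a placeholder):

```python
import numpy as np

N = 256
a = np.memmap("array.dat", dtype=np.float64, mode="r",
              shape=(N, N, N, N))
out = np.memmap("result.dat", dtype=np.float64, mode="w+",
                shape=(N, N, N, N))

# Each 3D block a[i] is N**3 * 8 bytes = 128 MiB, so only a
# small window of the 32 GiB is in RAM at any time. Since the
# blocks are independent, they could also go to parallel workers.
for i in range(N):
    block = np.asarray(a[i])  # pull one block into memory
    out[i] = block * 2.0      # placeholder per-block operation
out.flush()
```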

Most computers you buy these days have enough RAM to hold your 32 GB in memory. You won't need a supercomputer just for that.

— Daren Thomas, answered Sep 27 '22