 

Data structure to store huge amount of data?

In my application, I have to load volume data from a set of images (MRC images) and keep the pixel data in memory (the images are grayscale, so one byte per pixel).

My development environment is the Qt framework, with MinGW on Windows and GCC on Linux.

At the moment I use a simple data structure to store the volume data:

unsigned char *volumeData;

and do one huge allocation as follows:

volumeData=new unsigned char[imageXsize * imageYsize * numofImages];
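One caveat with this line: if the three dimensions are plain ints, the product itself overflows 32-bit integer arithmetic before the allocation is even attempted. A minimal sketch of the same allocation with the size computed in std::size_t (the function name is mine, not from the original code):

```cpp
#include <cstddef>
#include <new>

// Casting the first factor to std::size_t forces the whole product into
// size_t arithmetic (64-bit on a 64-bit build); as three ints, the
// 2000*2000*1000 case would overflow before operator new ever runs.
unsigned char* allocateVolume(int imageXsize, int imageYsize, int numofImages)
{
    const std::size_t totalBytes =
        static_cast<std::size_t>(imageXsize) * imageYsize * numofImages;
    return new (std::nothrow) unsigned char[totalBytes];  // nullptr on failure
}
```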

The following are the important methods for accessing image data in a given plane:

unsigned char* getXYPlaneSlice(int z_value);
unsigned char* getYZPlaneSlice(int x_value);
unsigned char* getZXPlaneSlice(int y_value);

With my simple data structure it was easy to implement the above methods.
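For reference, a minimal sketch of how those three accessors can work over the flat layout above (the class shape and the scratch buffer for the non-contiguous slices are my assumptions; only the method names come from the question):

```cpp
#include <cstddef>
#include <vector>

// Flat, z-major layout: index = (z * ySize + y) * xSize + x.
class Volume {
public:
    Volume(int xs, int ys, int zs)
        : xSize(xs), ySize(ys), zSize(zs),
          data(static_cast<std::size_t>(xs) * ys * zs) {}

    // XY planes are contiguous, so this is just a pointer offset.
    unsigned char* getXYPlaneSlice(int z) {
        return &data[static_cast<std::size_t>(z) * ySize * xSize];
    }

    // YZ planes are strided; gather them into a scratch buffer.
    unsigned char* getYZPlaneSlice(int x) {
        scratch.resize(static_cast<std::size_t>(ySize) * zSize);
        for (int z = 0; z < zSize; ++z)
            for (int y = 0; y < ySize; ++y)
                scratch[static_cast<std::size_t>(z) * ySize + y] =
                    data[(static_cast<std::size_t>(z) * ySize + y) * xSize + x];
        return scratch.data();
    }

    // ZX planes: same idea, striding over y.
    unsigned char* getZXPlaneSlice(int y) {
        scratch.resize(static_cast<std::size_t>(zSize) * xSize);
        for (int z = 0; z < zSize; ++z)
            for (int x = 0; x < xSize; ++x)
                scratch[static_cast<std::size_t>(z) * xSize + x] =
                    data[(static_cast<std::size_t>(z) * ySize + y) * xSize + x];
        return scratch.data();
    }

    int xSize, ySize, zSize;
    std::vector<unsigned char> data;
private:
    std::vector<unsigned char> scratch;
};
```

Note that only the XY slice comes back for free; the other two cost a copy per call, which matters once the volume gets large.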

But we might need to support volume sizes up to 2000x2000x1000 (~3.7 GB) in the future, and the current data structure will not be able to handle data that large.

  1. How do I avoid fragmentation? Even now, with 1000x1000x200 data, the application crashes with bad_alloc. What is the best way to change the data structure to handle this? Should I use something like a linked list in which each chunk is around 100 MB?

  2. Also, the user should be able to perform image-processing filters on the volume data and then reset it to the original pixel values. That means I have to keep two copies of the volume data. With the current implementation that looks like:

    unsigned char *volumeDataOriginal;

    unsigned char *volumeDataCurrent;

So with the 2000x2000x1000 data range it is going to use about 8 GB (4 GB for each volume). But on Win32 the address space is only 4 GB. How do I tackle this? Should I move to a 64-bit application?
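On question 1 above: one common alternative to a single huge allocation is to slice the volume into fixed-size z-slabs, so no individual allocation is more than a few tens of megabytes. A sketch under assumed names (not the only possible layout):

```cpp
#include <cstddef>
#include <vector>

// Store the volume as one allocation per slab of XY planes instead of
// one contiguous block, so no single allocation is multi-gigabyte.
// slabDepth (images per slab) is illustrative; tune it to taste.
class ChunkedVolume {
public:
    ChunkedVolume(int xs, int ys, int zs, int slabDepth = 16)
        : xSize(xs), ySize(ys), zSize(zs), depth(slabDepth)
    {
        const std::size_t planeBytes = static_cast<std::size_t>(xs) * ys;
        for (int z0 = 0; z0 < zs; z0 += depth) {
            int d = (z0 + depth <= zs) ? depth : zs - z0;
            slabs.emplace_back(planeBytes * d);  // one modest allocation each
        }
    }

    unsigned char& at(int x, int y, int z) {
        const std::size_t plane = static_cast<std::size_t>(xSize) * ySize;
        return slabs[z / depth][static_cast<std::size_t>(z % depth) * plane
                                + static_cast<std::size_t>(y) * xSize + x];
    }

    // XY planes never straddle a slab, so they still come back contiguous.
    unsigned char* getXYPlaneSlice(int z) { return &at(0, 0, z); }

private:
    int xSize, ySize, zSize, depth;
    std::vector<std::vector<unsigned char>> slabs;
};
```

Because each XY plane lives entirely inside one slab, getXYPlaneSlice keeps its cheap pointer-return form; only the YZ/ZX accessors need slab-aware indexing.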

EDIT: Here is a snapshot of my application: [screenshot]

Basically, I load the volume data (from a set of images, in MRC format, etc.) and display it in different plane viewers (XY, YZ, ZX; the screenshot shows the XY plane viewer). I need the above three data-access methods to show an image in a particular plane. Using a slider bar, the user can change which image is shown in the selected plane.

Thanks in advance.

Asked Nov 09 '10 by Ashika Umanga Umagiliya




2 Answers

I think you should take a look at HDF5. This is a binary format for storing huge amounts of data collected from things like telescopes, physics experiments, and gene-sequencing machines. The benefits of using something like this are many, but three immediate ones are: (1) it is well tested, (2) it supports hyperslab selection, and (3) you get compression for free.

There are C/C++, Java, Python, and MATLAB libraries available.

Answered Nov 07 '22 by jimmyb


The simplest solution to your problem would be to use a 64-bit address space: modern Macs support this out of the box, and on Windows and Linux you will need to install the 64-bit version of the OS. I believe Qt can be used to build 64-bit apps quite nicely. 32-bit systems won't be able to support single allocations of the size you're talking about; even a Mac with 4 GB of address space available to applications won't be able to make a single 3.7 GB allocation, as there will not be a contiguous free region of that size.

For undo I would look at using memory-mapped files and copy-on-write to copy the block:

http://en.wikipedia.org/wiki/Copy-on-write

This means you don't actually have to copy all the original data; the system will make copies of pages as they are written to. This will greatly aid performance if your images are significantly bigger than real memory and you're not changing every part of the image. It looks like boost::iostreams::mapped_file opened in its private (copy-on-write) mode might be helpful for this.

If you really, really need to support 32-bit systems, your only alternative is to break those big blocks down somehow, typically into planes or sub-volumes. Both are horrible to work with when it comes to applying 3D filters etc., though, so I really would avoid this if you can.

If you do go the sub-volume route, one trick is to save all the sub-volumes in memory-mapped files, and map them into your address space only when you need them. When unmapped from the address space they should stay around in the unified buffer cache until purged, effectively this means you can utilise more RAM than you have address space (particularly on Windows where 32-bit applications only get 2 GB of address space by default).

Finally, on 32-bit Windows you can also look at the /3GB switch in boot.ini. This allows applications to use 3 GB of address space rather than the normal 2 GB. From the problem you describe I don't think this will give you enough address space, but it may help you with some smaller volumes. Note that the /3GB switch can cause problems with some drivers, as it reduces the amount of address space available to the kernel.

Answered Nov 07 '22 by Bids