Optimize read/write huge data (C++)

I am looking to optimize reading/writing huge data for a C++ simulation application. The data, termed a "map", essentially consists of integers, doubles, floats and a single enum. The majority of this map data is fixed in size, but a small part of it can vary in size (from a few bytes to several KB). Several million such maps are computed once at the start of the application and then stored in a single binary file that is parsed at each simulation time-step.

Since there are a few million maps, parsing this binary file is quite slow, with fseek and fread being the major bottlenecks. I am looking for an alternative approach to doing the same.

Any pointers?

Vidya asked Mar 12 '09 22:03


6 Answers

Since you do not mention which OS you are running this on, have you looked at memory-mapping the file and then using standard memory routines to "walk" the file as you go along?

This way you are not using fseek/fread; instead you are using pointer arithmetic. Here is an mmap example that copies one file from a source file to a destination file. This may improve the performance.
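To make the idea concrete, here is a minimal sketch of walking a file of fixed-size records through a read-only mapping, assuming a POSIX system (Linux/macOS); the `Record` layout and `sum_values` helper are illustrative, not from the original application:

```cpp
// Walk a binary file of fixed-size records via mmap instead of fseek/fread.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

struct Record {          // hypothetical fixed-size map record
    int    id;
    double value;
};

// Maps the whole file read-only and sums Record::value over all records.
double sum_values(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0.0;
    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (base == MAP_FAILED) return 0.0;

    double total = 0.0;
    size_t n = st.st_size / sizeof(Record);
    const Record* rec = static_cast<const Record*>(base);
    for (size_t i = 0; i < n; ++i)
        total += rec[i].value;       // pointer arithmetic, no fseek/fread

    munmap(base, st.st_size);
    return total;
}
```

The OS pages the file in on demand, so a sequential walk like this also benefits from read-ahead.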

Another thing you could look into is splitting the file up into smaller files, using a hash value corresponding to the time unit to close one file and open the next as the simulation proceeds. That way you are dealing with smaller files that can be more aggressively cached by the host OS!

X-Istence answered Nov 15 '22 23:11


The effectiveness of this idea depends on your pattern of access, but if you are not looking at that variable size data each cycle, you might speed up access by rearranging your file structure:
Instead of writing a direct dump of a structure like this:

struct {
  int x;
  enum t;
  int sz;
  char variable_data[sz];  // pseudo-code: sz bytes of payload follow the header
};

you could write all the fixed size parts up front, then store the variable portions afterward:

struct {
  int x;
  enum t;
  int sz;
  long offset_to_variable_data;
};

Now, as you parse the file each cycle, you can linearly read N records at a time. You will only have to deal with fseek when you need to fetch the variable-sized data. You might even consider keeping that variable portion in a separate file so that you also only read forward through that file.
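A minimal sketch of that layout, with hypothetical names (`FixedPart`, `write_maps`, `read_fixed`) standing in for the real structures: all fixed parts are written contiguously up front, each carrying the absolute offset of its variable payload, so the per-cycle scan is one sequential read:

```cpp
// Fixed-size headers stored contiguously; variable payloads appended after
// them and located by offset.
#include <cstdio>
#include <cstdint>
#include <vector>
#include <string>

struct FixedPart {
    int      x;
    int32_t  sz;       // size of the variable payload in bytes
    int64_t  offset;   // absolute file offset of the payload
};

// Writes all fixed parts first, then the payloads; fills in sz and offset.
void write_maps(const char* path, std::vector<FixedPart> fixed,
                const std::vector<std::string>& payloads) {
    FILE* f = fopen(path, "wb");
    int64_t off = (int64_t)(sizeof(FixedPart) * fixed.size());
    for (size_t i = 0; i < fixed.size(); ++i) {
        fixed[i].sz = (int32_t)payloads[i].size();
        fixed[i].offset = off;
        off += fixed[i].sz;
    }
    fwrite(fixed.data(), sizeof(FixedPart), fixed.size(), f);
    for (const auto& p : payloads)
        fwrite(p.data(), 1, p.size(), f);
    fclose(f);
}

// Linear scan of the fixed region: one sequential read, no seeking.
std::vector<FixedPart> read_fixed(const char* path, size_t n) {
    std::vector<FixedPart> out(n);
    FILE* f = fopen(path, "rb");
    fread(out.data(), sizeof(FixedPart), n, f);
    fclose(f);
    return out;
}
```

Only when a cycle actually needs a payload do you pay for an fseek to its stored offset.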

This strategy may even improve your performance if you do go with a memory-mapped file as others suggested.

AShelly answered Nov 16 '22 00:11


You might consider using memory mapped files. For example see boost::interprocess as they provide a convenient implementation.

Also you might consider using STXXL, which provides STL-like functionality aimed at large file-based datasets.

And one more thing: if you want iterator-like access to your data, have a look at boost::iterator_facade.

If you don't want to play with fancy tricks, you could provide an additional binary file containing an index for the file of structures (i.e., the starting offset of each structure). This would give you indirect random access.
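A sketch of the side-index idea, with illustrative names: a second file holds one int64_t start offset per structure (plus a sentinel for the end), so fetching record k costs one small index read plus one fseek/fread into the data file:

```cpp
// Variable-length blobs in a data file, start offsets in a side index file.
#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>

// Writes blobs to data_path and their start offsets to idx_path.
void write_with_index(const char* data_path, const char* idx_path,
                      const std::vector<std::string>& blobs) {
    FILE* d = fopen(data_path, "wb");
    FILE* x = fopen(idx_path, "wb");
    int64_t off = 0;
    for (const auto& b : blobs) {
        fwrite(&off, sizeof off, 1, x);
        fwrite(b.data(), 1, b.size(), d);
        off += (int64_t)b.size();
    }
    fwrite(&off, sizeof off, 1, x);   // sentinel: end of the last blob
    fclose(d);
    fclose(x);
}

// Random access: read offsets k and k+1 from the index, then one fseek+fread.
std::string read_blob(const char* data_path, const char* idx_path, size_t k) {
    FILE* x = fopen(idx_path, "rb");
    int64_t off[2];
    fseek(x, (long)(k * sizeof(int64_t)), SEEK_SET);
    fread(off, sizeof(int64_t), 2, x);
    fclose(x);

    std::string out((size_t)(off[1] - off[0]), '\0');
    FILE* d = fopen(data_path, "rb");
    fseek(d, (long)off[0], SEEK_SET);
    fread(&out[0], 1, out.size(), d);
    fclose(d);
    return out;
}
```

The index for a few million maps is only a few tens of MB, so it can be read into memory once at startup to eliminate the index-file seeks entirely.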

Anonymous answered Nov 16 '22 00:11


Maybe not relevant in this case, but I managed to increase performance in an application with heavy file reads and writes by writing compressed data (zlib) and decompressing on the fly; the decrease in read/write time outweighed the increased CPU load.

Alternatively, if your problem is that the amount of data does not fit in memory and you want to use the disk as a cache, you can look into memcached, which provides a scalable and distributed memory cache.

small_duck answered Nov 16 '22 00:11


"millions" maps do not sound like a lot of data. What prevents you from keeping all data in memory?

Another option is to use some standard file format suitable for your needs, e.g., SQLite (use SQL to store/retrieve the data), some specialized format like HDF5, or define your own format using something like Google Protocol Buffers.

jfs answered Nov 16 '22 00:11


Use a memory-mapped file (http://en.wikipedia.org/wiki/Memory-mapped_file).

bayda answered Nov 16 '22 00:11