I am looking to optimize reading and writing of huge data sets for a C++ simulation application. The data, termed a "map", essentially consists of integers, doubles, floats and a single enum. Most of this map data is fixed in size, but a small part of it can vary (from a few to several KB) in size. Several such maps (typically millions) are computed once at the start of the application and then stored in a single binary file, which is parsed at each simulation time-step.
Since there are a few million maps, parsing this binary file is quite slow, with fseek and fread being the major bottlenecks. I am looking for an alternative approach.
Any pointers?
You do not mention which OS you are running this on, but have you looked at memory mapping the file and then using standard memory routines to "walk" the file as you go along?
This way you replace the fseek/fread calls with pointer arithmetic. Here is an mmap example that copies a source file to a destination file. This may improve performance.
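Separately from the copy example mentioned above, here is a minimal sketch of walking a mapped file with pointer arithmetic. It assumes a POSIX system (the question does not name an OS), and `FixedRecord` and `sum_values_mmap` are hypothetical names, since the real map layout is not given.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical fixed-size record; the real map layout is not given in the question.
struct FixedRecord {
    std::int32_t x;
    std::int32_t kind;
    double value;
};

// Maps the whole file read-only and sums `value` over every complete record.
// Returns false if the file cannot be opened, stat'ed or mapped.
bool sum_values_mmap(const char* path, double& total) {
    total = 0.0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // empty file: nothing to read

    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* base = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return false;

    const char* p = static_cast<const char*>(base);
    const char* end = p + len;
    while (static_cast<std::size_t>(end - p) >= sizeof(FixedRecord)) {
        FixedRecord rec;
        std::memcpy(&rec, p, sizeof rec);  // copying sidesteps alignment concerns
        total += rec.value;
        p += sizeof rec;                   // pointer arithmetic instead of fseek/fread
    }
    munmap(base, len);
    return true;
}
```

Whether this beats buffered fread depends on the access pattern and the OS page cache, so it is worth measuring on the real data.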
Another thing you could look into is splitting the file into smaller files and using a hash value corresponding to the time unit to decide when to close the current file and open the next one as the simulation continues. This way you deal with smaller files that the host OS can cache more aggressively.
The effectiveness of this idea depends on your pattern of access, but if you are not looking at that variable size data each cycle, you might speed up access by rearranging your file structure:
Instead of writing a direct dump of a structure like this:
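struct {
    int x;
    enum t;                   // the single enum field (type not given here)
    int sz;                   // size of variable_data in bytes
    char variable_data[sz];   // pseudocode: sz bytes stored inline after the header
};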
you could write all the fixed size parts up front, then store the variable portions afterward:
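struct {
    int x;
    enum t;                          // the single enum field (type not given here)
    int sz;                          // size of the variable data in bytes
    long offset_to_variable_data;    // where the variable data starts in the file
};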
Now, as you parse the file each cycle, you can linearly read N records at a time. You will only have to deal with fseek when you need to fetch the variable-sized data. You might even consider keeping that variable portion in a separate file so that you also only read forward through that file.
This strategy may even improve your performance if you do go with a memory-mapped file as others suggested.
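As a rough sketch of this layout, the code below keeps the fixed parts in one file and the variable portions in a second one. The field names follow the struct above; `Kind`, `append_map`, `read_fixed` and `read_variable` are hypothetical names, and error handling is kept minimal.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the enum in the map data.
enum class Kind : std::int32_t { A, B };

// Fixed-size part of one map.
struct FixedPart {
    std::int32_t x;
    Kind t;
    std::int32_t sz;                       // bytes of variable data
    std::int64_t offset_to_variable_data;  // position in the variable-data file
};

// Appends one map: the payload goes to var_file, the header to fixed_file.
bool append_map(std::FILE* fixed_file, std::FILE* var_file,
                std::int32_t x, Kind t, const std::string& payload) {
    long pos = std::ftell(var_file);
    if (pos < 0) return false;
    FixedPart h{x, t, static_cast<std::int32_t>(payload.size()), pos};
    if (!payload.empty() &&
        std::fwrite(payload.data(), 1, payload.size(), var_file) != payload.size())
        return false;
    return std::fwrite(&h, sizeof h, 1, fixed_file) == 1;
}

// Reads up to `count` fixed parts with one sequential fread and no seeking.
std::vector<FixedPart> read_fixed(std::FILE* fixed_file, std::size_t count) {
    std::vector<FixedPart> out(count);
    out.resize(std::fread(out.data(), sizeof(FixedPart), count, fixed_file));
    return out;
}

// Fetches the variable portion of one map; this is the only place that seeks.
bool read_variable(std::FILE* var_file, const FixedPart& h, std::string& payload) {
    payload.resize(static_cast<std::size_t>(h.sz));
    if (std::fseek(var_file, static_cast<long>(h.offset_to_variable_data), SEEK_SET) != 0)
        return false;
    return payload.empty() ||
           std::fread(&payload[0], 1, payload.size(), var_file) == payload.size();
}
```

Each time-step can then call `read_fixed` in large batches and touch the variable-data file only for the maps that actually need it.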
You might consider using memory-mapped files. For example, see boost::interprocess, which provides a convenient implementation.
You might also consider STXXL, which provides STL-like functionality aimed at large file-based datasets.
One more thing: if you want iterator-like access to your data, have a look at boost::iterator_facade.
If you don't want to play with the fancy tricks, you could provide an additional binary file containing an index for the file with the structures (that is, the starting offset of each structure). This would provide indirect random access.
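A minimal sketch of such an index, assuming each record on disk is a fixed header followed by `sz` payload bytes. `Header`, `build_index` and `read_header_at` are hypothetical names; the returned vector is what would be written to the separate index file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed on-disk record: this header, then `sz` payload bytes.
struct Header {
    std::int32_t x;
    std::int32_t t;   // enum stored as an integer
    std::int32_t sz;  // payload length in bytes
};

// Scans the data file once and records the starting offset of each record.
std::vector<std::int64_t> build_index(std::FILE* data) {
    std::vector<std::int64_t> offsets;
    std::rewind(data);
    Header h;
    for (;;) {
        long start = std::ftell(data);
        if (start < 0 || std::fread(&h, sizeof h, 1, data) != 1) break;
        offsets.push_back(start);
        if (std::fseek(data, h.sz, SEEK_CUR) != 0) break;  // skip the payload
    }
    return offsets;
}

// Jumps straight to record `i` via the index and reads its header.
bool read_header_at(std::FILE* data, const std::vector<std::int64_t>& offsets,
                    std::size_t i, Header& out) {
    if (i >= offsets.size()) return false;
    if (std::fseek(data, static_cast<long>(offsets[i]), SEEK_SET) != 0) return false;
    return std::fread(&out, sizeof out, 1, data) == 1;
}
```

Since the maps are computed once at start-up, the index could just as well be filled in while the data file is being written, which avoids the extra scan.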
Maybe not relevant in this case, but I managed to increase performance in an application with heavy file reads and writes by writing compressed data (zlib) and decompressing on the fly; the decreased read/write time outweighed the increased CPU load.
Alternatively, if your problem is that the amount of data does not fit in memory and you want to use the disk as a cache, you can look into memcached, which provides a scalable and distributed memory cache.
"Millions" of maps does not sound like a lot of data. What prevents you from keeping all the data in memory?
Another option is to use a standard file format suitable for your needs, e.g. sqlite (use SQL to store/retrieve data), a specialized format like hdf5, or your own format defined with something like Google Protocol Buffers.
Use a memory-mapped file (http://en.wikipedia.org/wiki/Memory-mapped_file).