 

Efficient way to write results to file during a computational experiment

I have a piece of C++ software that performs a set of experiments. Without storing the outcomes, all experiments take a little over a minute. The total amount of data generated is 2.5 GB, which is too large to keep in memory until the end of the experiment and write to file afterwards. Therefore I write it in chunks:

for (int i = 0; i < chunkSize; i++) {
    outfile << results_experiments[i] << endl;
}

where outfile is declared as ofstream outfile("data"); and is only closed at the end of the run.

However, when I write the data in chunks of 4700 kB (actually 4700 / chunkSize = the size of one results_experiments element), the experiments take about 50 times longer (over an hour...). This is unacceptable and makes my earlier optimization attempts look rather silly, especially since these experiments need to be performed with many different parameter settings etc. (at least 100 times, but preferably more).

Concretely, my questions are:

  • What would be the ideal chunk size to write at?

  • Is there a more efficient way than (or something very inefficient in) how I currently write the data?

Basically: help me make the file I/O overhead as small as possible.

I think it should be possible to do this a lot faster, since copying (writing & reading!) a file of the same size takes me under a minute.

The code should be fairly platform independent and not use any non-standard libraries. (I can provide separate versions for separate platforms and more complicated install instructions, but it is a hassle.) If it is not feasible to get the total experiment time under 5 minutes without platform/library dependencies (but possible with them), I will seriously consider introducing these. (The platform is Windows, but a trivial Linux port should at least be possible.)

Thank you for your effort.

asked by codelidoo

1 Answer

For starters, not flushing the buffer on every line seems like a good idea: std::endl flushes the stream each time it is used, so writing '\n' instead lets the stream's buffer do its job (first sketch below). It also seems possible to do the I/O asynchronously, as it is completely independent of the computation (second sketch). You can also use mmap to improve the performance of the file I/O (third sketch).
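
A minimal sketch of the first point, reusing the names results_experiments and chunkSize from the question (the double element type is an assumption, the question never says): replace endl with '\n' and, optionally, give the stream a larger buffer. Note that setbuf behaviour is implementation-defined; on common implementations pubsetbuf must be called before the file is opened.

#include <fstream>
#include <vector>

// Hypothetical element type; the question does not say what
// results_experiments holds.
void write_chunk(std::ofstream& outfile,
                 const std::vector<double>& results_experiments,
                 int chunkSize) {
    for (int i = 0; i < chunkSize; i++) {
        outfile << results_experiments[i] << '\n'; // '\n' does not flush
    }
}

int main() {
    // The buffer must outlive the stream, so declare it first.
    std::vector<char> buf(1 << 20);                   // 1 MiB stream buffer
    std::ofstream outfile;
    // Implementation-defined: commonly must be called before open().
    outfile.rdbuf()->pubsetbuf(buf.data(), buf.size());
    outfile.open("data");
    // ... run experiments, calling write_chunk once per chunk ...
}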
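For the asynchronous option, here is a sketch using only standard C++11 threads, which fits the no-extra-libraries requirement. Chunks are handed to a background writer thread through a mutex-protected queue; all names (AsyncWriter, push) are made up for illustration.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncWriter {
public:
    explicit AsyncWriter(const char* path)
        : out_(path), done_(false), worker_([this] { run(); }) {}

    ~AsyncWriter() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();                  // drain the queue before exiting
    }

    // Called from the compute thread: enqueue a pre-formatted chunk.
    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty() && done_) return;
            std::string chunk = std::move(q_.front());
            q_.pop();
            lk.unlock();
            out_ << chunk;               // disk I/O happens off the hot path
        }
    }

    std::ofstream out_;
    bool done_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    std::thread worker_;
};

The compute loop would format each chunk into a std::string (e.g. with std::ostringstream) and call push, so the disk writes overlap with the computation instead of stalling it.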
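Finally, a sketch of the mmap route. Be warned that this is POSIX-only (Windows would need CreateFileMapping/MapViewOfFile, which conflicts with the portability requirement), it assumes a 64-bit build so a 2.5 GB mapping fits in the address space, and it only works cleanly if the total output size is known (or can be over-estimated) up front. Error handling is omitted for brevity.

#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    // Over-estimate of the total output size (~2.5 GB from the question).
    const std::size_t total_size = 2500ull * 1024 * 1024;
    int fd = open("data", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, total_size);                    // pre-size the file
    char* p = static_cast<char*>(mmap(nullptr, total_size,
                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    std::size_t offset = 0;
    // Per chunk: format the results into a buffer, then append it with
    //   std::memcpy(p + offset, chunk, chunk_len); offset += chunk_len;
    // so each "write" is just a memcpy into the mapping.

    munmap(p, total_size);                        // flushes dirty pages
    ftruncate(fd, offset);                        // trim to the actual size
    close(fd);
}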

answered by pmr