How to speed up loading 15M integers from a file stream?

Tags: c++, stream, stl

I have a fixed-size array of 15M precomputed integers that I need to load at program start. Loading currently takes up to 2 minutes; the file size is ~130MB. Is there any way to speed up loading? I'm free to change the save process as well.

std::array<int, 15000000> keys;

std::string config = "config.dat";

// how array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
  std::ostream_iterator<int>(out, "\n"));

// load of array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
  std::istream_iterator<int>(), keys.begin());
in.close();

Thanks in advance.

SOLVED. I used the approach proposed in the accepted answer. Now loading takes just a blink.

Thanks all for your insights.

drumsta asked Aug 05 '10 13:08


2 Answers

You have two issues regarding the speed of your write and read operations.

First, std::copy cannot apply a block-copy optimization when writing through an output iterator, because the iterator gives it no direct access to the underlying target.

Second, you're writing the integers as ASCII text rather than binary, so on every write the output iterator formats an int as text, and on every read the text has to be parsed back into an integer. I believe this accounts for the bulk of your performance problem.

The raw storage of your array (assuming a 4-byte int) should be only 60MB. In ASCII each character of an integer takes 1 byte and each value is followed by a 1-byte newline, so any int with more than 3 characters takes more space than its binary form, hence your 130MB file.
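To make the size arithmetic concrete, here is a minimal sketch that counts the bytes each format needs. The function names text_size and binary_size are made up for illustration, and the text format is assumed to be decimal digits plus a single '\n', as in the question's save code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Bytes used when each value is written as decimal text followed by '\n',
// which is what std::ostream_iterator<int>(out, "\n") produces.
std::size_t text_size(const std::vector<int>& values) {
    std::size_t bytes = 0;
    for (int v : values) {
        bytes += std::to_string(v).size() + 1;  // digits (and sign) + newline
    }
    return bytes;
}

// Bytes used when the same values are dumped as raw ints.
std::size_t binary_size(const std::vector<int>& values) {
    return values.size() * sizeof(int);
}
```

For 15M values averaging 7 to 8 digits, the text form comes to roughly 8 to 9 bytes per value, which is in line with the ~130MB file, against 4 bytes per value in binary.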

There is no easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size), or while still using std::copy. The easiest way is to dump the whole array to disk and read it back using fstream's write and read; just remember that this is not strictly portable.

To write:

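std::fstream out(config.c_str(), std::ios::out | std::ios::binary);
// write() takes a const char*, so the int buffer has to be cast
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );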

And to read:

std::fstream in(config.c_str(), std::ios::in | std::ios::binary);
// read() takes a char*, so the int buffer has to be cast
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
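Put together as a self-contained sketch, the round trip might look like the following. The helper names save_binary and load_binary are hypothetical, and a std::vector is assumed instead of std::array because a 60MB array declared as a local variable would likely exceed typical default stack limits.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write the values as one raw block of bytes. Returns false on I/O failure.
bool save_binary(const std::string& path, const std::vector<int>& keys) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(keys.data()),
              static_cast<std::streamsize>(keys.size() * sizeof(int)));
    out.flush();  // surface write errors before reporting success
    return static_cast<bool>(out);
}

// Read exactly `count` ints back into `keys`.
// Returns false if the file is missing or holds fewer than `count` ints.
bool load_binary(const std::string& path, std::size_t count,
                 std::vector<int>& keys) {
    std::ifstream in(path, std::ios::binary);
    keys.resize(count);
    in.read(reinterpret_cast<char*>(keys.data()),
            static_cast<std::streamsize>(count * sizeof(int)));
    return static_cast<bool>(in);
}
```

As with the snippets above, the file is only valid on machines with the same int size and byte order as the one that wrote it.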

----Update----

If you are really concerned about portability, you could ship a portable format (like your initial ASCII version) in your distribution artifacts. When the program is first run, it could convert that portable format to a locally optimized binary version for use during subsequent executions.

Something like this perhaps:

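std::array<int, 15000000> keys;

// data.txt holds the ASCII values and data.bin is the binary version.
// file_exists is a placeholder for whatever existence check you use
// (for example std::filesystem::exists in C++17).
if(!file_exists("data.bin")) {
    // first run: parse the portable text file...
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
         std::istream_iterator<int>(), keys.begin());
    in.close();

    // ...and cache it as a raw binary dump for later runs
    std::fstream out("data.bin", std::ios::out | std::ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    // later runs: read the cached binary dump in one call
    std::fstream in("data.bin", std::ios::in | std::ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}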

If you have an install process, this preprocessing could also be done at that time.
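On the endianness and int-size caveat, one alternative to keeping a text master copy is to fix the on-disk layout yourself. The sketch below is not part of the approach above; it assumes every value fits in 32 bits and uses the made-up helper names encode_le32 and decode_le32. The resulting byte buffer can still be written and read in a single call.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode each value as 4 bytes, least significant byte first,
// regardless of the host's native byte order.
std::vector<unsigned char> encode_le32(const std::vector<std::int32_t>& values) {
    std::vector<unsigned char> bytes;
    bytes.reserve(values.size() * 4);
    for (std::int32_t v : values) {
        const std::uint32_t u = static_cast<std::uint32_t>(v);
        for (int shift = 0; shift < 32; shift += 8) {
            bytes.push_back(static_cast<unsigned char>((u >> shift) & 0xFFu));
        }
    }
    return bytes;
}

// Decode the byte sequence produced by encode_le32.
// Trailing bytes that do not form a full 4-byte group are ignored.
std::vector<std::int32_t> decode_le32(const std::vector<unsigned char>& bytes) {
    std::vector<std::int32_t> values;
    values.reserve(bytes.size() / 4);
    for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
        std::uint32_t u = 0;
        for (int b = 0; b < 4; ++b) {
            u |= static_cast<std::uint32_t>(bytes[i + b]) << (8 * b);
        }
        // Assumes two's-complement conversion (guaranteed since C++20).
        values.push_back(static_cast<std::int32_t>(u));
    }
    return values;
}
```

The per-value loop costs more than a raw dump, but it avoids text formatting and parsing, which is where the original version spends its time.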

joshperry answered Oct 01 '22 03:10


If the integers are saved in binary format and you're not concerned about endianness problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
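A minimal sketch of the fread idea, assuming the file was written as raw ints on the same kind of machine. The function name load_with_fread is made up, and instead of casting a byte buffer to int * it reads straight into int storage, which sidesteps alignment questions.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Read up to `count` ints with a single fread call.
// Returns how many ints were actually read; `keys` is resized to match.
std::size_t load_with_fread(const char* path, std::vector<int>& keys,
                            std::size_t count) {
    keys.assign(count, 0);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) {
        keys.clear();
        return 0;
    }
    const std::size_t got = std::fread(keys.data(), sizeof(int), count, f);
    std::fclose(f);
    keys.resize(got);  // drop slots that were not filled
    return got;
}
```

A return value smaller than the expected count means the file was missing or truncated, so the caller can compare it against 15000000 before using the data.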

Steven A. Lowe answered Oct 01 '22 02:10