ASCII data import: how can I match Fortran's bulk read performance in C++?

Question

The setup

Hello, I have Fortran code for reading in ASCII double precision data (example of data file at bottom of question):

program ReadData
    integer :: mx,my,mz
    doubleprecision, allocatable, dimension(:,:,:) :: charge

    ! Open the file 'CHGCAR'
    open(11,file='CHGCAR',status='old')

    ! Get the extent of the 3D system and allocate the 3D array
    read(11,*)mx,my,mz
    allocate(charge(mx,my,mz) )

    ! Bulk read the entire block of ASCII data for the system
    read(11,*) charge
end program ReadData

and the "equivalent" C++ code:

#include <fstream>
#include <vector>

using std::ifstream;
using std::vector;
using std::ios;

int main(){
    int mx, my, mz;

    // Open the file 'CHGCAR'
    ifstream InFile('CHGCAR', ios::in);

    // Get the extent of the 3D system and allocate the 3D array
    InFile >> mx >> my >> mz;
    vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));

    // Method 1: std::ifstream extraction operator to double
    for (int i = 0; i < mx; ++i)
        for (int j = 0; j < my; ++j)
            for (int k = 0; k < mz; ++k)
                InFile >> charge[i][j][k];

    return 0;
}

Fortran kicking @$$ and taking names

Note that the line

read(11,*) charge

performs the same task as the C++ code:

for (int i = 0; i < mx; ++i)
    for (int j = 0; j < my; ++j)
        for (int k = 0; k < mz; ++k)
            InFile >> charge[i][j][k];

where InFile is an if stream object (note that while iterators in the Fortran code start at 1 and not 0, the range is the same).

However, the Fortran code runs way, way faster than the C++ code, I think because Fortran does something clever like reading/parsing the file according to the range and shape (values of mx, my, mz) all in one go, and then simply pointing charge to the memory the data was read to. The C++ code, by comparison, needs to access InFile and then charge (which is typically large) back and forth with each iteration, resulting in (I believe) many more IO and memory operations.

I'm reading in potentially billions of of values (several gigabytes), so I really want to maximize performance.

My question:

How can I achieve the performance of the Fortran code in C++?

Moving on...

Here is a much faster (than the above C++) C++ implementation, where the file is read in one go into a char array, and then charge is populated as the char array is parsed:

#include <fstream>
#include <vector>
#include <cstdlib>

using std::ifstream;
using std::vector;
using std::ios;

int main(){
    int mx, my, mz;

    // Open the file 'CHGCAR'
    ifstream InFile('CHGCAR', ios::in);

    // Get the extent of the 3D system and allocate the 3D array
    InFile >> mx >> my >> mz;
    vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));

    // Method 2: big char array with strtok() and atof()

    //  Get file size
    InFile.seekg(0, InFile.end);
    int FileSize = InFile.tellg();
    InFile.seekg(0, InFile.beg);

    //  Read in entire file to FileData
    vector<char> FileData(FileSize);
    InFile.read(FileData.data(), FileSize);
    InFile.close();

    /*
     *  Now simply parse through the char array, saving each
     *  value to its place in the array of charge density
     */
    char* TmpCStr = strtok(FileData.data(), " 
");

    // Gets TmpCStr to the first data value
    for (int i = 0; i < 3 && TmpCStr != NULL; ++i)
        TmpCStr = strtok(NULL, " 
");

    for (int i = 0; i < Mz; ++i)
        for (int j = 0; j < My; ++j)
            for (int k = 0; k < Mx && TmpCStr != NULL; ++k){
                Charge[i][j][k] = atof(TmpCStr);
                TmpCStr = strtok(NULL, " 
");
            }

    return 0;
}

Again, this is much faster than the simple >> operator-based method, but still considerably slower than the Fortran version--not to mention much more code.

How to get better performance?

I'm sure that method 2 is the way to go if I am to implement it myself, but I'm curious how I can increase performance to match the Fortran code. The types of things I'm considering and currently researching are:

C++11 and C++14 features
Optimized C or C++ library for doing just this type of thing
Improvements on the individual methods being used in method 2
- tokenization library such as that in the C++ String Toolkit Library instead of strtok()
- more efficient char to double conversion than atof()

C++ String Toolkit

In particular, the C++ String Toolkit Library will take FileData and the delimiters " " and give me a string token object (call it FileTokens, then the triple for loop would look like

for (int k = 0; k < Mz; ++k)
    for (int j = 0; j < My; ++j)
        for (int i = 0; i < Mx; ++i)
            Charge[k][j][i] = FileTokens.nextFloatToken();

This would simplify the code slightly, but there is extra work in copying (in essence) the contents of FileData into FileTokens, which might kill any performance gains from using the nextFloatToken() method (presumedly more efficient than the strtok()/atof() combination).

There is an example on the C++ String Toolkit (StrTk) Tokenizer tutorial page (included at the bottom of the question) using StrTk's for_each_line() processor that looks to be similar to my desired application. A difference between the cases, however, is that I cannot assume how many data will appear on each line of the input file, and I do not know enough about StrTk to say if this is a viable solution.

NOT A DUPLICATE

The topic of fast reading of ASCII data to an array or struct has come up before, but I have reviewed the following posts and their solutions were not sufficient:

Fastest way to read data from a lot of ASCII files
How to read numbers from an ASCII file (C++)
Read Numeric Data from a Text File in C++
Reading a file and storing the contents in an array
C/C++ Fast reading large ASCII data file to array or struct
Read ASCII file into matrix in C++
How can I read ASCII data file in C++
Reading a file and storing the contents in an array
Reading in data in columns from a file (C++)
The Fastest way to read a .txt File
How does fast input/ output work in C/C++, by using registers, hexadecimal number and the likes?
reading file into struct array

Example data

Here is an example of the data file I'm importing. The ASCII data is delimited by spaces and line breaks like the below example:

 5 3 3
 0.23080516813E+04 0.22712439791E+04 0.21616898980E+04 0.19829996749E+04 0.17438686650E+04
 0.14601734127E+04 0.11551623512E+04 0.85678544224E+03 0.59238325489E+03 0.38232265554E+03
 0.23514479113E+03 0.14651943589E+03 0.10252743482E+03 0.85927499703E+02 0.86525872161E+02
 0.10141182750E+03 0.13113419142E+03 0.18057147781E+03 0.25973252462E+03 0.38303754418E+03
 0.57142097675E+03 0.85963728360E+03 0.12548019843E+04 0.17106124085E+04 0.21415379433E+04
 0.24687336309E+04 0.26588012477E+04 0.27189091499E+04 0.26588012477E+04 0.24687336309E+04
 0.21415379433E+04 0.17106124085E+04 0.12548019843E+04 0.85963728360E+03 0.57142097675E+03
 0.38303754418E+03 0.25973252462E+03 0.18057147781E+03 0.13113419142E+03 0.10141182750E+03
 0.86525872161E+02 0.85927499703E+02 0.10252743482E+03 0.14651943589E+03 0.23514479113E+03

StrTk example

Here is the StrTk example mentioned above. The scenario is parsing the data file that contains the information for a 3D mesh:

input data:

5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4

code:

struct point
{
   double x,y,z;
};

struct triangle
{
   std::size_t i0,i1,i2;
};

int main()
{
   std::string mesh_file = "mesh.txt";
   std::ifstream stream(mesh_file.c_str());
   std::string s;
   // Process points section
   std::deque<point> points;
   point p;
   std::size_t point_count = 0;
   strtk::parse_line(stream," ",point_count);
   strtk::for_each_line_n(stream,
                          point_count,
                          [&points,&p](const std::string& line)
                          {
                             if (strtk::parse(line,",",p.x,p.y,p.z))
                                points.push_back(p);
                          });

   // Process triangles section
   std::deque<triangle> triangles;
   triangle t;
   std::size_t triangle_count = 0;
   strtk::parse_line(stream," ",triangle_count);
   strtk::for_each_line_n(stream,
                          triangle_count,
                          [&triangles,&t](const std::string& line)
                          {
                             if (strtk::parse(line,",",t.i0,t.i1,t.i2))
                                triangles.push_back(t);
                          });
   return 0;
}

Tony Delroy · Accepted Answer

This...

vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));

...creates a temporary vector<double>(mz), with all 0.0 values, and copies it my times (or perhaps moves then copies my-1 times with a C++11 compiler, but little difference...) to create a temporary vector<vector<double>>(my, ...) which is then copied mx times (...as above...) to initialise all the data. You're reading data in over these elements anyway - there's no need to spend time initialising it here. Instead, create an empty charge and use nested loops to reserve() enough memory for the elements without populating them yet.

Next, check you're compiling with optimisation on. If you are and you're still slower than FORTRAN, in the data-populating nested loops try creating a reference to the vector you're about .emplace_back elements on to:

for (int i = 0; i < mx; ++i)
    for (int j = 0; j < my; ++j)
    {
        std::vector<double>& v = charge[i][j];
        for (int k = 0; k < mz; ++k)
        {
            double d;
            InFile >> d;
            v.emplace_pack(d);
        }
    }

That shouldn't help if your optimiser's done a good job, but is worth trying as a sanity check.

If you're still slower - or just want to try to be even faster - you could try optimising your number parsing: you say your data's all formatted ala 0.23080516813E+04 - with fixed sizes like that you can easily calculate how many bytes to read into a buffer to give you a decent number of values from memory, then for each you could start an atol after the . to extract 23080516813 then multiply it by 10 to the power of minus (11 (your number of digits) minus 04): for speed, keep a table of those powers of ten and index into it using the extracted exponent (i.e. 4). (Note multiplying by e.g. 1E-7 can be faster than dividing by 1E7 on a lot of common hardware.)

And if you want to blitz this thing, switch to using memory mapped file access. Worth considering boost::mapped_file_source as it's easier to use than even the POSIX API (let alone Windows), and portable, but programming directly against an OS API shouldn't be much of a struggle either.

UPDATE - response to first & second comments

Example of using boost memory mapping:

#include <boost/iostreams/device/mapped_file.hpp>

boost::mapped_file_params params("dbldat.in");
boost::mapped_file_source file(params);
file.open();
ASSERT(file.is_open());
const char* p = file.data();
const char* nl = strchr(p, '
');
std::istringstream iss(std::string(p, nl - p));
size_t x, y, z;
ASSERT(iss >> x >> y >> z);

The above maps a file into memory at address p, then parses the dimensions from the first line. Continue parsing the actual double representations from ++nl onwards. I mention an approach to that above, and you're concerned about the data format changing: you could add a version number to the file, so you can use optimised parsing until the version number changes then fall back on something generic for "unknown" file formats. As far as something generic goes, for in-memory representations using int chars_to_skip; double my_double; ASSERT(sscanf(ptr, "%f%n", &my_double, &chars_to_skip) == 1); is reasonable: see sscanf docs here - you can then advance the pointer through the data by chars_to_skip.

Next, are you suggesting to combine the reserve() solution with the reference creation solution?

Yes.

And (pardon my ignorance) why would using a reference to charge[i][j] and v.emplace_back() be better than charge[i][j].emplace_back()?

That suggestion was to sanity check that the compiler's not repeatedly evaluating charge[i][j] for each element being emplaced: hopefully it will make no performance difference and you can go back to the charge[i][j].emplace(), but IMHO it's worth a quick check.

Lastly, I'm skeptical about using an empty vector and reserve()ing at the tops of each loop. I have another program that came to a grinding halt using that method, and replacing the reserve()s with a preallocated multidimensional vector sped it up a lot.

That's possible, but not necessarily true in general or applicable here - a lot depends on the compiler / optimiser (particularly loop unrolling) etc.. With unoptimised emplace_back you're having to check vector size() against capacity() repeatedly, but if the optimiser does a good job that should be reduced to insignificance. As with a lot of performance tuning, you often can't reason about things perfectly and conclude what's going to be fastest, and will have to try alternatives and measure them with your actual compiler, program data etc..

ASCII data import: how can I match Fortran's bulk read performance in C++?

Tags:

c++

ascii

fortran

The setup

Fortran kicking @$$ and taking names

My question:

Moving on...

How to get better performance?

C++ String Toolkit

NOT A DUPLICATE

Example data

StrTk example

twilsonco

1 Answers

UPDATE - response to first & second comments

Tony Delroy

Recent Activity

Donate For Us

ASCII data import: how can I match Fortran's bulk read performance in C++?

Tags:

c++

ascii

fortran

The setup

Fortran kicking @$$ and taking names

My question:

Moving on...

How to get better performance?

C++ String Toolkit

NOT A DUPLICATE

Example data

StrTk example

twilsonco

1 Answers

UPDATE - response to first & second comments

Tony Delroy

Related questions

Recent Activity

Donate For Us