
Big CSV file C++ parsing performance

Tags: c++, io, csv, stl

I have a big CSV file (25 MB) that represents a symmetric graph (about 18k × 18k). While parsing it into an array of vectors, I analyzed the code (with the VS2012 analyzer), and it shows that the bulk of the parsing time (about 19 seconds total) is spent reading each character (getline → basic_string::operator+=), as shown in the profiler screenshot.

This leaves me frustrated: with simple buffered line reading and a tokenizer in Java, I can do the same thing in less than half a second.

My code uses only the STL:

int allColumns = initFirstRow(file,secondRow);
// secondRow has already been initialized with one value
int column = 1; // don't forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row ++){
    // don't do the last row
    for (; column < allColumns; column++){
        // don't do the last column
        getline(file,vertexString,','); 
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH){
            rows[row].add(column);
        }
    }
    // handle the last value in the row (no trailing comma)
    getline(file,vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH){
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file,rows[allColumns-1],allColumns);

initFirstRow and initLastRow basically do the same thing as the loop above, except that initFirstRow also counts the number of columns.

VertexSet is basically a vector of indexes (int). Each value read (separated by ',') is no more than 7 characters long (the values are between -1 and 1).
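For reference, a minimal sketch of what a VertexSet along those lines might look like (the actual class isn't shown here; only add appears in the code above, and the member name indexes is purely illustrative):

#include <vector>

struct VertexSet {
    std::vector<int> indexes;                        // column indexes whose value exceeded _TH
    void add(int column) { indexes.push_back(column); }
};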

asked Dec 28 '13 by squeezy


1 Answer

At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).

Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.

Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>

class my_ctype : public std::ctype<char> {
    // Build a classification table that marks ',' as whitespace. The table is a
    // function-local static so a valid pointer exists before the ctype<char>
    // base is constructed (a data member would be initialized too late).
    static mask const *make_table() {
        static std::vector<mask> my_table(classic_table(), classic_table() + table_size);
        my_table[','] = (mask)space;
        return my_table.data();
    }
public:
    my_ctype(size_t refs=0): std::ctype<char>(make_table(), false, refs) {}
};

template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
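    // the facet is constructed with refs == 0, so the locale built here takes ownership of it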
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
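        // clear the eof flag left over from the previous line, then append the
        // new line; extraction resumes from the current read position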
        buffer.clear();
        buffer<<in;
        return std::vector<T> {std::istream_iterator<T>(buffer),
            std::istream_iterator<T>()};        
    }
};

int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;

    std::string line;
    converter<double> cvt;

    clock_t start=clock();
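    // read one raw line at a time; cvt splits it into doubles via the comma-as-whitespace facet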
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop=clock();
    std::cout<<double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}

To test this, I generated a 1.8K x 1.8K CSV file of pseudo-random doubles like this:

#include <iostream>
#include <stdlib.h>

int main() {
    for (int i=0; i<1800; i++) {
        for (int j=0; j<1800; j++)
            std::cout<<rand()/double(RAND_MAX)<<",";
        std::cout << "\n";
    }
}

This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.

answered Oct 05 '22 by Jerry Coffin