
Big CSV file C++ parsing performance

Tags: c++, io, csv, stl

I have a big CSV file (25 MB) that represents a symmetric graph (about 18k × 18k). While parsing it into an array of vectors, I analyzed the code (with the VS2012 analyzer), and it shows that the bulk of the parsing time (about 19 seconds total) is spent reading each character (getline → basic_string::operator+=), as shown in the profiler screenshot.

This leaves me frustrated: with simple buffered line reading and a tokenizer in Java, I can do the same thing in less than half a second.

My code uses only the STL:

int allColumns = initFirstRow(file,secondRow);
// secondRow has already been initialized with one value
int column = 1; // don't forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row ++){
    // don't do the last row
    for (; column < allColumns; column++){
        // don't do the last column
        getline(file,vertexString,','); 
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH){
            rows[row].add(column);
        }
    }
    // handle the last value in the row (no trailing comma)
    getline(file,vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH){
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file,rows[allColumns-1],allColumns);

initFirstRow and initLastRow basically do the same thing as the loop above, except that initFirstRow also counts the number of columns.

VertexSet is basically a vector of indexes (int). Each value read (separated by ',') is no more than 7 characters long (the values are between -1 and 1).
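For reference, a minimal sketch of what a VertexSet along those lines might look like (the actual class isn't shown here; only add appears in the code above, and the member name indexes is purely illustrative):

#include <vector>

struct VertexSet {
    std::vector<int> indexes;                        // column indexes whose value exceeded _TH
    void add(int column) { indexes.push_back(column); }
};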

asked Dec 28 '13 by squeezy


1 Answer

At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).

Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.

Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>

class my_ctype : public std::ctype<char> {
    // Build a classification table that marks ',' as whitespace. The table is a
    // function-local static so a valid pointer exists before the ctype<char>
    // base is constructed (a data member would be initialized too late).
    static mask const *make_table() {
        static std::vector<mask> my_table(classic_table(), classic_table() + table_size);
        my_table[','] = (mask)space;
        return my_table.data();
    }
public:
    my_ctype(size_t refs=0): std::ctype<char>(make_table(), false, refs) {}
};

template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
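    // the facet is constructed with refs == 0, so the locale built here takes ownership of it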
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
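        // clear the eof flag left over from the previous line, then append the
        // new line; extraction resumes from the current read position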
        buffer.clear();
        buffer<<in;
        return std::vector<T> {std::istream_iterator<T>(buffer),
            std::istream_iterator<T>()};        
    }
};

int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;

    std::string line;
    converter<double> cvt;

    clock_t start=clock();
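    // read one raw line at a time; cvt splits it into doubles via the comma-as-whitespace facet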
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop=clock();
    std::cout<<double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}

To test this, I generated a 1.8K x 1.8K CSV file of pseudo-random doubles like this:

#include <iostream>
#include <stdlib.h>

int main() {
    for (int i=0; i<1800; i++) {
        for (int j=0; j<1800; j++)
            std::cout<<rand()/double(RAND_MAX)<<",";
        std::cout << "\n";
    }
}

This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.

answered Oct 05 '22 by Jerry Coffin