The fastest way to read a CSV file in C++ with a large number of columns and rows

I have a pipe-delimited data file with more than 13 columns. The total file size is above 100 MB. I am reading each row and splitting the string into a std::vector&lt;std::string&gt; so I can do calculations. I repeat this process for all the rows in the file, as below:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>
    using namespace std;

    int main() {
        string filename = "file.dat";
        fstream infile(filename);
        string line;
        while (getline(infile, line)) {
            string item;
            stringstream ss(line);
            vector<string> splittedString;
            while (getline(ss, item, '|')) {
                splittedString.push_back(item);
            }
            int a = stoi(splittedString[0]);
            // I do some processing like this before some manipulation and calculations with the data
        }
    }

This is, however, very time-consuming, and I am fairly sure it is not the most efficient way to read a CSV-style file. How can it be improved?

Update

I tried using boost::split instead of the inner while loop, but it was actually even slower.

Asked Jul 17 '19 by bcsta



2 Answers

You don't have a CSV file, because CSV stands for Comma-Separated Values, which is not what you have.
You have a delimited text file, apparently delimited by "|". Parsing real CSV is more complicated than simply splitting on ",", because fields may be quoted and contain embedded delimiters.
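For illustration, here is a minimal sketch of what quote-aware CSV splitting involves. splitCsvLine is a hypothetical helper, not part of either answer:

```cpp
#include <string>
#include <vector>

// Split one CSV line, honoring RFC 4180-style double quotes:
// a comma inside "..." does not end the field, and "" inside
// a quoted field is an escaped quote character.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    field += '"';      // escaped quote inside quotes
                    ++i;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else {
                field += c;
            }
        } else if (c == '"') {
            inQuotes = true;
        } else if (c == ',') {
            fields.push_back(field);
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);
    return fields;
}
```

A pipe-delimited file with no quoting, like the one in the question, doesn't need any of this, which is exactly why a plain split is enough there.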

Anyway, without too many dramatic changes to your approach, here are a few suggestions:

  • Use a larger stream buffer.
  • Move the vector out of the loop and clear() it each iteration; that saves heap reallocations.
  • Use string::find() instead of a stringstream to split the line.

Something like this...

#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string filename = "file.dat";
    fstream infile(filename);
    // Give the stream a bigger buffer to cut down on underlying reads.
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    string line;
    vector<string> splittedString;  // reused across iterations
    while (getline(infile, line)) {
        splittedString.clear();     // keeps the capacity, avoids reallocation
        size_t last = 0, pos = 0;
        while ((pos = line.find('|', last)) != string::npos) {
            splittedString.emplace_back(line, last, pos - last);
            last = pos + 1;
        }
        // Always push the tail, so a line without any '|'
        // still yields its single field.
        splittedString.emplace_back(line, last);
        int a = stoi(splittedString[0]);
        // I do some processing like this before some manipulation and calculations with the data
    }
}
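Taking the "use more buffering" idea further, the whole file can be pulled into memory with one bulk read and scanned in a single pass, avoiding per-line stream overhead entirely. A sketch, not from the answer; slurp and parseBuffer are our own illustrative names:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Read the whole file into one string with a single bulk read.
std::string slurp(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

// Walk the buffer once, splitting on '\n' and '|' without
// constructing a stringstream per line.
std::vector<std::vector<std::string>> parseBuffer(const std::string& buf) {
    std::vector<std::vector<std::string>> rows;
    size_t lineStart = 0;
    while (lineStart < buf.size()) {
        size_t lineEnd = buf.find('\n', lineStart);
        if (lineEnd == std::string::npos) lineEnd = buf.size();
        std::vector<std::string> row;
        size_t last = lineStart;
        while (last < lineEnd) {
            size_t pos = buf.find('|', last);
            if (pos == std::string::npos || pos > lineEnd) pos = lineEnd;
            row.emplace_back(buf, last, pos - last);
            last = pos + 1;
        }
        rows.push_back(std::move(row));
        lineStart = lineEnd + 1;
    }
    return rows;
}
```

For a 100 MB file this trades memory for speed; whether that is a good trade depends on how much RAM is available.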
Answered Nov 03 '22 by rustyx


You can save another ~50% by eliminating the vector&lt;string&gt; splittedString entirely and parsing in place with strtok_s():

#include <chrono>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
using namespace std::chrono;

int main() {
    auto t1 = high_resolution_clock::now();
    long long a(0);

    string filename = "file.txt";
    fstream infile(filename);
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    string line;
    while (getline(infile, line)) {
        // Tokenize the line's own buffer in place; no per-field copies.
        char *pch = const_cast<char*>(line.data());
        char *nextToken = NULL;
        pch = strtok_s(pch, "|", &nextToken);
        while (pch != NULL) {
            a += std::stoi(pch);
            pch = strtok_s(NULL, "|", &nextToken);
        }
    }

    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    std::cout << duration << "\n";
    std::cout << a << "\n";
}
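One caveat: the three-argument strtok_s used above is the Microsoft CRT variant, so this won't build as-is with glibc. A sketch of the same in-place idea using POSIX strtok_r, with std::from_chars (C++17) replacing stoi to skip locale and exception machinery; sumFields is our own illustrative name, and the fields are assumed to be integers as in the answer:

```cpp
#include <cassert>
#include <charconv>
#include <cstring>
#include <string>

// Sum the integer fields of one '|'-delimited line, in place.
// strtok_r writes '\0' over each delimiter, so the line is modified.
long long sumFields(std::string& line) {
    long long total = 0;
    char* saveptr = nullptr;
    // &line[0] gives a writable pointer to the string's buffer (C++11).
    for (char* tok = strtok_r(&line[0], "|", &saveptr);
         tok != nullptr;
         tok = strtok_r(nullptr, "|", &saveptr)) {
        int value = 0;
        // from_chars parses without allocating or throwing.
        std::from_chars(tok, tok + std::strlen(tok), value);
        total += value;
    }
    return total;
}
```

Because the tokenizer mutates the line buffer, this only works on a line you own a mutable copy of, which is exactly the situation after getline.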

Answered Nov 03 '22 by Vlad Feinstein