I have a pipe-delimited data file with more than 13 columns. The total file size is above 100 MB. I read each row and split the string into a std::vector<std::string> so I can do calculations on it. I repeat this process for every row in the file, like below:
string filename = "file.dat";
fstream infile(filename);
string line;
while (getline(infile, line)) {
    string item;
    stringstream ss(line);
    vector<string> splittedString;
    while (getline(ss, item, '|')) {
        splittedString.push_back(item);
    }
    int a = stoi(splittedString[0]);
    // I do some processing like this before some manipulation and calculations with the data
}
This is, however, very time consuming, and I am pretty sure it is not the most optimized way of reading a CSV-type file. How can this be improved? I tried using the boost::split function instead of a while loop, but it was actually even slower.
You don't have a CSV file, because CSV stands for Comma-Separated Values, which you don't have. You have a delimited text file (apparently delimited by '|'). Parsing real CSV is more complicated than simply splitting on ',', because fields can be quoted and contain the delimiter.
Anyway, without too many dramatic changes to your approach, here are a few suggestions:

- Move the vector out of the loop and clear() it in every iteration. That will save on heap reallocations.
- Use string::find() instead of stringstream to split the string.

Something like this...
#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main() {
    string filename = "file.dat";
    fstream infile;
    // Enlarge the stream buffer; set it BEFORE opening the file,
    // since pubsetbuf() after open is ignored on some implementations.
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    infile.open(filename);
    string line;
    vector<string> splittedString;
    while (getline(infile, line)) {
        splittedString.clear();
        size_t last = 0, pos = 0;
        while ((pos = line.find('|', last)) != string::npos) {
            splittedString.emplace_back(line, last, pos - last);
            last = pos + 1;
        }
        // Push the final field; without this, a line containing no '|'
        // would leave the vector empty and splittedString[0] invalid.
        if (!line.empty())
            splittedString.emplace_back(line, last);
        int a = stoi(splittedString[0]);
        // I do some processing like this before some manipulation and calculations with the data
    }
}
You can save another 50% by eliminating the vector entirely and parsing in place with strtok_s() (note this is a Microsoft/Annex K function; on POSIX the equivalent is strtok_r()):
#include <chrono>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;
using namespace std::chrono;

int main() {
    auto t1 = high_resolution_clock::now();
    long long a(0);
    string filename = "file.txt";
    fstream infile;
    char buffer[65536];
    infile.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    infile.open(filename);
    string line;
    while (getline(infile, line)) {
        // Tokenize the line's own buffer in place
        // (line.data() returns a writable char* since C++17).
        char* pch = line.data();
        char* nextToken = nullptr;
        pch = strtok_s(pch, "|", &nextToken);
        while (pch != nullptr) {
            a += stoi(pch);
            pch = strtok_s(nullptr, "|", &nextToken);
        }
    }
    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    cout << duration << "\n";
    cout << a << "\n";
}