I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:
ifstream infile("myfile.txt");
string line;
while (getline(infile, line)) {
    long long linepos = infile.tellg();  // position just past the current line
    process(line, linepos);
}
But it gives me only about 2 MB/s, even though the file manager copies the same file at around 100 MB/s. I guess that getline() is not buffering correctly. Please propose some sort of buffered line-by-line reading approach.
UPD: process() is not the bottleneck; the code runs at the same speed without it.
An alternative is plain C stdio: open the file with fopen() and store the returned FILE pointer, read it line by line with fgets() (fgetc(), fscanf(), and fread() cover other access patterns), and close it with fclose(). The stdio layer does its own buffering.
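A minimal sketch of that approach, assuming the question's file name, a fixed maximum line length (64 KB here, an arbitrary choice; fgets splits longer lines), and the line's starting offset tracked by accumulating lengths:

#include <cstdio>
#include <cstring>

int main() {
    FILE *f = fopen("myfile.txt", "rb"); // binary mode keeps offsets byte-exact
    if (!f) return 1;

    static char line[65536];             // assumed maximum line length
    long long pos = 0;                   // byte offset where the current line starts
    while (fgets(line, sizeof(line), f)) {
        size_t len = strlen(line);
        // process(line, pos) would go here, as in the question
        pos += len;
    }
    fclose(f);
    return 0;
}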
I've translated my own buffering code from my Java project and it does what I need. I had to add defines to work around the MSVC 2010 compiler's tellg, which always gives wrong negative values on huge files. This algorithm gives the desired ~100 MB/s, though it does some useless new[].
#include <fstream>
#include <string>
#include <cstdio>
#include <cstring>
#include <algorithm>
using namespace std;

#ifndef WIN32
typedef long long __int64;   // __int64 is a Microsoft extension
#endif

void readFileFast(ifstream &file, void (*lineHandler)(char *str, int length, __int64 absPos)) {
    int BUF_SIZE = 40000;

    // Determine the file size.
    file.seekg(0, ios::end);
    ifstream::pos_type p = file.tellg();
#ifdef WIN32
    // MSVC 2010 workaround: tellg() reports wrong negative values on huge
    // files, so pull the 64-bit offset straight out of the fpos object.
    __int64 fileSize = *(__int64*)(((char*)&p) + 8);
#else
    __int64 fileSize = p;
#endif
    file.seekg(0, ios::beg);

    BUF_SIZE = (int)min((__int64)BUF_SIZE, fileSize);
    char *buf = new char[BUF_SIZE];
    int bufLength = BUF_SIZE;
    file.read(buf, bufLength);

    int strEnd = -1;             // index of the newline that ended the last line
    int strStart;
    __int64 bufPosInFile = 0;    // absolute file offset of buf[0]
    while (bufLength > 0) {
        int i = strEnd + 1;
        strStart = strEnd;
        strEnd = -1;
        // Look for the next newline in the buffer.
        for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
            if (buf[i] == '\n') {
                strEnd = i;
                break;
            }
        }
        if (strEnd == -1) { // no newline found: scroll the buffer
            if (strStart == -1) {
                // The whole buffer is one piece of an overlong line: hand it
                // to the handler as-is, then refill the buffer.
                lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
                bufPosInFile += bufLength;
                bufLength = (int)min((__int64)bufLength, fileSize - bufPosInFile);
                delete[] buf;
                buf = new char[bufLength];
                file.read(buf, bufLength);
            } else {
                // Move the unterminated tail to the front and top the buffer up.
                int movedLength = bufLength - strStart - 1;
                memmove(buf, buf + strStart + 1, movedLength);
                bufPosInFile += strStart + 1;
                int readSize = (int)min((__int64)(bufLength - movedLength),
                                        fileSize - bufPosInFile - movedLength);
                if (readSize != 0)
                    file.read(buf + movedLength, readSize);
                if (movedLength + readSize < bufLength) {
                    // Near EOF: shrink the buffer to the data that is left.
                    char *tmpbuf = new char[movedLength + readSize];
                    memmove(tmpbuf, buf, movedLength + readSize);
                    delete[] buf;
                    buf = tmpbuf;
                    bufLength = movedLength + readSize;
                }
                strEnd = -1;
            }
        } else {
            // Complete line (terminating '\n' included) found in the buffer.
            lineHandler(buf + strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
        }
    }
    delete[] buf;
    lineHandler(0, 0, 0); // EOF sentinel
}
void lineHandler(char *buf, int l, __int64 pos) {
    if (buf == 0) return;        // EOF sentinel
    string s = string(buf, l);
    printf("%s", s.c_str());
}

void loadFile() {
    // Binary mode keeps the byte offsets consistent on Windows.
    ifstream infile("file", ios::in | ios::binary);
    readFileFast(infile, lineHandler);
}
You won't get anywhere close to line speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I did experiments on datafiles composed of two ints and a double per line (Ivy Bridge chip, SSD):

- IO streams: stream extraction (f >> i1 >> i2 >> d) is faster than a getline into a string followed by a stringstream parse.
- fscanf: about 40 MB/s.
- getline with no parsing: 180 MB/s.
- fread: 500-800 MB/s (depending on whether or not the file was cached by the OS).

I/O is not the bottleneck, parsing is. In other words, your process is likely your slow point.
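For reference, the fread end of that comparison boils down to a loop like this: a minimal sketch (file name and 1 MB buffer size are arbitrary choices) that streams the file in big chunks and only counts newlines, with no per-field parsing:

#include <cstdio>
#include <cstring>

long long countLines(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    static char buf[1 << 20];   // 1 MB chunks, an arbitrary choice
    long long lines = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        // Scan the chunk for newlines; no parsing at all.
        for (char *p = buf; (p = (char*)memchr(p, '\n', buf + n - p)) != 0; ++p)
            ++lines;
    }
    fclose(f);
    return lines;
}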
So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):

- fread large chunks (one such task at a time)
- parse the chunks (as many of these as you want; see the sketch below)

I can have unlimited parsing tasks because my data is unordered anyway. If yours isn't then this might not be worth it to you. This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
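A minimal sketch of that pipeline shape, assuming the oneTBB parallel_pipeline API; the chunk size, token count, and the trivial newline-counting parseChunk are placeholders, and a real version would first re-align chunk boundaries to newlines so no line is split between chunks:

#include <tbb/parallel_pipeline.h>
#include <atomic>
#include <cstdio>
#include <cstring>
#include <vector>

struct Chunk { std::vector<char> data; };

std::atomic<long long> lineCount(0);

// Placeholder parse: just counts newlines in the chunk.
void parseChunk(Chunk *c) {
    const char *p = c->data.data(), *end = p + c->data.size();
    while ((p = (const char*)memchr(p, '\n', end - p)) != 0) { ++lineCount; ++p; }
}

void parseFileParallel(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return;

    tbb::parallel_pipeline(/*max live tokens*/ 8,
        // Stage 1: serial fread of large chunks (one such task at a time).
        tbb::make_filter<void, Chunk*>(tbb::filter_mode::serial_in_order,
            [f](tbb::flow_control &fc) -> Chunk* {
                Chunk *c = new Chunk;
                c->data.resize(1 << 22); // 4 MB chunks, arbitrary
                size_t n = fread(c->data.data(), 1, c->data.size(), f);
                if (n == 0) { delete c; fc.stop(); return 0; }
                c->data.resize(n);
                return c;
            }) &
        // Stage 2: parallel parsing tasks; order doesn't matter for
        // unordered data, so any number of these can run at once.
        tbb::make_filter<Chunk*, void>(tbb::filter_mode::parallel,
            [](Chunk *c) { parseChunk(c); delete c; }));

    fclose(f);
}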