There is a large text file of 6.53 GiB. Each of its lines is either a data line or a comment line. Comment lines are usually short, less than 80 characters, while a data line contains more than 2 million characters and is of variable length.
Since each data line needs to be processed as a unit, is there a simple way to read lines safely and fast in C++?
safe (safe for variable-length data lines): the solution is as easy to use as std::getline(). Since the line length varies, it is hoped that extra memory management can be avoided.
fast: the solution can be as fast as readline() in Python 3.6.0, or even as fast as fgets() from stdio.h.
A pure C solution is also welcome. The interface for further processing is provided in both C and C++.
UPDATE 1: Thanks to the short but invaluable comment from Basile Starynkevitch, the perfect solution comes up: POSIX getline(). Since the further processing only involves converting characters to numbers and does not use many features of the string class, a char array is sufficient in this application.
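For illustration, a minimal sketch of that combination (POSIX getline() feeding a character-to-number conversion with strtod()) might look as follows; the assumption that data lines hold whitespace-separated numbers is made up here and should be adapted to the real format:

// parse-sketch.cpp -- hypothetical sketch, not part of the original benchmarks:
// POSIX getline() plus strtod() for character-to-number conversion.
// Assumes whitespace-separated numbers on each data line; adapt to the real format.
#include <stdio.h>
#include <stdlib.h>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return EXIT_FAILURE;
    }
    FILE* fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    char* line = NULL;   // buffer allocated and grown by getline()
    size_t cap = 0;
    ssize_t len;
    std::vector<double> values;
    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len <= 80) continue;              // comment line: skip
        values.clear();
        char* p = line;
        char* end = NULL;
        for (double v = strtod(p, &end); end != p; v = strtod(p, &end)) {
            values.push_back(v);              // one number parsed from the data line
            p = end;
        }
        // ... use `values` for this data line ...
    }
    free(line);
    fclose(fp);
    return EXIT_SUCCESS;
}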
UPDATE 2: Thanks to the comments from Zulan and Galik, who both report comparable performance among std::getline(), fgets() and POSIX getline(), another possible solution is to use a better standard library implementation such as libstdc++. Moreover, here is a report claiming that the Visual C++ and libc++ implementations of std::getline are not well optimised.
Moving from libc++ to libstdc++ changes the results a lot. With libstdc++ 3.4.13 / Linux 2.6.32 on a different platform, POSIX getline(), std::getline() and fgets() show comparable performance. Initially, the code was run under the default settings of clang in Xcode 8.3.2 (8E2002), so libc++ was used.
More details and some efforts (very long):
getline() of <string> can handle arbitrarily long lines but is a bit slow. Is there an alternative in C++ to readline() in Python?
// benchmark on Mac OS X with libc++ and SSD:
readline() of python ~550 MiB/s
fgets() of stdio.h, -O0 / -O2 ~1100 MiB/s
getline() of string, -O0 ~27 MiB/s
getline() of string, -O2 ~150 MiB/s
getline() of string + stack buffer, -O2 ~150 MiB/s
getline() of ifstream, -O0 / -O2 ~240 MiB/s
read() of ifstream, -O2 ~340 MiB/s
wc -l ~670 MiB/s
cat data.txt | ./read-cin-unsync ~20 MiB/s
getline() of stdio.h (POSIX.1-2008), -O0 ~1300 MiB/s
Speeds are rounded very roughly, only to show the order of magnitude, and every program was run several times to ensure that the values are representative.
'-O0 / -O2' means the speed is very similar at both optimization levels.
The code is shown below.
readline() of python
# readline.py
import time
import os

t_start = time.perf_counter()

fname = 'data.txt'
fin = open(fname, 'rt')
count = 0
while True:
    l = fin.readline()
    length = len(l)
    if length == 0:  # EOF
        break
    if length > 80:  # data line
        count += 1
fin.close()

t_end = time.perf_counter()
time = t_end - t_start
fsize = os.path.getsize(fname)/1024/1024  # file size in MiB
print("speed: %d MiB/s" % (fsize/time))
print("reads %d data lines" % count)

# run as `python readline.py` with python 3.6.0
fgets() of stdio.h
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main(int argc, char* argv[]) {
    clock_t t_start = clock();

    if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
    }

    FILE* fp = fopen(argv[1], "r");
    if(fp == NULL) {
        perror("Failed to open file");
        return EXIT_FAILURE;
    }

    // maximum length of lines, determined previously by python
    const int SIZE = 1024*1024*3;
    char line[SIZE];
    int count = 0;
    while(fgets(line, SIZE, fp) == line) {
        if(strlen(line) > 80) {
            count += 1;
        }
    }

    clock_t t_end = clock();
    const double fsize = 6685; // file size in MiB
    double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    fprintf(stdout, "reads %d data lines\n", count);
    return EXIT_SUCCESS;
}
getline() of <string>
// readline-string-getline.cpp
#include <string>
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(int argc, char* argv[]) {
    clock_t t_start = clock();

    if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
    }

    // manually set the buffer on stack
    const int BUFFERSIZE = 1024*1024*3; // stack on my platform is 8 MiB
    char buffer[BUFFERSIZE];
    ifstream fin;
    fin.rdbuf()->pubsetbuf(buffer, BUFFERSIZE);
    fin.open(argv[1]);

    // default buffer setting
    // ifstream fin(argv[1]);

    if(!fin) {
        perror("Failed to open file");
        return EXIT_FAILURE;
    }

    // maximum length of lines, determined previously by python
    const int SIZE = 1024*1024*3;
    string line;
    line.reserve(SIZE);
    int count = 0;
    while(getline(fin, line)) {
        if(line.size() > 80) {
            count += 1;
        }
    }

    clock_t t_end = clock();
    const double fsize = 6685; // file size in MiB
    double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    fprintf(stdout, "reads %d data lines\n", count);
    return EXIT_SUCCESS;
}
getline() of ifstream
// readline-ifstream-getline.cpp
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <cstring>
using namespace std;

int main(int argc, char* argv[]) {
    clock_t t_start = clock();

    if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
    }

    ifstream fin(argv[1]);
    if(!fin) {
        perror("Failed to open file");
        return EXIT_FAILURE;
    }

    // maximum length of lines, determined previously by python
    const int SIZE = 1024*1024*3;
    char line[SIZE];
    int count = 0;
    while(fin.getline(line, SIZE)) {
        if(strlen(line) > 80) {
            count += 1;
        }
    }

    clock_t t_end = clock();
    const double fsize = 6685; // file size in MiB
    double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    fprintf(stdout, "reads %d data lines\n", count);
    return EXIT_SUCCESS;
}
read() of ifstream
// seq-read-bin.cpp
// sequentially read the file to see the speed upper bound of ifstream
#include <iostream>
#include <fstream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(int argc, char* argv[]) {
    clock_t t_start = clock();

    if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
    }

    ifstream fin(argv[1], ios::binary);
    const int SIZE = 1024*1024*3;
    char str[SIZE];
    while(fin) {
        fin.read(str, SIZE);
    }

    clock_t t_end = clock();
    double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;
    const double fsize = 6685; // file size in MiB
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    return EXIT_SUCCESS;
}
use cat, then read from cin with cin.sync_with_stdio(false)
// read-cin-unsync.cpp
#include <iostream>
#include <string>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;

int main(void) {
    clock_t t_start = clock();

    string input_line;
    cin.sync_with_stdio(false);
    while(cin) {
        getline(cin, input_line);
    }

    double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
    const double fsize = 6685; // file size in MiB
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    return EXIT_SUCCESS;
}
POSIX getline()
// readline-c-getline.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    clock_t t_start = clock();

    char *line = NULL;
    size_t len = 0;
    ssize_t nread;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    FILE *stream = fopen(argv[1], "r");
    if (stream == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    int length = -1;
    int count = 0;
    while ((nread = getline(&line, &len, stream)) != -1) {
        if (nread > 80) {
            count += 1;
        }
    }
    free(line);
    fclose(stream);

    double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
    const double fsize = 6685; // file size in MiB
    fprintf(stdout, "takes %.2f s\n", time);
    fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
    fprintf(stdout, "reads %d data lines.\n", count);
    // fprintf(stdout, "length of MSA: %d\n", length-1);
    exit(EXIT_SUCCESS);
}
Well, the C standard library is a subset of the C++ standard library. From the n4296 draft of the C++14 standard:
17.2 The C standard library [library.c]
The C++ standard library also makes available the facilities of the C standard library, suitably adjusted to ensure static type safety.
So, provided you explain in a comment that a performance bottleneck requires it, it is perfectly fine to use fgets in a C++ program - you should simply encapsulate it carefully in a utility class, in order to preserve the high-level OO structure.
As I commented, on Linux & POSIX systems, you could consider using getline(3); I guess that the following could compile both as C and as C++ (assuming you have some valid fopen-ed FILE* fil; ...):
char* linbuf = NULL; /// or nullptr in C++
size_t linsiz = 0;
ssize_t linlen = 0;

while((linlen = getline(&linbuf, &linsiz, fil)) >= 0) {
    // do something useful with linbuf; but no C++ exceptions
}
free(linbuf);
linsiz = 0;
I guess this might work in (or be easily adapted to) C++. But then, beware of C++ exceptions: they should not go through the while loop (or you should ensure that an appropriate destructor or catch does the free(linbuf);).
Also, getline could fail (e.g. if it calls a failing malloc), and you might need to handle that failure sensibly.
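To illustrate the encapsulation suggested above, here is a minimal sketch (the LineReader class name and its interface are made up for illustration, not part of the answer) of a small RAII wrapper that owns the getline() buffer and releases it in its destructor, so the buffer is freed even if an exception propagates out of the processing loop:

// line_reader_sketch.cpp -- hypothetical sketch of a RAII wrapper around POSIX getline()
#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>

class LineReader {
public:
    explicit LineReader(const char* path) : file_(fopen(path, "r")) {
        if (file_ == NULL) throw std::runtime_error("cannot open file");
    }
    ~LineReader() {                  // destructor frees the buffer and closes the file,
        free(buf_);                  // even if an exception escapes the processing loop
        if (file_) fclose(file_);
    }
    LineReader(const LineReader&) = delete;
    LineReader& operator=(const LineReader&) = delete;

    // Reads the next line; returns false at EOF. The pointer stays valid
    // until the next call to next() or until the reader is destroyed.
    bool next(const char** line, size_t* length) {
        ssize_t n = getline(&buf_, &cap_, file_);
        if (n < 0) return false;
        *line = buf_;
        *length = (size_t)n;
        return true;
    }

private:
    FILE*  file_ = NULL;
    char*  buf_  = NULL;             // managed by getline()
    size_t cap_  = 0;
};

int main(int argc, char* argv[]) {
    if (argc != 2) return EXIT_FAILURE;
    LineReader reader(argv[1]);
    const char* line;
    size_t length;
    int count = 0;
    while (reader.next(&line, &length)) {
        if (length > 80) count += 1;   // data line
    }
    printf("reads %d data lines\n", count);
    return EXIT_SUCCESS;
}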
Yes, there is a faster way to read lines and create strings.
Query the file size, then load the whole file into a buffer. Then iterate over the buffer, replacing each newline with a NUL and storing a pointer to the start of the next line.
It will be quite a bit faster if, as is likely, your platform has a call to load a file into memory.
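A minimal sketch of that approach, assuming the whole 6.5 GiB file fits in RAM and using plain ifstream::read() rather than a platform-specific memory-mapping call (e.g. mmap), might look like this:

// whole-file-sketch.cpp -- hypothetical sketch: read the whole file, then split lines in place
#include <fstream>
#include <vector>
#include <cstdio>
#include <cstring>

int main(int argc, char* argv[]) {
    if (argc != 2) return 1;

    std::ifstream fin(argv[1], std::ios::binary | std::ios::ate);
    if (!fin) { std::perror("open"); return 1; }

    const std::streamsize size = fin.tellg();      // query the file size
    fin.seekg(0);

    std::vector<char> buffer(static_cast<size_t>(size) + 1);
    if (!fin.read(buffer.data(), size)) { std::perror("read"); return 1; }
    buffer[static_cast<size_t>(size)] = '\0';

    // Replace newlines with NULs and remember where each line starts.
    std::vector<char*> lines;
    char* p = buffer.data();
    char* end = p + size;
    while (p < end) {
        lines.push_back(p);
        char* nl = static_cast<char*>(std::memchr(p, '\n', static_cast<size_t>(end - p)));
        if (nl == NULL) break;
        *nl = '\0';
        p = nl + 1;
    }

    int count = 0;
    for (char* line : lines) {
        if (std::strlen(line) > 80) count += 1;    // data line
    }
    std::printf("reads %d data lines\n", count);
    return 0;
}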