how to read extreme long lines from text file fast and safe in C++?

Tags:

There is a large text file of 6.53 GiB. Each line of it can be a data line or comment line. Comment lines are usually short, less than 80 characters, while a data line contains more than 2 million characters and is variable-length.

Considering each data line needs to be dealt with as a unit, is there a simple way to read lines safe and fast in C++?

safe (safe for variable-length data lines): The solution is as easy to use as std::getline(). Since the length is changing, it is hoped to avoid extra memory management.

fast: The solution can achieve as fast as readline() in python 3.6.0, or even as fast as fgets() of stdio.h.

A Pure C solution is welcomed. The interface for further processing is provided both in C and C++.

UPDATE 1: Thanks to short but invaluable comment from Basile Starynkevitch, the perfect solution comes up: POSIX getline(). Since further processing only involves converting from character to number and does not use many features of string class, a char array would be sufficient in this application.

UPDATE 2: Thanks to comments from Zulan and Galik, who both report comparable performance among std::getline(), fgets() and POSIX getline(), another possible solution is to use a better standard library implementation such as libstdc++. Moreover, here is a report claiming that the Visual C++ and libc++ implementations of std::getline is not well optimised.

Moving from libc++ to libstdc++ changes the results a lot. With libstdc++ 3.4.13 / Linux 2.6.32 on a different platform, POSIX getline(), std::getline() and fgets() show comparable performance. At the beginning, codes were run under the default settings of clang in Xcode 8.3.2 (8E2002), thus libc++ is used.

More details and some efforts (very long):

getline() of <string> can handle arbitrary long lines but is a bit slow. Is there an alternative in C++ for readline() in python?

// benchmark on Mac OS X with libc++ and SSD:
readline() of python                         ~550 MiB/s

fgets() of stdio.h, -O0 / -O2               ~1100 MiB/s

getline() of string, -O0                      ~27 MiB/s
getline() of string, -O2                     ~150 MiB/s
getline() of string + stack buffer, -O2      ~150 MiB/s

getline() of ifstream, -O0 / -O2             ~240 MiB/s
read() of ifstream, -O2                      ~340 MiB/s

wc -l                                        ~670 MiB/s

cat data.txt | ./read-cin-unsync              ~20 MiB/s

getline() of stdio.h (POSIX.1-2008), -O0    ~1300 MiB/s

Speeds are rounded very roughly, only to show the magnitude, and all code blocks are run several times to assure that the values are representative.
'-O0 / -O2' means the speeds are very similar for both optimization levels
Codes are shown as follows.

readline() of python

# readline.py

import time
import os

t_start = time.perf_counter()

fname = 'data.txt'
fin = open(fname, 'rt')

count = 0

while True:
    l = fin.readline()
    length = len(l)
    if length == 0:     # EOF
        break
    if length > 80:     # data line
        count += 1

fin.close()

t_end = time.perf_counter()
time = t_end - t_start

fsize = os.path.getsize(fname)/1024/1024   # file size in MiB
print("speed: %d MiB/s" %(fsize/time))
print("reads %d data lines" %count)

# run as `python readline.py` with python 3.6.0

fgets() of stdio.h

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main(int argc, char* argv[]){
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  FILE* fp = fopen(argv[1], "r");
  if(fp == NULL) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fgets(line, SIZE, fp) == line) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of <string>

// readline-string-getline.cpp
#include <string>
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  // manually set the buffer on stack
  const int BUFFERSIZE = 1024*1024*3;   // stack on my platform is 8 MiB
  char buffer[BUFFERSIZE];
  ifstream fin;
  fin.rdbuf()->pubsetbuf(buffer, BUFFERSIZE);
  fin.open(argv[1]);

  // default buffer setting
  // ifstream fin(argv[1]);

  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  string line;
  line.reserve(SIZE);

  int count = 0;
  while(getline(fin, line)) {
    if(line.size() > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

getline() of ifstream

// readline-ifstream-getline.cpp
#include <fstream>
#include <iostream>
#include <ctime>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1]);
  if(!fin) {
    perror("Failed to open file");
    return EXIT_FAILURE;
  }

  // maximum length of lines, determined previously by python
  const int SIZE = 1024*1024*3;
  char line[SIZE];

  int count = 0;
  while(fin.getline(line, SIZE)) {
    if(strlen(line) > 80) {
      count += 1;
    }
  }

  clock_t t_end = clock();

  const double fsize = 6685;  // file size in MiB

  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines\n", count);

  return EXIT_SUCCESS;
}

read() of ifstream

// seq-read-bin.cpp
// sequentially read the file to see the speed upper bound of
// ifstream

#include <iostream>
#include <fstream>
#include <ctime>

using namespace std;


int main(int argc, char* argv[]) {
  clock_t t_start = clock();

  if(argc != 2) {
    fprintf(stderr, "needs one input argument\n");
    return EXIT_FAILURE;
  }

  ifstream fin(argv[1], ios::binary);

  const int SIZE = 1024*1024*3;
  char str[SIZE];

  while(fin) {
    fin.read(str,SIZE);
  }

  clock_t t_end = clock();
  double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

use cat, then read from cin with cin.sync_with_stdio(false)

#include <iostream>
#include <ctime>
#include <cstdlib>

using namespace std;

int main(void) {
  clock_t t_start = clock();

  string input_line;

  cin.sync_with_stdio(false);

  while(cin) {
    getline(cin, input_line);
  }

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;

  const double fsize = 6685;  // file size in MiB

  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

  return EXIT_SUCCESS;
}

POSIX getline()

// readline-c-getline.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {

  clock_t t_start = clock();

  char *line = NULL;
  size_t len = 0;
  ssize_t nread;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s <file>\n", argv[1]);
    exit(EXIT_FAILURE);
  }

  FILE *stream = fopen(argv[1], "r");
  if (stream == NULL) {
    perror("fopen");
    exit(EXIT_FAILURE);
  }

  int length = -1;
  int count = 0;
  while ((nread = getline(&line, &len, stream)) != -1) {
    if (nread > 80) {
      count += 1;
    }
  }

  free(line);
  fclose(stream);

  double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
  const double fsize = 6685;  // file size in MiB
  fprintf(stdout, "takes %.2f s\n", time);
  fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
  fprintf(stdout, "reads %d data lines.\n", count);
  // fprintf(stdout, "length of MSA: %d\n", length-1);

  exit(EXIT_SUCCESS);
}

606

asked May 29 '17 11:05

Eli4ph

3 Answers

Well, the C standard library is a subset of the C++ standard library. From n4296 draft from C++ 2014 standard:

17.2 The C standard library [library.c]

The C++ standard library also makes available the facilities of the C standard library, suitably adjusted to ensure static type safety.

So provided you explain in a comment that a performance bottleneck requires it, it is perfectly fine to use fgets in a C++ program - simply you should carefully encapsulate it in an utility class, in order to preserve the OO high level structures.

answered Oct 12 '22 23:10

Serge Ballesta

As I commented, on Linux & POSIX systems, you could consider using getline(3); I guess that the following could compile both as C and as C++ (assuming you do have some valid fopen-ed FILE*fil; ...)

char* linbuf = NULL; /// or nullptr in C++
size_t linsiz = 0;
ssize_t linlen = 0;

while((linlen=getline(&linbuf, &linsiz,fil))>=0) {
  // do something useful with linbuf; but no C++ exceptions
}
free(linbuf); linsiz=0;

I guess this might work (or be easily adapted) to C++. But then, beware of C++ exceptions, they should not go thru the while loop (or you should ensure that an appropriate destructor or catch is doing free(linbuf);).

Also getline could fail (e.g. if it calls a failing malloc) and you might need to handle that failure sensibly.

answered Oct 13 '22 01:10

Basile Starynkevitch

Yes, there's a faster way to read lines and create strings.

Query the file size, then load it into a buffer. Then iterate over the buffer replacing the newlines with nuls and storing the pointer to the next line.

It will be quite a bit faster if, as is likely, your platform has a call to load a file into memory.

answered Oct 12 '22 23:10

Malcolm McLean

Related questions
                            
                                c++ function overloading with Exact Match
                            
                                Why can not I use constexpr value in function, but I can do the same in scope of this value?
                            
                                Overload resolution with template parameters
                            
                                MSVC 2013 'type' : is not a member of 'std::enable_if<false,void>
                            
                                Will the C++ Concepts TS enable multiple parameter packs?
                            
                                Literal operator template: why not string?
                            
                                SFINAE with ampersand before a function vs its name
                            
                                std::chrono::time_point invalid value
                            
                                c++11/regex - search for exact string, escape [duplicate]
                            
                                Using a parameter's name inside its own default value - is it legal?
                            
                                R: How to write interruptible C++ function, and recover partial results
                            
                                Prevent implicit conversion but allow list initialisation?
                            
                                What is an event loop in Qt?
                            
                                Do C++ compilers optimize repeated function calls?
                            
                                Conversion Function with decltype(auto) in C++14
                            
                                doxygen is generating empty documentation
                            
                                Difference between cv::Mat::t () and cv::transpose()
                            
                                Double inclusion and headers only library stbi_image
                            
                                How can I know if a USB device is already in use?
                            
                                This code, why does it have to show undefined behavior?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With