
Improving line-wise I/O operations in D

Tags: python, io, d

I need to process lots of medium to large files (a few hundred MB to a few GB) in a line-wise manner, so I'm interested in standard D approaches for iterating over lines. The foreach (line; file.byLine()) idiom seems to fit the bill and is pleasantly terse and readable; however, its performance seems to be less than ideal.

For example, below are two trivial programs in Python and D for iterating over the lines of a file and counting the lines. For a ~470 MB file (~3.6M lines) I get the following timings (best out of 10):

D times:

real    0m19.146s
user    0m18.932s
sys     0m0.190s

Python times (after EDIT 2, see below):

real    0m0.924s
user    0m0.792s
sys     0m0.129s

Here's the D version, compiled with dmd -O -release -inline -m64:

import std.stdio;
import std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine())
    linect += 1;
  writeln("There are: ", linect, " lines.");
  return 0;
}

And now the corresponding Python version:

import sys

if __name__ == "__main__":
    if (len(sys.argv) < 2):
        sys.exit()
    infile = open(sys.argv[1])
    linect = 0
    for line in infile:
        linect += 1
    print "There are %d lines" % linect

EDIT 2: I changed the Python code to use the more idiomatic for line in infile, as suggested in the comments below, leading to an even greater speed-up for the Python version, which now approaches the speed of the standard Unix wc -l tool.

Any advice or pointers on what I might be doing wrong in D that is giving such poor performance?

EDIT: And for comparison, here's a D version that throws the byLine() idiom out the window, sucks all the data into memory at once, and then splits the data into lines post hoc. This gives better performance but is still about 2x slower than the Python version.

import std.stdio;
import std.string;
import std.file;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto c = cast(string) read(args[1]);
  auto l = splitLines(c);
  writeln("There are ", l.length, " lines.");
  return 0;
}

The timings for this last version are as follows:

real    0m3.201s
user    0m2.820s
sys     0m0.376s
Asked Mar 08 '15 by Paul M.

1 Answer

EDIT AND TL;DR: This problem has been solved in https://github.com/D-Programming-Language/phobos/pull/3089. The improved File.byLine performance will be available starting with D 2.068.

I tried your code on a text file with 575,247 lines. The Python baseline takes about 0.125 seconds. Here's my code, with the timing for each method embedded in a comment. Explanations follow.

import std.algorithm, std.file, std.stdio, std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  size_t linect = 0;

  // 0.62 s
  foreach (line; File(args[1]).byLine())
    linect += 1;

  // 0.2 s
  //linect = args[1].readText.count!(c => c == '\n');

  // 0.095 s
  //linect = args[1].readText.representation.count!(c => c == '\n');

  // 0.11 s
  //linect = File(args[1]).byChunk(4096).joiner.count!(c => c == '\n');

  writeln("There are: ", linect, " lines.");
  return 0;
}

I used dmd -O -release -inline for each variant.

The first version (slowest) reads one line at a time. We could and should improve the performance of byLine; currently it's hamstrung by things like mixed use of byLine with other C stdio operations, which is probably overly conservative. If we do away with that, we can easily do prefetching etc.
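
Not part of the original answer, but to make the point concrete, here is a minimal sketch (the 64 KB buffer size is an arbitrary choice) that bypasses byLine entirely and counts newline bytes chunk by chunk, avoiding the per-line buffer management and stdio coupling described above:

import std.algorithm : count;
import std.stdio : File, writeln;

void main(string[] args)
{
  // Read raw 64 KB chunks and count newline bytes in each; no per-line
  // buffering, UTF decoding, or C stdio locking is involved.
  size_t linect = 0;
  foreach (chunk; File(args[1]).byChunk(64 * 1024))
    linect += chunk.count('\n');
  writeln("There are: ", linect, " lines.");
}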

The second version reads the file in one fell swoop and then uses a standard algorithm to count the lines with a predicate.
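
As a toy illustration of mine (not from the original answer), count! with a predicate simply walks the range and tallies the matching elements:

import std.algorithm : count;

void main()
{
  // Each '\n' terminates a line, so counting newlines counts lines.
  assert("one\ntwo\nthree\n".count!(c => c == '\n') == 3);
}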

The third version acknowledges that there's no need to mind any UTF subtleties; counting bytes works just as well, so it converts the string to its byte-wise representation (at no cost) and then counts the bytes.
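
To see why the conversion is free (a toy sketch of mine, not from the answer): representation reinterprets the string's memory as bytes, without copying or decoding:

import std.string : representation;

void main()
{
  string s = "hé\n";
  // The same memory viewed as immutable(ubyte)[]; no copy, no UTF decoding.
  immutable(ubyte)[] bytes = s.representation;
  assert(bytes.length == 4); // 'é' occupies two UTF-8 bytes
}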

The last version (my favorite) reads 4 KB of data from the file at a time and flattens the chunks lazily using joiner. Then, again, it counts the bytes.
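
For the curious, joiner produces a single lazy stream over the underlying chunks without allocating a flattened copy; a toy sketch (my addition, not the original answer's):

import std.algorithm : equal, joiner;

void main()
{
  auto chunks = [[1, 2], [3], [4, 5]];
  // The flattened view is computed on the fly as it's iterated.
  assert(chunks.joiner.equal([1, 2, 3, 4, 5]));
}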

Answered by Andrei Alexandrescu