I need to process lots of medium-to-large files (a few hundred MB to GBs) in a linewise manner, so I'm interested in standard D approaches for iterating over lines. The foreach(line; file.byLine()) idiom seems to fit the bill and is pleasantly terse and readable; however, its performance seems to be less than ideal.
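The per-line work itself varies from file to file and isn't the issue here; the general shape is just this (processFile and the empty loop body are placeholders for the real processing):

import std.stdio : File;

void processFile(string path)
{
    auto infile = File(path);
    foreach (line; infile.byLine())
    {
        // placeholder for the real per-line work (parsing, filtering, aggregating, ...)
    }
}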
For example, below are two trivial programs in Python and D for iterating over the lines of a file and counting the lines. For a ~470 MB file (~3.6M lines) I get the following timings (best out of 10):
D times:
real 0m19.146s
user 0m18.932s
sys 0m0.190s
Python times (after EDIT 2, see below):
real 0m0.924s
user 0m0.792s
sys 0m0.129s
Here's the D version, compiled with dmd -O -release -inline -m64:
import std.stdio;
import std.string;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    auto infile = File(args[1]);
    uint linect = 0;
    foreach (line; infile.byLine())
        linect += 1;

    writeln("There are: ", linect, " lines.");
    return 0;
}
And now the corresponding Python version:
import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit()

    infile = open(sys.argv[1])
    linect = 0
    for line in infile:
        linect += 1
    print "There are %d lines" % linect
EDIT 2: I changed the Python code to use the more idiomatic for line in infile, as suggested in the comments below, leading to an even greater speed-up for the Python version, which is now approaching the speed of the standard wc -l call to the Unix wc tool.
Any advice or pointers to what I might be doing wrong in D that is giving such poor performance?
EDIT: And for comparison, here's a D version that throws the byLine() idiom out the window and sucks all the data into memory at once, then splits it into lines post hoc. This gives better performance but is still about 2x slower than the Python version.
import std.stdio;
import std.string;
import std.file;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    auto c = cast(string) read(args[1]);
    auto l = splitLines(c);
    writeln("There are ", l.length, " lines.");
    return 0;
}
The timings for this last version are as follows:
real 0m3.201s
user 0m2.820s
sys 0m0.376s
EDIT AND TL;DR: This problem has been solved in https://github.com/D-Programming-Language/phobos/pull/3089. The improved File.byLine performance will be available starting with D 2.068.
I tried your code on a text file with 575,247 lines. The Python baseline takes about 0.125 seconds. Here's my code, with the timing for each variant embedded in a comment; explanations follow.
import std.algorithm, std.file, std.stdio, std.string;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    size_t linect = 0;

    // 0.62 s
    foreach (line; File(args[1]).byLine())
        linect += 1;

    // 0.2 s
    //linect = args[1].readText.count!(c => c == '\n');

    // 0.095 s
    //linect = args[1].readText.representation.count!(c => c == '\n');

    // 0.11 s
    //linect = File(args[1]).byChunk(4096).joiner.count!(c => c == '\n');

    writeln("There are: ", linect, " lines.");
    return 0;
}
I used dmd -O -release -inline for each variant.
The first version (slowest) reads one line at a time. We could and should improve the performance of byLine; currently it's hamstrung by the need to support mixing byLine with other C stdio operations on the same file, which is probably overly conservative. If we do away with that, we can easily do prefetching etc.
The second version reads the file in one fell swoop and then uses a standard algorithm to count the lines with a predicate.
The third version acknowledges that there's no need to mind any UTF subtleties; counting bytes works just as well, so it converts the string to its byte-wise representation (at no cost) and then counts the bytes.
The last version (my favorite) reads 4 KB of data from the file at a time and flattens the chunks lazily using joiner. It then counts the bytes, as before.
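A closely related variant (my own tweak, not one of the timed versions above) counts newlines inside each chunk instead of flattening with joiner, so count scans each 4 KB block as an array rather than pulling bytes through joiner one at a time. countByChunk is just an illustrative name:

import std.algorithm : count;
import std.stdio : File;

// Count newline bytes chunk by chunk; no per-line or per-byte range overhead.
size_t countByChunk(string path)
{
    size_t n = 0;
    foreach (ubyte[] chunk; File(path).byChunk(4096))
        n += chunk.count('\n');
    return n;
}

Like the other newline-counting versions, this reports the number of newline bytes, which matches the line count when the file ends with a trailing newline.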