Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

compilation issue with split/splitter

Tags:

d

phobos

Here's simple code:

import std.algorithm;
import std.array;
import std.file;

void main(string[] args)
{
    auto t = args[1].readText()
        .splitter('\n')
        .split("---")
    ;
}

Looks like it should work, but it won't compile. DMD 2.068.2 fails with this error:

Error: template std.algorithm.iteration.splitter cannot deduce function from
argument types !()(Result, string), candidates are:
...
Error: template instance std.array.split!(Result, string) error instantiating

It compiles if I insert .array before .split.

Am I missing something? Or is it a bug? I've tried to make a brief search in the bug tracker, but didn't found anything.

like image 433
sigod Avatar asked Oct 28 '15 00:10

sigod


1 Answers

Bottom line: problems like this can often be fixed by sticking a .array call right before the offending function. This provides it a buffer with enough functionality to run the algorithm.

What follows is the reasoning behind the library and a couple other ideas you can use to implement this too:

The reason this doesn't compile has to do with the philosophy behind std.algorithm and ranges: that they are as cheap as possible to push cost decisions to the top level.

In std.algorithm (and most well-written ranges and range-consuming algorithms), template constraints will reject any input that doesn't offer what it needs for free. Similarly, transforming ranges, like filter, splitter, etc., will return only those capabilities they can offer at minimal cost.

By rejecting them at compile time, they force the programmer to make the decision at the highest level as to how they want to pay those costs. You might rewrite the function to work differently, you might buffer it yourself with a variety of techniques to pay the costs up front, or whatever else you can find that works.

So here's what happens with your code: readText returns an array, which is a nearly full-featured range. (Since it returns a string, made of UTF-8, it doesn't actually offer random access as far as Phobos is concerned (though, confusing, the language itself sees it differently, search the D forums for the "autodecode" controversy if you want to learn more) since finding a Unicode code point in a list of variable-length utf-8 characters requires scanning it all. Scanning it all is not minimal cost, so Phobos will never attempt it unless you specifically ask for it.)

Anyway though, readText returns a range with plenty of features, including savability which splitter needs. Why does splitter need saving? Consider the result it promises: a range of strings starting at the last split point and continuing to the next split point. What does the implementation look like when writing this for a most generic range it can possibly do for cheap?

Something along these lines: first, save your starting position so you can return it later. Then, using popFront, advance through it until you find the split point. When it does, return the saved range up to the point of the split point. Then, popFront past the split point and repeat the process until you've consumed the whole thing (while(!input.empty)).

So, since splitter's implementation required the ability to save the starting point, it requires at least a forward range (which is just a savable range. Andrei now feels naming things like this is a bit silly because there's so many names, but at the time he was writing std.algorithm he still believed in giving them all names).

Not all ranges are forward ranges! Arrays are, saving them is as easy as returning a slice from the current position. Many numerical algorithms are too, saving them just means keeping a copy of the current state. Most transformation ranges are savable if the range they are transforming are savable - again, all they need to do is return the current state.

......as I write this, actually, I think your example should be savable. And, indeed, there is an overload that takes a predicate and compiles!

http://dlang.org/phobos/std_algorithm_iteration.html#.splitter.3

    import std.algorithm;
    import std.array;
    import std.stdio;

    void main(string[] args)
    {
            auto t = "foo\n---\nbar"
                    .splitter('\n')
                    .filter!(e => e.length)
                    .splitter!(a => a == "---")
            ;
            writeln(t);
    }

Output: [["foo"], ["bar"]]

Yea, it compiled and split on lines equal to a particular thing. The other overload, .splitter("---"), fails to compile, because that overload requires slice functionality (or a narrow string, which Phobos refuses to slice generically... but knows it actually can be anyway, so the function is special-cased. You see that all over the library.)

But, why does it require slicing instead of just saving? Honestly, I don't know. Maybe I'm missing something too, but the existence of the overload that does work implies to me that my conception of the algorithm is correct; it can be done this way. I do believe slicing is a bit cheaper, but the save version is cheap enough too (you'd keep a count of how many items you popped past to get to the splitter, then return saved.take(that_count).... maybe that's the reason right there: you would iterate over the items twice, once inside the algorithm, then again outside, and the library considers that sufficiently costly to punt up a level. (The predicate version sidesteps this by making your function do the scanning, and thus Phobos considers it not its problem anymore, you are aware of what your own function is doing.)

I can see the logic in that. I could go both ways on it though, cuz the decision to actually run over it again is still on the outside, but I don understand why that might not be desirable to do without some thought.

Finally, why doesn't splitter offer indexing or slicing on its output? Why doesn't filter offer it either? Why DOES map offer it?

Well, it has to do with that low cost philosophy again. map can offer it (assuming its input does) because map doesn't actually change the number of elements: the first element in the output is also the first element in the input, just with some function run on the result. Ditto for the last, and all others in between.

filter changes that though. Filtering out the odd numbers of [1,2,3] yields just [2]: the length is different and 2 is now found at the beginning instead of the middle. But, you can't know where it is until you actually apply the filter - you can't jump around without buffering the result.

splitter is similar to filter. It changes the placement of elements, and the algorithm doesn't know where it splits until it actually runs through the elements. So it can tell as you iterate, but not ahead of iteration, so indexing would be O(n) speed - computationally too expensive. Indexing is supposed to be extremely cheap.


Anyway, now that we understand why the principle is there - to let you, the end programmer make decisions about costly things like buffering (which requires more memory than is free) or additional iteration (which requires more CPU time than is cost-free to the algorithm), and have some idea as to why splitter needs it by thinking about its implementation, we can look at ways to satisfy the algorithm: we need to either use the version that eats a few more CPU cycles and write it with our custom compare function (see sample above), or provide slicing somehow. The most straightforward way is by buffering the result in an array.

import std.algorithm;
import std.array;
import std.file;

void main(string[] args)
{
    auto t = args[1].readText()
        .splitter('\n')
        .array // add an explicit buffering call, understanding this will cost us some memory and cpu time
        .split("---")
    ;
}

You might also buffer it locally or something yourself to reduce the cost of the allocation, but however you do it, the cost has to be paid somewhere and Phobos prefers you the programmer, who understands the needs of your program and if you are willing to pay these costs or not, to make that decision instead of it paying it on your behalf without telling you.

like image 54
Adam D. Ruppe Avatar answered Sep 22 '22 16:09

Adam D. Ruppe