Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does iterating a mapped sorted stream evaluate more elements than necessary?

Consider this small program, where we take a stream, sort it, map it, then iterate it:

public class AlphabetOrdinals {

    private static final List<Character> ALPHABET = List.of('a', 'b', 'c', 'd', 'e', 'f');
    private static final int STOP_ORDINAL = 'b' - 'a';

    public static void main(String[] args) {
        System.out.println("java.runtime.version = " + System.getProperty("java.runtime.version"));

        Stream<Integer> ordinals = ALPHABET.stream()
                                           .sorted()
                                           .map(AlphabetOrdinals::ordinal);

        int count = 0;

        Iterator<Integer> iterator = ordinals.iterator();
        while (iterator.hasNext()) {
            int ordinal = iterator.next();
            if (ordinal > STOP_ORDINAL) {
                System.out.println("stopping at " + ordinal);
                break;
            }
            System.out.println("consuming " + ordinal);
            ++count;
        }

        System.out.println("consumed " + count + " ordinals");
    }

    private static int ordinal(char letter) {
        int ordinal = letter - 'a';
        System.out.println("performing EXTREMELY EXPENSIVE mapping of " + letter + " -> " + ordinal);
        return ordinal;
    }

}

This program is silly, but is simplified from a real program where the iteration is intertwined with an iteration over another stream, so i can't easily replace it with a takeWhile/forEach.

I would expect this program to print:

java.runtime.version = 11+28
performing EXTREMELY EXPENSIVE mapping of a -> 0
consuming 0
performing EXTREMELY EXPENSIVE mapping of b -> 1
consuming 1
performing EXTREMELY EXPENSIVE mapping of c -> 2
stopping at 2
consumed 2 ordinals

But it prints:

java.runtime.version = 11+28
performing EXTREMELY EXPENSIVE mapping of a -> 0
performing EXTREMELY EXPENSIVE mapping of b -> 1
performing EXTREMELY EXPENSIVE mapping of c -> 2
performing EXTREMELY EXPENSIVE mapping of d -> 3
performing EXTREMELY EXPENSIVE mapping of e -> 4
performing EXTREMELY EXPENSIVE mapping of f -> 5
consuming 0
consuming 1
stopping at 2
consumed 2 ordinals

If i remove the .sorted(), it prints what i expect.

Why does this happen?

In the real program, the mapping step involves reading a load of data from a slow network drive, so i am keen not to do it more times than absolutely necessary!

like image 340
Tom Anderson Avatar asked Nov 20 '19 18:11

Tom Anderson


1 Answers

Boring Answer:

This is just the way the streams API implementation was written.

Less Boring Answer:

A stream has some kind of chain of operations to apply to the input. For a stream of references the operation that gets added for sorting is: java.util.stream.SortedOps.RefSortingSink (assuming you have a similar JDK to me). For map it is:

new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
            @Override
            Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
                return new Sink.ChainedReference<P_OUT, R>(sink) {
                    @Override
                    public void accept(P_OUT u) {
                        downstream.accept(mapper.apply(u));
                    }
                };
            }
        };

The relevant parts of the implementation of java.util.stream.SortedOps.RefSortingSink are here:

  @Override
        public void begin(long size) {
            if (size >= Nodes.MAX_ARRAY_SIZE)
                throw new IllegalArgumentException(Nodes.BAD_SIZE);
            list = (size >= 0) ? new ArrayList<T>((int) size) : new ArrayList<T>();
        }

        @Override
        public void end() {
            list.sort(comparator);
            downstream.begin(list.size());
            if (!cancellationWasRequested) {
                list.forEach(downstream::accept);
            }
            else {
                for (T t : list) {
                    if (downstream.cancellationRequested()) break;
                    downstream.accept(t);
                }
            }
            downstream.end();
            list = null;
        }

        @Override
        public void accept(T t) {
            list.add(t);
        }

As you can see sorted passes on the entire list to the next operation in the chain(the next operation is called downstream). The map operation however takes anything it receives, uses the mapping function and passes it on to the downstream. This means that if you only use map you get the lazy expected behavior, while if you use sorted, the entire now sorted stream gets shoved down map's throat in list.forEach(downstream::accept), and map can't refuse to take it, or only take part of it.

like image 91
PiRocks Avatar answered Nov 09 '22 01:11

PiRocks