Java Streams sport both sorted
and limit
methods, which respectively return a sorted version of a stream and return a stream just returning a specified number of items of a stream. When these operations are applied in succession, such as in:
stream.sorted().limit(qty).collect(Collectors.toList())
is the sorting is performed in a way that sorts qty
items or is the entire list sorted? In other words, if qty
is fixed, is this operation in O(n)
? The documentation doesn't specify the performance of these methods alone or in conjunction with each other.
The reason I ask is that the obvious imperative implementation of these operations would be to sort and then limit, taking time Θ(n * log(n))
. But these operations together can be performed in O(n * log(qty))
and a smart streaming framework could view the entire stream before executing it to optimize this special case.
Let me start by making the general point that the Java language specification places few restrictions on how streams are implemented. So it's really not too meaningful to ask about the performance of Java streams: it will vary significantly between implementations.
Also note that Stream
is an interface. You can create your own class that implements Stream
to have any performance or special behaviour on sorted
that you want. So really asking about the performance of Stream
makes no sense even within the context of one implementation. The OpenJDK implementation has lots of classes that implement the Stream
interface.
Having said that, if we look at the OpenJDK implementation, sorting of streams ends up in SortedOps
class (see source here) you will find that the sorting methods end up returning extensions of stateful operations. For example:
private static final class OfInt extends IntPipeline.StatefulOp<Integer>
These methods check if the upstream is already sorted in which case they just pass it to the downstream. They also have special exceptions for sized streams (i.e. upstream) which preallocate the arrays that they end up sorting which will improve efficiency (over a SpinedBuffer
that they use for unknown size streams). But whenever the upstream is not already sorted they accept all items, then sort them and then send to the accept
method of the downstream instance.
So the conclusion from this is that the OpenJDK sorted
implementation collects all items, then sorts, then sends downstream. In some cases this will be wasting resources when the downstream will then discard some elements. You are free to implement your own specialised sort operation that is more efficient than this for special cases. Probably the most straightforward way is to implement a Collector
that keeps a list of the n largest or smallest items in the stream. Your operation might then look something like:
.collect(new CollectNthLargest(4)).stream()
To replace
.sorted().limit(4)
There's a special collector in my StreamEx library which performs this operation: MoreCollectors.least(qty)
:
List<?> result = stream.collect(MoreCollectors.least(qty));
It uses PriorityQueue inside and actually works significantly faster with small qty on unsorted inputs. Note however if input is mostly sorted, then sorted().limit(qty)
may work faster as TimSort is incredibly fast for presorted data.
That's implementation-dependent and might also depend on whether the stream pipeline can "see through" potential operations between the sorted()
and the limit()
.
Even if you were to ask about the OpenJDK implementation it is subject to change since the javadocs make no guarantees about the runtime behavior. But no, currently it does not implement a k-min selection algorithm.
You also have to keep in mind that sorted()
doesn't work on infinite streams unless they already have the SORTED
characteristic.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With