
Stream vs Collection as return type

I am going through the discussion on which is the best way to design an API (Stream vs Collection as return type). The discussion in this post is very valuable.

@BrianGoetz's answer mentions one condition under which collections are better than streams. I couldn't quite understand what it means. Can someone please help with an example or an explanation?

"when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target."

My question is, specifically, what do "strong consistency requirements" and "consistent snapshot of a moving target" mean in real-world applications?

asked Sep 03 '21 03:09 by kosa




3 Answers

"when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target."

What the author @Brian Goetz was referring to is the point in time when the stream gets consumed.

Here lies the first misunderstanding of the java.util.stream API.

When you return a stream, the caller gets a handle on an object that has not started pulling elements yet.

Only when a terminal operation is invoked does the underlying collection get iterated. Until that point, the collection and its items can change. And this is the only lazy part about a stream. Otherwise you probably want to ride the bull of RxJava2.. ;- )
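The laziness described above can be sketched as follows. This is a minimal, single-threaded illustration; the class and method names (LazyStreamDemo, joinAfterMutation) are made up for this example. The source list is changed after the stream has been handed out but before the terminal operation runs, and the stream still sees the change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Returns what the stream actually sees once a terminal operation runs.
    static String joinAfterMutation() {
        List<String> source = new ArrayList<>(List.of("one", "two"));
        Stream<String> handle = source.stream(); // nothing has been traversed yet
        source.add("three");                     // the source changes after the stream was handed out
        return handle.collect(Collectors.joining(" ")); // traversal happens only here
    }

    public static void main(String[] args) {
        System.out.println(joinAfterMutation()); // prints "one two three"
    }
}
```

Modifying the source while the terminal operation is already running is a different matter; for a non-concurrent source such as ArrayList that counts as interference and must be avoided.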

// EDIT FOR THAT BOUNTY:

A real-world example would be: at this exact moment, what is the price of these specific shares?

Then you want to pass immutable objects, which the user can inspect and then use to place an order.

If the price changes in the meantime, the object is still what is needed to place the order, so you do not care how long the user takes to place it. The price was fixed beforehand.
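A rough sketch of that idea, assuming a hypothetical in-memory price table (livePrices) and an invented Quote record; neither comes from a real trading API. The quote handed to the user keeps the price it was created with, no matter what happens to the live table afterwards.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class QuoteDemo {
    // Immutable value object: the price it carries cannot change after creation.
    record Quote(String symbol, BigDecimal price) { }

    // Hypothetical live price table that other threads could keep updating.
    static final Map<String, BigDecimal> livePrices = new ConcurrentHashMap<>();

    // Fixes the current price into an immutable object.
    static Quote quoteFor(String symbol) {
        return new Quote(symbol, livePrices.get(symbol));
    }

    static boolean quoteSurvivesPriceChange() {
        livePrices.put("ACME", new BigDecimal("10.00"));
        Quote quote = quoteFor("ACME");
        livePrices.put("ACME", new BigDecimal("12.50")); // the market moves on
        return quote.price().equals(new BigDecimal("10.00")); // the quote did not move
    }

    public static void main(String[] args) {
        System.out.println(quoteSurvivesPriceChange()); // prints "true"
    }
}
```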

// EDIT END.

Anyhow, the same can happen to a collection until you start iterating. Both these cases are related to concurrent access.

Also, a stream is not an iteration over the items per se: each object is passed through the whole chain of operations in turn.

Therefore you have to approach the entire question differently, imho.

  • Should the collection be mutable or immutable?
  • Are you passing immutable objects? (If not, you need to consider the following question:)
  • Do you pass the references to the objects, so they can get altered or is a deep-copy required?

Once these questions are answered, let's talk about a disadvantage of streams: O(n) access. Suppose the user wants to access the object at a given index. Either they first iterate over all objects to append them to a new data structure, or they iterate in order until that item is reached. The latter costs a full traversal only in the worst case, whereas a new data structure doubles the heap allocation, which also affects garbage collection afterwards.
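The two options for indexed access can be sketched like this. The helper names (elementAt, elementAtViaCopy) are made up for illustration, and Stream.toList() assumes Java 16 or later.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class IndexAccessDemo {
    // Option 1: walk the stream in order until the wanted position is reached.
    static Optional<String> elementAt(Stream<String> stream, long index) {
        return stream.skip(index).findFirst();
    }

    // Option 2: materialize everything first, then use random access.
    static String elementAtViaCopy(Stream<String> stream, int index) {
        List<String> copy = stream.toList(); // allocates a second data structure
        return copy.get(index);
    }

    public static void main(String[] args) {
        System.out.println(elementAt(Stream.of("a", "b", "c"), 2).orElseThrow()); // prints "c"
        System.out.println(elementAtViaCopy(Stream.of("a", "b", "c"), 2));       // prints "c"
    }
}
```

With a List as the return type, the caller would simply write list.get(2) and pay neither cost.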

But why are streams so darn cute?

  1. Because you can write code which is just more readable. That's it! When all the client does is consume the items, streams are a good choice and make the client's code base more readable.
  2. There is this big elephant in the room: concurrency. When used appropriately, streams make it cheap (in terms of development time) to introduce mature multi-threading.
  3. Streams implement the AutoCloseable interface, which is nice.

Elaborating on the third point: when a resource has to be closed after consumption, the consumer always has to do this explicitly. Therefore a visitor pattern is the more applicable option, and within it the user can choose whether to use a stream or a collection. :- )

Imo, you should always stick to collections for an API. This way you do not require familiarity with the Stream API. Anybody who wants to use streams can do so on their own.
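To illustrate the point about closing, here is a small sketch using Stream.onClose to stand in for a real resource; the class and method names are invented for this example. Terminal operations do not close a stream by themselves, so the close handler only runs when the consumer calls close(), for instance through try-with-resources.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

class CloseableStreamDemo {
    // The consumer wraps the stream in try-with-resources, so close() is called.
    static boolean closedWithTryWithResources() {
        AtomicBoolean closed = new AtomicBoolean(false);
        try (Stream<String> lines = Stream.of("a", "b").onClose(() -> closed.set(true))) {
            lines.forEach(line -> { });
        }
        return closed.get();
    }

    // The consumer only runs a terminal operation; nobody calls close().
    static boolean closedWithoutExplicitClose() {
        AtomicBoolean closed = new AtomicBoolean(false);
        Stream.of("a", "b").onClose(() -> closed.set(true)).forEach(line -> { });
        return closed.get();
    }

    public static void main(String[] args) {
        System.out.println(closedWithTryWithResources());  // prints "true"
        System.out.println(closedWithoutExplicitClose());  // prints "false"
    }
}
```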

// EDIT 2: Elaborate on the confusion of streams - OPINIONATED

"This 'strong consistency requirements' seems related to more of design requirement. I would be happy to provide the bounty if the answer has details with authoritative references."

It is not about streams vs. collections. It is about the point in time at which one consumes the collection (both are collections anyway). If your user only wants the current state of the objects, you return a collection. If your user wants to subscribe to new items, they would register an Observable with your API.
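The difference between "current state" and "subscribe to new items" can be sketched with a hand-rolled callback registry. This is deliberately not the RxJava API; ItemSource and its methods are hypothetical names used only to show the two access styles side by side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ItemSource {
    private final List<String> items = new ArrayList<>();
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    // Current state: a copy that later additions cannot affect.
    synchronized List<String> currentItems() {
        return List.copyOf(items);
    }

    // Future items: the callback is invoked for every item added from now on.
    synchronized void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    synchronized void add(String item) {
        items.add(item);
        subscribers.forEach(s -> s.accept(item));
    }

    public static void main(String[] args) {
        ItemSource source = new ItemSource();
        source.add("a");
        List<String> snapshot = source.currentItems();
        source.subscribe(item -> System.out.println("new item: " + item));
        source.add("b");              // prints "new item: b"
        System.out.println(snapshot); // prints "[a]"
    }
}
```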

This is, imo, where the confusion about streams is rooted. There are the libraries from https://reactiveX.io which provide a stream-like interface for subscribing to a data source.

This picture shows the timeline of one of their classes, Observable. [Figure: Observable timeline] What is happening is quite simple: the caller registers transformation methods and callbacks, which are invoked once you start to emit items. This is exactly the old principle of an observer callback. I would strongly advise against using Observables, for various reasons.

  1. All colleagues have to be familiar with them.
  2. Debugging gets harder, since the call stacks are far more verbose.
  3. One can easily end up in callback hell.
  4. Their application is highly specialized, so use them rarely. They are a good fit if you are continuously emitting the same items for every user. If you are doing normal CRUD operations, don't introduce Observables.

They are fun, though. :- )

answered Oct 16 '22 16:10 (4 revs, 2 users 94%)


In this context, the notion of "strong consistency requirement" is relative to the system or application within which the code resides. There's no specific notion of "strong consistency" that's independent of the system or application. Here's an example of "consistency" that is determined by what assertions you can make about a result. It should be clear that the semantics of these assertions are entirely application-specific.

Suppose you have some code that implements a room where people can enter and leave. You might want the relevant methods to be synchronized so that all enter and leave actions occur in some order. For example: (using Java 16)

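import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

record Person(String name) { }

public class Room {
    final Set<Person> occupants = Collections.newSetFromMap(new ConcurrentHashMap<>());

    public synchronized void enter(Person p) { occupants.add(p); }
    public synchronized void leave(Person p) { occupants.remove(p); }
    // Not synchronized: the returned stream is traversed later, outside any lock.
    public Stream<Person> occupants() { return occupants.stream(); }
}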

(Note, I'm using ConcurrentHashMap here because it doesn't throw ConcurrentModificationException if it's modified during iteration.)
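The difference in iteration behaviour can be checked with a small single-threaded sketch; the class and helper names are made up for illustration. It starts iterating, modifies the set, and then continues: a HashSet iterator fails fast, while the set backed by ConcurrentHashMap keeps going.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class IterationDemo {
    // Starts iterating, modifies the set, then continues;
    // reports whether the iterator failed fast.
    static boolean throwsOnModification(Set<String> set) {
        set.addAll(List.of("Brett", "Chris", "Dana"));
        Iterator<String> it = set.iterator();
        it.next();
        set.add("Ashley"); // structural modification during iteration
        try {
            while (it.hasNext()) {
                it.next();
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOnModification(new HashSet<>())); // prints "true"
        System.out.println(throwsOnModification(
                Collections.newSetFromMap(new ConcurrentHashMap<>()))); // prints "false"
    }
}
```

Whether the concurrent iterator actually visits "Ashley" is not specified, which is exactly the weak consistency discussed in this answer.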

Next, consider some threads executing these methods in this order:

room.enter(new Person("Brett"));
room.enter(new Person("Chris"));
room.enter(new Person("Dana"));
room.leave(new Person("Dana"));
room.enter(new Person("Ashley"));

Now, at around the same time, suppose a caller gets a list of persons in the room by doing this:

List<Person> occupants1 = room.occupants().toList();

The result might be:

[Dana, Brett, Chris, Ashley]

How is this possible? The stream is lazily evaluated, and the elements are being pulled into a List at the same time other threads are modifying the source of the stream. In particular, it's possible for the stream to have "seen" Dana, then Dana is removed and Ashley added, and then the stream advances and encounters Ashley.

What does the stream represent, then? To find out, we have to dig into what ConcurrentHashMap says about its streams in the presence of concurrent modification. The set is built from CHM's keySet view, which says "The view's iterators and spliterators are weakly consistent." The definition of weakly consistent is in turn:

Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators and Spliterators provide weakly consistent rather than fast-fail traversal:

  • they may proceed concurrently with other operations
  • they will never throw ConcurrentModificationException
  • they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.

What does this mean for our Room application? I'd say it means that if a person appears in the stream of occupants, that person was in the room at some point. That's a pretty weak statement. Note in particular that it does not allow you to say that Dana and Ashley were in the room at the same time. It might seem that way from the contents of the List, but that would be incorrect, as a simple inspection reveals.

Now suppose we were to change the Room class to return a List instead of a Stream, and the caller were to use that instead:

// in class Room
public synchronized List<Person> occupants() { return List.copyOf(occupants); }

// in the caller
List<Person> occupants2 = room.occupants();

The result might be:

[Dana, Brett, Chris]

You can make much stronger statements about this List than about the previous one. You can say that Chris and Dana were in the room at the same time, and that at this particular point in time, Ashley was not in the room.

The List version of occupants() gives you a snapshot of the occupants of the room at a particular time. This allows you much stronger statements than the stream version, which only tells you that certain persons were in the room at some point.
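A single-threaded sketch of the snapshot behaviour, assuming Java 16 or later for records. The wrapper class SnapshotRoomDemo and the method snapshotBeforeLaterChanges are invented for this illustration; the Room inside it mirrors the List-returning variant described in this answer.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotRoomDemo {
    record Person(String name) { }

    static class Room {
        private final Set<Person> occupants = Collections.newSetFromMap(new ConcurrentHashMap<>());

        synchronized void enter(Person p) { occupants.add(p); }
        synchronized void leave(Person p) { occupants.remove(p); }

        // The copy is taken under the same lock as enter/leave,
        // so it reflects one consistent point in time.
        synchronized List<Person> occupants() { return List.copyOf(occupants); }
    }

    static List<Person> snapshotBeforeLaterChanges() {
        Room room = new Room();
        room.enter(new Person("Brett"));
        room.enter(new Person("Chris"));
        room.enter(new Person("Dana"));
        List<Person> snapshot = room.occupants();
        room.leave(new Person("Dana"));   // later changes do not touch the snapshot
        room.enter(new Person("Ashley"));
        return snapshot;
    }

    public static void main(String[] args) {
        // Contains Brett, Chris and Dana (in unspecified order), but not Ashley.
        System.out.println(snapshotBeforeLaterChanges());
    }
}
```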

Why would you ever want an API with weaker semantics? Again, it depends on the application. If you want to send a survey to people who used the room, all you care about is whether they were ever in the room. You don't care about other things, like who else was in the room at the same time.

The API with stronger semantics is potentially more expensive. It needs to make a copy of the collection, which means allocating space and spending time copying. It needs to hold a lock while it does this, to prevent concurrent modification, and this temporarily blocks other updates from proceeding.

To summarize, the notion of "strong" or "weak" consistency is highly dependent on the context. In this case I made up an example with some associated semantics, such as "in the room at the same time" or "was in the room at some point in time." The semantics required by the application determine the strength or weakness of the consistency of the results. This in turn drives what Java mechanisms should be used, such as streams vs. collections and when to apply locks.

answered Oct 16 '22 15:10 by Stuart Marks


So basically, when you return a collection, you return a snapshot of the players list at that particular moment, that is, a copy of the list as it was when getPlayersAsCollection was called. Any change made by other threads to the players list afterwards is not reflected in the collection returned earlier. This is how consistency is maintained: at the time of the call you got exactly what was in the players list, even though that list is constantly being modified as player details are added or removed. That is what "consistent snapshot of a moving target" means.

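import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class Team {
    private final List<Player> players = new ArrayList<>();

    // ...

    public List<Player> getPlayersAsCollection() {
        // Copy the list so the caller gets a snapshot; Collections.unmodifiableList
        // would only return a read-only view that still reflects later changes.
        return List.copyOf(players);
    }

    public Stream<Player> getPlayersAsStream() {
        return players.stream();
    }
}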

Whereas when a stream is returned, it is as if a pointer to the players list were returned. Any change made to the players list between the moment getPlayersAsStream returns the Stream and the moment you perform a terminal operation on it will be reflected in the results. So there is no strong consistency in this case: the data can change between the time getPlayersAsStream is called and the time you actually consume the returned Stream.

But again, returning a Stream has its own advantages, as explained in the post linked in the question. Whether to return a Stream or a Collection depends on the particular use case.
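One detail worth checking when implementing a snapshot-returning method: a collection only behaves as a snapshot if it is a copy. The sketch below (class and method names are made up) contrasts Collections.unmodifiableList, which is a read-only view over the original list, with List.copyOf, which produces an independent copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ViewVersusCopyDemo {
    // Returns {size of the view, size of the copy} after the source list grows.
    static int[] sizesAfterAdd() {
        List<String> players = new ArrayList<>(List.of("Ann", "Bob"));
        List<String> view = Collections.unmodifiableList(players); // read-only view of players
        List<String> copy = List.copyOf(players);                  // independent snapshot
        players.add("Cy");
        return new int[] { view.size(), copy.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterAdd();
        System.out.println(sizes[0] + " " + sizes[1]); // prints "3 2"
    }
}
```

The view follows the moving target just like a stream would; only the copy stays fixed.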

I hope this helps and clarifies your doubt on "when there are strong consistency requirements, and you have to produce a consistent snapshot of a moving target."

answered Oct 16 '22 16:10 by Satyam Singh