
Are non-parallel Streams meant to perform an operation en masse on a large amount of data?

A few weeks ago, I was searching for a way to extract a specific value from a file and stumbled on this question, which introduced me to the Stream object.

My first instinct was to investigate whether this object would help with other file operations, such as replacing several placeholders with corresponding values, for which I used BufferedReader and FileWriter. I failed miserably at producing any working code, but since then I have taken an interest in articles covering the subject, so that I could understand the intended use of Stream.

Along the way, I stumbled upon Optional and came to a good understanding of it. I can now identify the cases where I am comfortable using Optional while keeping my code clean and understandable. However, I can't say the same for Stream, not to mention that it may not have provided the performance gain I imagined it would bring and will still need a finally clause in cases where IO is involved.

Here is the main issue I've been trying to wrap my head around, keeping in mind that I have mostly worked on single-threaded programs until now: When is it preferred to use a Stream aside from parallel processing?

Is it to do an operation in bulk on a specific subset of a big collection of data, where Collection would have been used when trying to access and manipulate specific objects of the said collection? Although it seems to be the intended use, I'm still not sure that the example I linked at the beginning of my question is your typical use case.

Or is it only a construct used to make the code smaller thanks to lambda expressions, at the expense of readability? (Nothing against lambdas if used correctly, but most of the examples of Stream usage I saw were quite illegible, which didn't help my general understanding.)

Eldros asked Oct 29 '22 00:10



1 Answer

I've always referred to the description on the Java 8 Streams API page to help me decide between a Collection and a Stream:

However, [the Streams API] has many benefits. First, the Streams API makes use of several techniques such as laziness and short-circuiting to optimize your data processing queries.
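To make "laziness and short-circuiting" concrete, here is a minimal sketch. The class and method names are hypothetical and chosen only for illustration. It runs a filter over an infinite sequential stream and records which elements were actually pulled through the pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical names, used only for this illustration.
class LazyStreamSketch {

    // Records every element that actually travels through the pipeline.
    static final List<Integer> inspected = new ArrayList<>();

    static Optional<Integer> firstMultipleOfSeven() {
        inspected.clear();
        return Stream.iterate(1, n -> n + 1)   // conceptually infinite: 1, 2, 3, ...
                .peek(inspected::add)          // runs only for elements that are pulled
                .filter(n -> n % 7 == 0)       // intermediate operation, evaluated lazily
                .findFirst();                  // terminal operation, stops at the first match
    }

    public static void main(String[] args) {
        System.out.println(firstMultipleOfSeven().get()); // prints 7
        System.out.println(inspected);                    // prints [1, 2, 3, 4, 5, 6, 7]
    }
}
```

Nothing is computed until findFirst() is called, and the pipeline stops as soon as a match is found, which is why the infinite source terminates. An eager approach that first materialized the whole source into a Collection could not work here at all.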

Both a Stream and a Collection can be used to apply a computation to every single element of a dataset before storing it. However, I've found Streams useful if my pipeline includes several distinct filter/sort/map operations for each data element, as the Stream API can evaluate these steps lazily and short-circuit where possible, and has parallelization support built in as well.
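As a rough sketch of such a pipeline, the example below applies the same filter/map/sort steps once with a Stream and once with a plain loop over a Collection. The class name, method names, and sample words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical names and data, used only for this illustration.
class PipelineSketch {

    // Stream version: each step of the pipeline is one declarative line.
    static List<String> withStream(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 3)   // keep words longer than three characters
                .map(String::toUpperCase)      // transform each remaining word
                .sorted()                      // natural (lexicographic) order
                .collect(Collectors.toList());
    }

    // Loop version: same result, with the steps interleaved by hand.
    static List<String> withLoop(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String w : words) {
            if (w.length() > 3) {
                result.add(w.toUpperCase());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "map", "filter", "sort", "io");
        System.out.println(withStream(words)); // prints [FILTER, SORT, STREAM]
        System.out.println(withLoop(words));   // prints [FILTER, SORT, STREAM]
    }
}
```

Both versions produce the same list. In the Stream version, adding or removing a step is a one-line change, and switching to words.parallelStream() would be the only change needed to try parallel execution. Whether that helps performance depends on the data and should be measured.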

I agree that readability can be affected both positively and negatively by using a Stream - you're correct that some Stream examples are completely unreadable, and I don't think that readability should be the key decision point for using a Stream over something else.

If you're truly optimizing for performance on a large dataset, consider using a toolset that's purpose-built for massive datasets instead.

Adil B answered Nov 11 '22 14:11
