 

Convert `BufferedReader` to `Stream<String>` in a parallel way

Is there a way to obtain a `Stream<String>` `stream` from a `BufferedReader` `reader` such that each string in `stream` represents one line of `reader`, with the additional condition that `stream` is provided directly (before `reader` has read everything)? I want to process the data of `stream` in parallel with getting it from `reader`, to save time.

Edit: I want to process the data in parallel with reading it. I don't want to process different lines in parallel; they should be processed in order.

Let me give an example of how I want to save time. Say our `reader` will present 100 lines to us. It takes 2 ms to read one line and 1 ms to process it. If I first read all the lines and then process them, it takes 300 ms. What I want to do instead: as soon as a line is read, I want to process it while the next line is read in parallel. The total time would then be 201 ms.

What I don't like about `BufferedReader.lines()`: as far as I understand, reading only starts when I start processing the lines. Assume I already have my `reader` but have to do precomputations before being able to process the first line, costing, say, 30 ms. In the above example the total time would then be 231 ms or 301 ms using `reader.lines()` (can you tell me which of those times is correct?). But it would be possible to get the job done in 201 ms, since the precomputations can be done in parallel with reading the first 15 lines.

asked May 12 '15 by principal-ideal-domain

3 Answers

You can use `reader.lines().parallel()`. This way your input will be split into chunks, and further stream operations will be performed on the chunks in parallel. If those further operations take significant time, you might get a performance improvement.

In your case the default heuristic will not work as you want, and I guess there's no ready-made solution that lets you use single-line batches. You can write a custom spliterator which splits after each line. Look into the `java.util.Spliterators.AbstractSpliterator` implementation. Probably the easiest solution is to write something similar, but limit batch sizes to one element in `trySplit` and read a single line in the `tryAdvance` method.
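A sketch of that idea (the class name is mine, not a ready-made API): an `AbstractSpliterator` subclass that reads one line per `tryAdvance` and hands out single-line batches from `trySplit`, so stream workers can pick up each line as soon as it has been read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, ORDERED | NONNULL);
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) return false; // end of input
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Spliterator<String> trySplit() {
        // Split off a batch of exactly one line instead of the growing
        // batches that AbstractSpliterator's default trySplit produces.
        try {
            String line = reader.readLine();
            if (line == null) return null;
            return Spliterators.spliterator(new String[] { line }, ORDERED | NONNULL);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A parallel stream over it is then obtained with `StreamSupport.stream(new LineSpliterator(reader), true)`.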

answered by Tagir Valeev


To do what you want, you would typically have one thread that reads lines and adds them to a blocking queue, and a second thread that gets lines from this blocking queue and processes them.
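A minimal sketch of that producer/consumer setup (the class name, queue capacity, and sentinel value are mine): a background thread reads lines into a bounded queue while the calling thread takes them out and processes them in order, so reading and processing overlap.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

class PipelinedReader {
    // Sentinel marking end of input; compared by identity, so it cannot
    // collide with a real line that happens to have the same content.
    private static final String POISON = new String("<eof>");

    static void process(BufferedReader reader, Consumer<String> handler)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line); // blocks if the consumer falls behind
                }
            } catch (IOException | InterruptedException e) {
                // a real implementation would propagate the failure
                e.printStackTrace();
            } finally {
                try {
                    queue.put(POISON); // signal end of input
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        String line;
        while ((line = queue.take()) != POISON) {
            handler.accept(line); // processing overlaps with reading
        }
        producer.join();
    }
}
```

The bounded queue provides back-pressure: if processing is slower than reading, the producer blocks instead of buffering the whole file in memory.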

answered by JB Nizet


You are looking in the wrong place. You are thinking that a stream of lines will read lines from the file, but that's not how it works. You can't tell the underlying system to read a line, since nobody knows what a line is before reading.

A `BufferedReader` gets its name from its character buffer, which has a default capacity of 8192. Whenever a new line is requested, the buffer is parsed for a line-terminator sequence and that part is returned. When the buffer does not hold enough characters to find a complete line, the entire buffer is refilled.

Now, filling the buffer may lead to requests to read bytes from the underlying `InputStream` to fill the buffer of the character decoder. How many bytes will be requested, and how many will actually be read, depends on the buffer size of the character decoder, on how many bytes of the actual encoding map to one character, and on whether the underlying `InputStream` has its own buffer and how big it is.

The actually expensive operation is reading bytes from the underlying stream, and there is no trivial mapping from line-read requests to these read operations. Requesting the first line may cause reading, say, one 16 KiB chunk from the underlying file, and the subsequent one hundred requests might be served from the filled buffer and cause no I/O at all. Nothing you do with the Stream API can change that. The only thing you could parallelize is the search for newline characters in the buffer, which is too trivial to benefit from parallel execution.

You could reduce the buffer sizes of all involved parties to roughly get your intended parallel reading of one line while processing the previous line; however, that parallel execution will never compensate for the performance degradation caused by the small buffer sizes.
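The buffering behavior described above is easy to observe. A small experiment (the wrapper class name is mine): wrap the underlying `Reader` so we can count how many characters `BufferedReader` actually pulls from it.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Counts the characters that a BufferedReader pulls from the wrapped
// Reader, making the chunked reads visible.
class CountingReader extends FilterReader {
    int charsRead = 0;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) charsRead += n;
        return n;
    }
}
```

Wrapping an in-memory source of 100 short lines (well under the 8192-character buffer) and calling `readLine()` once shows `charsRead` equal to the whole input length: the first request fills the entire buffer, and the remaining 99 calls cause no reads at all.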

answered by Holger