I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I'm doing some computations (map()/filter()).
At first I thought this was already done in parallel, but it seems I'm wrong: when reading the file as it is, it takes about 50 seconds on my dual-core laptop. However, if I split the file using bash commands and then process the parts in parallel, it only takes about 30 seconds.
I tried the following combinations:
- single file, no parallel lines() stream: ~ 50 seconds
- single file, Files.lines(..).parallel().[...]: ~ 50 seconds
- two files, no parallel lines() stream: ~ 30 seconds
- two files, Files.lines(..).parallel().[...]: ~ 30 seconds
I ran these 4 multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
The conclusion is that there is no difference in using lines().parallel(). As reading two files in parallel takes a shorter time, there is a performance gain from splitting the file. However, it seems the whole file is read serially.
Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total.
Splitting the file in bash takes about 1.5 seconds:
time split -l 829326 file # 829326 = 1658652 / 2
split -l 829326 file 0,14s user 1,41s system 16% cpu 9,560 total
So my question is: is there any class or function in the Java 8 JDK which can parallelize reading all lines without having to split the file first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
When reading in parallel, each partition of the graph will process part of the file. A file split (or more commonly split) is a contiguous segment of a data file, spanning a range of bytes. DataFlow performs parallel reads by breaking files into a number of splits and assigning them to different partitions.
Java 8 added a new method called lines() to the Files class, which can be used to read a file line by line. The beauty of this method is that it returns all lines of the file as a Stream<String> that is populated lazily as the stream is consumed.
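As a rough illustration of the lazy behaviour described above, here is a minimal sketch. The class name LazyLines, the method countNonBlank and the temporary file are made up for this example; only Files.lines and the stream operations are JDK API. Since the stream keeps the file open until it is closed, it is wrapped in try-with-resources.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class LazyLines {

    // Counts the non-blank lines of a file without loading it into memory as a whole.
    // (Hypothetical helper, not part of the JDK.)
    public static long countNonBlank(Path path) throws IOException {
        // Files.lines reads lazily and holds the file open until the stream is closed.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Small throwaway file so the example is self-contained.
        Path tmp = Files.createTempFile("lazy-lines", ".txt");
        Files.write(tmp, Arrays.asList("alpha", "", "beta"), StandardCharsets.UTF_8);
        System.out.println(countNonBlank(tmp));
        Files.delete(tmp);
    }
}
```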
Parallel streams are a feature of Java 8 and higher, meant for utilizing multiple cores of the processor. Normally, Java code has a single flow of processing and is executed sequentially.
Parallel Streams. Any stream in Java can easily be transformed from sequential to parallel. We can achieve this by adding the parallel method to a sequential stream or by creating a stream using the parallelStream method of a collection: List<Integer> listOfNumbers = Arrays.
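To make the two ways of obtaining a parallel stream concrete, here is a small sketch. The class ParallelStreamDemo and the method evenSquares are hypothetical names for this example. The map/filter chain is the same in both cases; only the way the stream is created differs. Whether the parallel variant is actually faster depends on how well the source can be split, which is exactly what the question above runs into.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class ParallelStreamDemo {

    // Squares the even numbers of the list, either sequentially or in parallel.
    // (Hypothetical helper for illustration.)
    public static int[] evenSquares(List<Integer> numbers, boolean parallel) {
        // parallelStream() on a collection is equivalent to stream().parallel().
        Stream<Integer> stream = parallel ? numbers.parallelStream() : numbers.stream();
        return stream
                .filter(n -> n % 2 == 0)
                .mapToInt(n -> n * n)
                .toArray(); // terminal operation; keeps encounter order for an ordered source
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(Arrays.toString(evenSquares(numbers, false)));
        System.out.println(Arrays.toString(evenSquares(numbers, true)));
    }
}
```

Both calls should print the same array, because a List is an ordered source and toArray() respects that order even when the work is spread over several threads.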
You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).
If you map the file into memory through a file channel, you should be able to process the data in parallel from there with great speed, but chances are you won't need to, as the mapping alone should already give you a huge speed increase.
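A minimal sketch of that idea, under several assumptions: the file is smaller than 2 GB (a single mapping is limited to Integer.MAX_VALUE bytes), it is UTF-8 encoded, and lines end with '\n'. The class MappedLineReader and its methods are made up for this example and are not a JDK facility. The file is mapped once, cut into roughly equal byte ranges that are moved forward to the next line break, and each range is decoded by its own task. This is not measured against the timings in the question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MappedLineReader {

    // Reads all lines using `parts` byte ranges that are decoded in parallel.
    // Assumes UTF-8, '\n' line endings and a file below 2 GB.
    public static List<String> readLines(Path path, int parts) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            if (ch.size() > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            int size = (int) ch.size();
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);

            // bounds[i] is the first byte of range i; every boundary sits right after a '\n'.
            int[] bounds = new int[parts + 1];
            bounds[parts] = size;
            for (int i = 1; i < parts; i++) {
                int pos = Math.max(bounds[i - 1], (int) ((long) size * i / parts));
                while (pos < size && buf.get(pos) != '\n') {
                    pos++; // absolute get, does not move the buffer position
                }
                bounds[i] = Math.min(size, pos + 1);
            }

            return IntStream.range(0, parts).parallel()
                    .mapToObj(i -> decode(buf, bounds[i], bounds[i + 1]))
                    .flatMap(List::stream)
                    .collect(Collectors.toList()); // encounter order is preserved
        }
    }

    // Decodes the bytes in [from, to) and splits them at '\n'.
    private static List<String> decode(MappedByteBuffer buf, int from, int to) {
        ByteBuffer view = buf.duplicate(); // independent position per task
        view.position(from);
        byte[] bytes = new byte[to - from];
        view.get(bytes);
        String text = new String(bytes, StandardCharsets.UTF_8);

        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int nl; (nl = text.indexOf('\n', start)) >= 0; start = nl + 1) {
            lines.add(text.substring(start, nl));
        }
        if (start < text.length()) {
            lines.add(text.substring(start)); // final line without a trailing '\n'
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped", ".txt");
        tmp.toFile().deleteOnExit(); // a mapped file cannot always be deleted right away
        Files.write(tmp, "one\ntwo\n\nfour\nfive".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(tmp, 2));
    }
}
```

Cutting at the byte '\n' is safe for UTF-8 because that byte value never occurs inside a multi-byte sequence; for other encodings or "\r\n" line endings the boundary search and the splitting would need to be adapted. Collecting into a List gives up the laziness of Files.lines, so for a map/filter chain it may be preferable to apply those steps inside each task before collecting.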