How to read all lines of a file in parallel in Java 8

I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).

At first I thought this was already done in parallel, but it seems I was wrong: reading the file as is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the parts in parallel, it takes only about 30 seconds.

I tried the following combinations:

  1. single file, non-parallel lines() stream ~ 50 seconds
  2. single file, Files.lines(..).parallel().[...] ~ 50 seconds
  3. two files, non-parallel lines() stream ~ 30 seconds
  4. two files, Files.lines(..).parallel().[...] ~ 30 seconds

I ran each of these four variants multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
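For reference, the measured pipeline has roughly the following shape. This is a minimal sketch, not the original code: the class name LinesPipeline, the process helper, and the trim/isEmpty steps are placeholders standing in for the elided [...] chain.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class LinesPipeline {

    // Runs a map/filter chain over the lines of a file and collects the result.
    // The trim/isEmpty steps are placeholders for the real computations.
    static String[] process(Path path, boolean parallel) throws IOException {
        // Files.lines reads lazily and holds the file open, so close the stream.
        try (Stream<String> lines = Files.lines(path)) {
            Stream<String> source = parallel ? lines.parallel() : lines;
            return source
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: LinesPipeline <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            String[] result = process(path, parallel);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("parallel=" + parallel + ": "
                    + result.length + " lines in " + millis + " ms");
        }
    }
}
```

Calling process(path, false) and process(path, true) on the same file corresponds to variants 1 and 2 in the list above.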

The conclusion is that lines().parallel() makes no difference. Since reading two files in parallel takes less time, splitting the file does give a performance gain. However, it seems that a single file is always read serially.

Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

   time split -l 829326 file # 829326 = 1658652 / 2
   split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total

So my question is, is there any class or function in the Java 8 JDK which can parallelize reading all lines without having to split it first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
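To make the idea concrete, here is a hedged sketch of what such a reader could look like when written by hand. It is not an existing JDK class: ChunkedLineReader, chunkOffsets, readChunk, and readAllLines are hypothetical names. It assumes UTF-8 text, lines terminated by '\n' (optionally preceded by '\r'), and chunks smaller than 2 GB each. Because the file is split by byte offset rather than by line number, each boundary is moved forward to the start of the next line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ChunkedLineReader {

    // Returns chunks + 1 offsets; chunk i covers bytes [offsets[i], offsets[i + 1]).
    // Every offset is the start of a line (or the end of the file).
    static long[] chunkOffsets(Path file, int chunks) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            long[] offsets = new long[chunks + 1];
            offsets[chunks] = size;
            for (int i = 1; i < chunks; i++) {
                long pos = Math.max(size * i / chunks, offsets[i - 1]);
                if (pos > 0) {
                    // Start one byte early, then skip to just past the next '\n'.
                    raf.seek(pos - 1);
                    int b;
                    while ((b = raf.read()) != -1 && b != '\n') {
                        // keep scanning
                    }
                    pos = raf.getFilePointer();
                }
                offsets[i] = pos;
            }
            return offsets;
        }
    }

    // Reads the byte range [start, end) and decodes it into lines.
    static List<String> readChunk(Path file, long start, long end) {
        if (start == end) {
            return Collections.emptyList();
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(Math.toIntExact(end - start));
            long pos = start;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos); // positional read, safe across threads
                if (n < 0) {
                    break;
                }
                pos += n;
            }
            ByteArrayInputStream in = new ByteArrayInputStream(buf.array(), 0, buf.position());
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                return reader.lines().collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads all chunks in parallel; the ordered stream keeps the lines in file order.
    static List<String> readAllLines(Path file, int chunks) throws IOException {
        long[] offsets = chunkOffsets(file, chunks);
        return IntStream.range(0, chunks)
                .parallel()
                .mapToObj(i -> readChunk(file, offsets[i], offsets[i + 1]))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: ChunkedLineReader <file>");
            return;
        }
        int chunks = Runtime.getRuntime().availableProcessors();
        System.out.println(readAllLines(Paths.get(args[0]), chunks).size() + " lines");
    }
}
```

Whether this is actually faster depends on the storage and on how expensive the per-line work is, so it should be measured rather than assumed.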

user3001 asked Sep 07 '14 15:09


1 Answer

You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

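If you load the file into memory through a file channel (for example, by memory-mapping it), you should be able to process the data in parallel from there with great speed. Chances are you won't even need the parallel step, though, as reading this way should already give you a large speedup.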
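One possible reading of this suggestion is to memory-map the file and split it into lines from the mapped buffer. The sketch below is written under that assumption and is not taken from the linked post: the class name MappedLines and its methods are hypothetical, and it assumes a UTF-8 file smaller than 2 GB with '\n' line endings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

final class MappedLines {

    // Maps the whole file read-only and cuts it into lines at each '\n' byte.
    static List<String> readLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            List<String> lines = new ArrayList<>();
            int limit = buf.limit();
            int lineStart = 0;
            for (int i = 0; i < limit; i++) {
                if (buf.get(i) == '\n') {
                    lines.add(decode(buf, lineStart, i));
                    lineStart = i + 1;
                }
            }
            if (lineStart < limit) {
                // Last line without a trailing newline.
                lines.add(decode(buf, lineStart, limit));
            }
            return lines;
        }
    }

    // Decodes the bytes in [from, to) as UTF-8 without touching buf's own position.
    private static String decode(ByteBuffer buf, int from, int to) {
        ByteBuffer slice = buf.duplicate();
        slice.position(from);
        slice.limit(to);
        return StandardCharsets.UTF_8.decode(slice).toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: MappedLines <file>");
            return;
        }
        List<String> lines = readLines(Paths.get(args[0]));
        // Once the lines are in an ArrayList, a parallel stream can split them evenly.
        long nonEmpty = lines.parallelStream().filter(line -> !line.isEmpty()).count();
        System.out.println(lines.size() + " lines, " + nonEmpty + " non-empty");
    }
}
```

The list returned by readLines can then be processed with parallelStream(), since an in-memory list can be divided among threads far more evenly than a stream that reads the file lazily.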

matthewmatician answered Nov 04 '22 17:11
