Why is the newer/faster Java 8 way of sorting acting worse?

Question

List<Point> pixels = new ArrayList<>(width * height); // 1280*960

for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        pixels.add(new Point(x, y));

// Java 7 sorting
Collections.sort(pixels, comparator);

// Java 8 sorting
pixels = pixels.stream().parallel().sorted(comparator).collect(Collectors.toList());

When using any sorting method, I get slow performance at first and it improves later. I expect this, because the JIT compiler needs time to optimize the high-use code.

The weird thing is, the old sorter is a little slow at first, while the new sorter much more so, more than 60% slower. After a while, the new sorter becomes faster, as expected. But the way that the first two/three executions are so slow is simply unacceptable.

Java 7 collection sorter
0.258992413
0.265509443
0.536536068
0.117830618
0.136303916
0.111004611
0.134771877
0.108078261

Java 8 stream sorter
0.631757108
0.868032669
0.076455248
0.087101852
0.070401965
0.056989645
0.072018371
0.078908912
0.074237648

Specs:
CPU: Intel I7 3770 (8-core 8M/1M/128K cache)
cmd: javaw -server -cp bin Myclass

Has anyone else experienced worse performance of newer (stream) operations?
Is there a way to fix this slowness? (without inducing start-up delays)

Tagir Valeev · Accepted Answer

Seems that you care about the performance during the warm-up stage (namely, first and second sorting after the JVM start). So probably standard JMH benchmark is not suitable for you. That's ok, let's write the benchmark manually. As we are speaking about dozens of milliseconds, naive benchmark which uses System.nanoTime() would provide enough precision.

You did not supply your Comparator in the question. Simple comparator (like Comparator.comparingInt(p -> p.x)) sorts your data much faster, so I assume you have more complex comparator:

final Comparator<Point> comparator = Comparator.comparingInt(p -> p.x*p.x + p.y*p.y);

It compares points by Euclidean distance from the (0, 0) (no need to take square root as it's monotonic function, so order will not change).

Also let's separate the data preparation from the sorting to measure the sorting performance only:

private Point[] prepareData() {
    Point[] pixels = new Point[width*height];
    int idx = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            pixels[idx++] = new Point(x, y);
    return pixels;
}

I used array instead of List to be able to test Arrays.parallelSort directly. The plain old sort would be like:

public List<Point> sortPlain(Point[] data) {
    List<Point> list = Arrays.asList(data);
    Collections.sort(list, comparator);
    return list;
}

The Parallel Stream API based sort would be

public List<Point> sortParallelStream(Point[] data) {
    return Stream.of(data).parallel().sorted(comparator).collect(Collectors.toList());
}

Let's also add the sequential Stream API version:

public List<Point> sortStream(Point[] data) {
    return Stream.of(data).sorted(comparator).collect(Collectors.toList());
}

And using parallelSort directly:

public List<Point> sortParallel(Point[] data) {
    Arrays.parallelSort(data, comparator);
    return Arrays.asList(data);
}

Measurement code is not very difficult. Here's complete implementation. Note that every test should be launched independently, so we test only one mode during the JVM launch. Here's typical results on my machine (i7-4702MQ 2.20GHz, 4 Cores HT = 8 HW threads, Win7 64bit, java 1.8.0_71).

Iter  Plain      Parallel   Stream     ParallelStream
#01:  0.38362s   0.37364s   0.28255s   0.47821s
#02:  0.23021s   0.25754s   0.18533s   0.72231s
#03:  0.18862s   0.08887s   0.21329s   0.18024s
#04:  0.19810s   0.06158s   0.68004s   0.12166s
#05:  0.19671s   0.06461s   0.17066s   0.08380s
#06:  0.14796s   0.05484s   0.18283s   0.12931s
#07:  0.16588s   0.04920s   0.21481s   0.13379s
#08:  0.21988s   0.05932s   0.19111s   0.12903s
#09:  0.14434s   0.05123s   0.14191s   0.11674s
#10:  0.18771s   0.06174s   0.14977s   0.07237s
#11:  0.15674s   0.05105s   0.21275s   0.06975s
#12:  0.17634s   0.06353s   0.14343s   0.07882s
#13:  0.15085s   0.05318s   0.16004s   0.11029s
#14:  0.18555s   0.05278s   0.19105s   0.12123s
#15:  0.14728s   0.05916s   0.14426s   0.07235s
#16:  0.18781s   0.05708s   0.21455s   0.07884s
#17:  0.14493s   0.12377s   0.14415s   0.11170s
#18:  0.14395s   0.05100s   0.18201s   0.07878s
#19:  0.14849s   0.05437s   0.14484s   0.08364s
#20:  0.14143s   0.12073s   0.18542s   0.11257s

Results for Plain and ParallelStream tests are somewhat similar to yours: first two iterations are much slower with ParallelStream (especially the second one). Also you can note that there's no such effect for direct execution of Arrays.parallelSort. Finally non-parallel stream is the slowest. That's because Stream API always uses intermediate buffer for sorting, so it needs more space and time to perform additional copying to the buffer, sorting it, then perform copying to the resulting list.

Why first two iterations for ParallelStream are so slow (especiall the second one)? Just because you have quite small starting heap to conveniently place all the intermediate buffers, so several full-gc events occur during the first two iterations which end up in significant delay. If you run your test with -verbose:gc you will see for ParallelStream:

[GC (Allocation Failure)  16384K->14368K(62976K), 0.0172833 secs]
[GC (Allocation Failure)  30752K->30776K(79360K), 0.0800204 secs]
[Full GC (Ergonomics)  30776K->30629K(111104K), 0.4487876 secs]
[GC (Allocation Failure)  63394K->74300K(111104K), 0.0215347 secs]
[Full GC (Ergonomics)  74300K->45460K(167936K), 0.1536388 secs]
[GC (Allocation Failure)  76592K->57710K(179712K), 0.0064693 secs]
#01: 0.41506s
[GC (Allocation Failure)  101713K->103534K(180224K), 0.0567087 secs]
[Full GC (Ergonomics)  103534K->39365K(203776K), 0.5636835 secs]
[GC (Allocation Failure)  84021K->53689K(266752K), 0.0103750 secs]
#02: 0.71832s

No more Full GC events after this as heap is sufficiently enlarged now. Compare with Plain launch:

[GC (Allocation Failure)  16384K->14400K(62976K), 0.0162299 secs]
[GC (Allocation Failure)  30784K->30784K(79360K), 0.0762906 secs]
[Full GC (Ergonomics)  30784K->30629K(111616K), 0.4548198 secs]
#01: 0.43610s
[GC (Allocation Failure)  63397K->58989K(111616K), 0.0330308 secs]
[Full GC (Ergonomics)  58989K->25278K(133120K), 0.2479148 secs]
#02: 0.20753s

Only two Full GC which took significantly less time, because there's significantly less garbage.

Let's set the initial heap size to 1Gb with -Xms1G to reduce the GC pressure. Now we have completely different results:

Iter  Plain      Parallel   Stream     ParallelStream
#01:  0.38349s   0.33331s   0.23834s   0.24078s      
#02:  0.18514s   0.20530s   0.16650s   0.07802s      
#03:  0.16642s   0.10417s   0.16267s   0.11826s      
#04:  0.16409s   0.05015s   0.19890s   0.06926s      
#05:  0.14475s   0.05241s   0.15041s   0.06932s      
#06:  0.14358s   0.05584s   0.14611s   0.06684s      
#07:  0.17644s   0.04913s   0.14619s   0.06716s      
#08:  0.14252s   0.04642s   0.19333s   0.10813s      
#09:  0.14427s   0.04547s   0.14673s   0.06900s      
#10:  0.14696s   0.04634s   0.14927s   0.06712s      
#11:  0.14254s   0.04682s   0.15107s   0.07874s      
#12:  0.15455s   0.09560s   0.19370s   0.06663s      
#13:  0.15544s   0.05133s   0.15110s   0.13052s      
#14:  0.18636s   0.04788s   0.15928s   0.06688s      
#15:  0.14824s   0.04833s   0.15218s   0.06624s      
#16:  0.15068s   0.04949s   0.19183s   0.13925s      
#17:  0.14605s   0.04695s   0.14770s   0.12714s      
#18:  0.14130s   0.04660s   0.14903s   0.15428s      
#19:  0.14695s   0.05491s   0.14389s   0.07467s      
#20:  0.15050s   0.04700s   0.18919s   0.07662s

Now results are much more stable even for Plain (because we have much less GC pauses) and ParallelStream now always better than Plain (though it still produces more objects, it's easier to allocate them and collect garbage when you have bigger heap). No full gc events were observed with -Xms1G for all four tests

So to conclude:

ParallelStream produces more garbage and does additional copying which slows down the operation.
After JVM startup default heap setting is too small, so garbage collector takes quite a lot of time until it decides to sufficiently increase the overall heap size.
If you want maximal parallel speed, use Arrays.parallelSort directly as it sorts in-place. Especially if you know in advance the size of your dataset.

Finally it should be noted that while you are calling Collections.sort(list, comparator) as "Java 7 collection sorter" when launched under Java-7 it works 8-10% slower, because Collections.sort implementation changed.

Why is the newer/faster Java 8 way of sorting acting worse?

Tags:

java

performance

sorting

java-8

java-stream

Mark Jeronimus

1 Answers

Tagir Valeev

Recent Activity

Donate For Us

Why is the newer/faster Java 8 way of sorting acting worse?

Tags:

java

performance

sorting

java-8

java-stream

Mark Jeronimus

1 Answers

Tagir Valeev

Related questions

Recent Activity

Donate For Us