List<Point> pixels = new ArrayList<>(width * height); // 1280*960
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++)
pixels.add(new Point(x, y));
// Java 7 sorting
Collections.sort(pixels, comparator);
// Java 8 sorting
pixels = pixels.stream().parallel().sorted(comparator).collect(Collectors.toList());
When using any sorting method, I get slow performance at first and it improves later. I expect this, because the JIT compiler needs time to optimize the high-use code.
The weird thing is, the old sorter is a little slow at first, while the new sorter much more so, more than 60% slower. After a while, the new sorter becomes faster, as expected. But the way that the first two/three executions are so slow is simply unacceptable.
Java 7 collection sorter
0.258992413
0.265509443
0.536536068
0.117830618
0.136303916
0.111004611
0.134771877
0.108078261
Java 8 stream sorter
0.631757108
0.868032669
0.076455248
0.087101852
0.070401965
0.056989645
0.072018371
0.078908912
0.074237648
Specs:
CPU: Intel I7 3770 (8-core 8M/1M/128K cache)
cmd: javaw -server -cp bin Myclass
Seems that you care about the performance during the warm-up stage (namely, first and second sorting after the JVM start). So probably standard JMH benchmark is not suitable for you. That's ok, let's write the benchmark manually. As we are speaking about dozens of milliseconds, naive benchmark which uses System.nanoTime()
would provide enough precision.
You did not supply your Comparator
in the question. Simple comparator (like Comparator.comparingInt(p -> p.x)
) sorts your data much faster, so I assume you have more complex comparator:
final Comparator<Point> comparator = Comparator.comparingInt(p -> p.x*p.x + p.y*p.y);
It compares points by Euclidean distance from the (0, 0)
(no need to take square root as it's monotonic function, so order will not change).
Also let's separate the data preparation from the sorting to measure the sorting performance only:
private Point[] prepareData() {
Point[] pixels = new Point[width*height];
int idx = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++)
pixels[idx++] = new Point(x, y);
return pixels;
}
I used array instead of List
to be able to test Arrays.parallelSort
directly. The plain old sort would be like:
public List<Point> sortPlain(Point[] data) {
List<Point> list = Arrays.asList(data);
Collections.sort(list, comparator);
return list;
}
The Parallel Stream API based sort would be
public List<Point> sortParallelStream(Point[] data) {
return Stream.of(data).parallel().sorted(comparator).collect(Collectors.toList());
}
Let's also add the sequential Stream API version:
public List<Point> sortStream(Point[] data) {
return Stream.of(data).sorted(comparator).collect(Collectors.toList());
}
And using parallelSort
directly:
public List<Point> sortParallel(Point[] data) {
Arrays.parallelSort(data, comparator);
return Arrays.asList(data);
}
Measurement code is not very difficult. Here's complete implementation. Note that every test should be launched independently, so we test only one mode during the JVM launch. Here's typical results on my machine (i7-4702MQ 2.20GHz, 4 Cores HT = 8 HW threads, Win7 64bit, java 1.8.0_71).
Iter Plain Parallel Stream ParallelStream
#01: 0.38362s 0.37364s 0.28255s 0.47821s
#02: 0.23021s 0.25754s 0.18533s 0.72231s
#03: 0.18862s 0.08887s 0.21329s 0.18024s
#04: 0.19810s 0.06158s 0.68004s 0.12166s
#05: 0.19671s 0.06461s 0.17066s 0.08380s
#06: 0.14796s 0.05484s 0.18283s 0.12931s
#07: 0.16588s 0.04920s 0.21481s 0.13379s
#08: 0.21988s 0.05932s 0.19111s 0.12903s
#09: 0.14434s 0.05123s 0.14191s 0.11674s
#10: 0.18771s 0.06174s 0.14977s 0.07237s
#11: 0.15674s 0.05105s 0.21275s 0.06975s
#12: 0.17634s 0.06353s 0.14343s 0.07882s
#13: 0.15085s 0.05318s 0.16004s 0.11029s
#14: 0.18555s 0.05278s 0.19105s 0.12123s
#15: 0.14728s 0.05916s 0.14426s 0.07235s
#16: 0.18781s 0.05708s 0.21455s 0.07884s
#17: 0.14493s 0.12377s 0.14415s 0.11170s
#18: 0.14395s 0.05100s 0.18201s 0.07878s
#19: 0.14849s 0.05437s 0.14484s 0.08364s
#20: 0.14143s 0.12073s 0.18542s 0.11257s
Results for Plain
and ParallelStream
tests are somewhat similar to yours: first two iterations are much slower with ParallelStream
(especially the second one). Also you can note that there's no such effect for direct execution of Arrays.parallelSort
. Finally non-parallel stream is the slowest. That's because Stream API always uses intermediate buffer for sorting, so it needs more space and time to perform additional copying to the buffer, sorting it, then perform copying to the resulting list.
Why first two iterations for ParallelStream
are so slow (especiall the second one)? Just because you have quite small starting heap to conveniently place all the intermediate buffers, so several full-gc events occur during the first two iterations which end up in significant delay. If you run your test with -verbose:gc
you will see for ParallelStream
:
[GC (Allocation Failure) 16384K->14368K(62976K), 0.0172833 secs]
[GC (Allocation Failure) 30752K->30776K(79360K), 0.0800204 secs]
[Full GC (Ergonomics) 30776K->30629K(111104K), 0.4487876 secs]
[GC (Allocation Failure) 63394K->74300K(111104K), 0.0215347 secs]
[Full GC (Ergonomics) 74300K->45460K(167936K), 0.1536388 secs]
[GC (Allocation Failure) 76592K->57710K(179712K), 0.0064693 secs]
#01: 0.41506s
[GC (Allocation Failure) 101713K->103534K(180224K), 0.0567087 secs]
[Full GC (Ergonomics) 103534K->39365K(203776K), 0.5636835 secs]
[GC (Allocation Failure) 84021K->53689K(266752K), 0.0103750 secs]
#02: 0.71832s
No more Full GC events after this as heap is sufficiently enlarged now. Compare with Plain
launch:
[GC (Allocation Failure) 16384K->14400K(62976K), 0.0162299 secs]
[GC (Allocation Failure) 30784K->30784K(79360K), 0.0762906 secs]
[Full GC (Ergonomics) 30784K->30629K(111616K), 0.4548198 secs]
#01: 0.43610s
[GC (Allocation Failure) 63397K->58989K(111616K), 0.0330308 secs]
[Full GC (Ergonomics) 58989K->25278K(133120K), 0.2479148 secs]
#02: 0.20753s
Only two Full GC which took significantly less time, because there's significantly less garbage.
Let's set the initial heap size to 1Gb with -Xms1G
to reduce the GC pressure. Now we have completely different results:
Iter Plain Parallel Stream ParallelStream
#01: 0.38349s 0.33331s 0.23834s 0.24078s
#02: 0.18514s 0.20530s 0.16650s 0.07802s
#03: 0.16642s 0.10417s 0.16267s 0.11826s
#04: 0.16409s 0.05015s 0.19890s 0.06926s
#05: 0.14475s 0.05241s 0.15041s 0.06932s
#06: 0.14358s 0.05584s 0.14611s 0.06684s
#07: 0.17644s 0.04913s 0.14619s 0.06716s
#08: 0.14252s 0.04642s 0.19333s 0.10813s
#09: 0.14427s 0.04547s 0.14673s 0.06900s
#10: 0.14696s 0.04634s 0.14927s 0.06712s
#11: 0.14254s 0.04682s 0.15107s 0.07874s
#12: 0.15455s 0.09560s 0.19370s 0.06663s
#13: 0.15544s 0.05133s 0.15110s 0.13052s
#14: 0.18636s 0.04788s 0.15928s 0.06688s
#15: 0.14824s 0.04833s 0.15218s 0.06624s
#16: 0.15068s 0.04949s 0.19183s 0.13925s
#17: 0.14605s 0.04695s 0.14770s 0.12714s
#18: 0.14130s 0.04660s 0.14903s 0.15428s
#19: 0.14695s 0.05491s 0.14389s 0.07467s
#20: 0.15050s 0.04700s 0.18919s 0.07662s
Now results are much more stable even for Plain
(because we have much less GC pauses) and ParallelStream
now always better than Plain
(though it still produces more objects, it's easier to allocate them and collect garbage when you have bigger heap). No full gc events were observed with -Xms1G
for all four tests
So to conclude:
Arrays.parallelSort
directly as it sorts in-place. Especially if you know in advance the size of your dataset.Finally it should be noted that while you are calling Collections.sort(list, comparator)
as "Java 7 collection sorter" when launched under Java-7 it works 8-10% slower, because Collections.sort
implementation changed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With