I've been playing with Java 8 Streams - API
and I decided to microbenchmark stream()
and parallelStream()
streams. As expected the parallelStream()
was as twice as fast, but something else popped up - If I sort the data before passing them to the filter
it takes 5-8 times more time to filter->map->collect
the result, than passing an unsorted list.
(Stream) Elapsed time [ns] : 53733996 (53 ms)
(ParallelStream) Elapsed time [ns] : 25901907 (25 ms)
(Stream) Elapsed time [ns] : 336976149 (336 ms)
(ParallelStream) Elapsed time [ns] : 204781387 (204 ms)
package com.github.svetlinzarev.playground.javalang.lambda;
import static java.lang.Long.valueOf;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import com.github.svetlinzarev.playground.util.time.Stopwatch;
public class MyFirstLambda {
private static final int ELEMENTS = 1024 * 1024 * 16;
private static List<Integer> getRandom(int nElements) {
final Random random = new Random();
final List<Integer> data = new ArrayList<Integer>(nElements);
for (int i = 0; i < MyFirstLambda.ELEMENTS; i++) {
data.add(random.nextInt(MyFirstLambda.ELEMENTS));
}
return data;
}
private static void benchStream(List<Integer> data) {
final Stopwatch stopwatch = new Stopwatch();
final List<Long> smallLongs = data.stream()
.filter(i -> i.intValue() < 16)
.map(Long::valueOf)
.collect(Collectors.toList());
stopwatch.log("Stream");
System.out.println(smallLongs);
}
private static void benchParallelStream(List<Integer> data) {
final Stopwatch stopwatch = new Stopwatch();
final List<Long> smallLongs = data.parallelStream()
.filter(i -> i.intValue() < 16)
.map(Long::valueOf)
.collect(Collectors.toList());
stopwatch.log("ParallelStream");
System.out.println(smallLongs);
}
public static void main(String[] args) {
final List<Integer> data = MyFirstLambda.getRandom(MyFirstLambda.ELEMENTS);
// Collections.sort(data, (first, second) -> first.compareTo(second)); //<- Sort the data
MyFirstLambda.benchStream(data);
MyFirstLambda.benchParallelStream(data);
MyFirstLambda.benchStream(data);
MyFirstLambda.benchParallelStream(data);
MyFirstLambda.benchStream(data);
MyFirstLambda.benchParallelStream(data);
MyFirstLambda.benchStream(data);
MyFirstLambda.benchParallelStream(data);
MyFirstLambda.benchStream(data);
MyFirstLambda.benchParallelStream(data);
}
}
Here is a better benchmark code
package com.github.svetlinzarev.playground.javalang.lambda;
import static java.lang.Long.valueOf;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import com.github.svetlinzarev.playground.util.time.Stopwatch;
public class MyFirstLambda {
private static final int ELEMENTS = 1024 * 1024 * 10;
private static final int SMALLER_THAN = 16;
private static final int WARM_UP_ITERRATIONS = 1000;
private static List<Integer> getRandom(int nElements) {
final Random random = new Random();
final List<Integer> data = new ArrayList<Integer>(nElements);
for (int i = 0; i < MyFirstLambda.ELEMENTS; i++) {
data.add(random.nextInt(MyFirstLambda.ELEMENTS));
}
return data;
}
private static List<Long> filterStream(List<Integer> data) {
final List<Long> smallLongs = data.stream()
.filter(i -> i.intValue() < MyFirstLambda.SMALLER_THAN)
.map(Long::valueOf)
.collect(Collectors.toList());
return smallLongs;
}
private static List<Long> filterParallelStream(List<Integer> data) {
final List<Long> smallLongs = data.parallelStream()
.filter(i -> i.intValue() < MyFirstLambda.SMALLER_THAN)
.map(Long::valueOf)
.collect(Collectors.toList());
return smallLongs;
}
private static long filterAndCount(List<Integer> data) {
return data.stream()
.filter(i -> i.intValue() < MyFirstLambda.SMALLER_THAN)
.count();
}
private static long filterAndCountinParallel(List<Integer> data) {
return data.parallelStream()
.filter(i -> i.intValue() < MyFirstLambda.SMALLER_THAN)
.count();
}
private static void warmUp(List<Integer> data) {
for (int i = 0; i < MyFirstLambda.WARM_UP_ITERRATIONS; i++) {
MyFirstLambda.filterStream(data);
MyFirstLambda.filterParallelStream(data);
MyFirstLambda.filterAndCount(data);
MyFirstLambda.filterAndCountinParallel(data);
}
}
private static void benchmark(List<Integer> data, String message) throws InterruptedException {
System.gc();
Thread.sleep(1000); // Give it enough time to complete the GC cycle
final Stopwatch stopwatch = new Stopwatch();
MyFirstLambda.filterStream(data);
stopwatch.log("Stream: " + message);
System.gc();
Thread.sleep(1000); // Give it enough time to complete the GC cycle
stopwatch.reset();
MyFirstLambda.filterParallelStream(data);
stopwatch.log("ParallelStream: " + message);
System.gc();
Thread.sleep(1000); // Give it enough time to complete the GC cycle
stopwatch.reset();
MyFirstLambda.filterAndCount(data);
stopwatch.log("Count: " + message);
System.gc();
Thread.sleep(1000); // Give it enough time to complete the GC cycle
stopwatch.reset();
MyFirstLambda.filterAndCount(data);
stopwatch.log("Count in parallel: " + message);
}
public static void main(String[] args) throws InterruptedException {
final List<Integer> data = MyFirstLambda.getRandom(MyFirstLambda.ELEMENTS);
MyFirstLambda.warmUp(data);
MyFirstLambda.benchmark(data, "UNSORTED");
Collections.sort(data, (first, second) -> first.compareTo(second));
MyFirstLambda.benchmark(data, "SORTED");
Collections.sort(data, (first, second) -> second.compareTo(first));
MyFirstLambda.benchmark(data, "IN REVERSE ORDER");
}
}
And again the results are similar:
16:09:20.470 [main] INFO c.g.s.playground.util.time.Stopwatch - (Stream: UNSORTED) Elapsed time [ns] : 66812263 (66 ms)
16:09:22.149 [main] INFO c.g.s.playground.util.time.Stopwatch - (ParallelStream: UNSORTED) Elapsed time [ns] : 39580682 (39 ms)
16:09:23.875 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count: UNSORTED) Elapsed time [ns] : 97852866 (97 ms)
16:09:25.537 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count in parallel: UNSORTED) Elapsed time [ns] : 94884189 (94 ms)
16:09:35.608 [main] INFO c.g.s.playground.util.time.Stopwatch - (Stream: SORTED) Elapsed time [ns] : 361717676 (361 ms)
16:09:38.439 [main] INFO c.g.s.playground.util.time.Stopwatch - (ParallelStream: SORTED) Elapsed time [ns] : 150115808 (150 ms)
16:09:41.308 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count: SORTED) Elapsed time [ns] : 338335743 (338 ms)
16:09:44.209 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count in parallel: SORTED) Elapsed time [ns] : 370968432 (370 ms)
16:09:50.693 [main] INFO c.g.s.playground.util.time.Stopwatch - (Stream: IN REVERSE ORDER) Elapsed time [ns] : 352036140 (352 ms)
16:09:53.323 [main] INFO c.g.s.playground.util.time.Stopwatch - (ParallelStream: IN REVERSE ORDER) Elapsed time [ns] : 151044664 (151 ms)
16:09:56.159 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count: IN REVERSE ORDER) Elapsed time [ns] : 359281197 (359 ms)
16:09:58.991 [main] INFO c.g.s.playground.util.time.Stopwatch - (Count in parallel: IN REVERSE ORDER) Elapsed time [ns] : 353177542 (353 ms)
So, my question is why filtering an unsorted list is faster than filtering a sorted list ?
In C++, it is faster to process a sorted array than an unsorted array because of branch prediction. In computer architecture, a branch prediction determines whether a conditional branch (jump) in the instruction flow of a program is likely to be taken or not. Branch prediction doesn't play a significant role here.
Essentially, sorting and filtering are tools that let you organize your data. When you sort data, you are putting it in order. Filtering data lets you hide unimportant data and focus only on the data you're interested in.
The advantage of a sorted array for the "target sum" problem is even better: you don't search the array at all. Instead, you start with pointers at the two ends. If the sum is equal to the target, emit that and move both pointers in. If less than the target, increment the lower pointer.
A sorted array is an array that is organised in a specific order (alphabetic, chronological, ordinal, cardinal order). An unsorted array is one that is not organised in any specific order.
When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.
When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.
The penalty from cache misses outweighs the saved branch prediction penalty in this case.
This question's accepted answer, answers my question too: Why is processing a sorted array slower than an unsorted array?
When I create the original List
sorted - i.e. it's elements are sequentally in memory, there is no difference in the execution time and it is equal to the unsorted
version when the List
is filled with random numbers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With