A method takes comma separated words as a String
and returns a String
of comma separated words with the words in natural sort order, not containing any 4 letter words, contain all words in UPPER case and no duplicates. The 1st approach is quite slow in comparison to the 2nd approach. Can you help me understand why and how can I improve my approach?
Approach 1:
public String stringProcessing(String s){
Stream<String> tokens = Arrays.stream(s.split(","));
return tokens.filter(t -> t.length() != 4) .distinct()
.sorted()
.collect(Collectors.joining(",")).toUpperCase();
}
Approach 2:
public String processing(String s) {
String[] tokens = s.split(",");
Set<String> resultSet = new TreeSet<>();
for(String t:tokens){
if(t.length() != 4)
resultSet.add(t.toUpperCase());
}
StringBuilder result = new StringBuilder();
resultSet.forEach(key -> {
result.append(key).append(",");
});
result.deleteCharAt(result.length()-1);
return result.toString();
}
A performance comparison without documenting the used JRE version, input data sets nor benchmark methodology is not suitable to draw any conclusions.
Further, there are fundamental differences between your variants. You first variant processes the original strings when using distinct()
, potentially keeping much more elements than the second variant, joins all of them to a string, before transforming the complete result string to upper case. In contrast, your second variant transforms individual elements before adding to the set, so only strings with a distinct upper case representation are processed further. So the second variant may need significantly less memory and process less elements when joining.
So when doing entirely different things, there is not much sense in comparing the performance of these operations. A better comparison would be between these two variants:
public String variant1(String s){
Stream<String> tokens = Arrays.stream(s.split(","));
return tokens.filter(t -> t.length() != 4)
.map(String::toUpperCase)
.sorted().distinct()
.collect(Collectors.joining(","));
}
public String variant2(String s) {
String[] tokens = s.split(",");
Set<String> resultSet = new TreeSet<>();
for(String t:tokens){
if(t.length() != 4)
resultSet.add(t.toUpperCase());
}
return String.join(",", resultSet);
}
Note that I changed the order of sorted()
and distinct()
; as discussed in this answer, applying distinct()
directly after sorted()
allows to exploit the sorted nature of the stream within the distinct operation.
You may also consider not creating a temporary array holding all substrings before streaming over them:
public String variant1(String s){
return Pattern.compile(",").splitAsStream(s)
.filter(t -> t.length() != 4)
.map(String::toUpperCase)
.sorted().distinct()
.collect(Collectors.joining(","));
}
You may also add a third variant,
public String variant3(String s) {
Set<String> resultSet = new TreeSet<>();
int o = 0, p;
for(p = s.indexOf(','); p>=0; p = s.indexOf(',', o=p+1)) {
if(p-o == 4) continue;
resultSet.add(s.substring(o, p).toUpperCase());
}
if(s.length()-o != 4) resultSet.add(s.substring(o).toUpperCase());
return String.join(",", resultSet);
}
which doesn’t create an array of substrings and even skips the substring creation for those not matching the filter criteria. This isn’t meant to suggest to go such low level in production code, but that there always might be a faster variant, so it doesn’t matter whether the variant you’re using is the fastest, but rather whether it runs at reasonable speed while being maintainable.
I guess it was only about time before some actually posted some JMH tests. I took Holger's methods and tested them:
@BenchmarkMode(value = { Mode.AverageTime })
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class StreamVsLoop {
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder().include(StreamVsLoop.class.getSimpleName())
.build();
new Runner(opt).run();
}
@Param(value = {
"a, b, c",
"a, bb, ccc, dddd, eeeee, ffffff, ggggggg, hhhhhhhh",
"a, bb, ccc, dddd, eeeee, ffffff, ggggggg, hhhhhhhh, ooooooooo, tttttttttttttt, mmmmmmmmmmmmmmmmmm" })
String s;
@Benchmark
@Fork(1)
public String stream() {
Stream<String> tokens = Arrays.stream(s.split(","));
return tokens.filter(t -> t.length() != 4)
.map(String::toUpperCase)
.sorted().distinct()
.collect(Collectors.joining(","));
}
@Benchmark
@Fork(1)
public String loop() {
String[] tokens = s.split(",");
Set<String> resultSet = new TreeSet<>();
for (String t : tokens) {
if (t.length() != 4) {
resultSet.add(t.toUpperCase());
}
}
return String.join(",", resultSet);
}
@Benchmark
@Fork(1)
public String sortedDistinct() {
return Pattern.compile(",").splitAsStream(s)
.filter(t -> t.length() != 4)
.map(String::toUpperCase)
.sorted()
.distinct()
.collect(Collectors.joining(","));
}
@Benchmark
@Fork(1)
public String distinctSorted() {
return Pattern.compile(",").splitAsStream(s)
.filter(t -> t.length() != 4)
.map(String::toUpperCase)
.distinct()
.sorted()
.collect(Collectors.joining(","));
}
}
And here are the results:
stream 3 args 574.042
loop 3 args 393.364
sortedDistinct 3 args 829.077
distinctSorted 3 args 836.558
stream 8 args 1144.488
loop 8 args 1014.756
sortedDistinct 8 args 1533.968
distinctSorted 8 args 1745.055
stream 11 args 1829.571
loop 11 args 1514.138
sortedDistinct 11 args 1940.256
distinctSorted 11 args 2591.715
The results are sort of obvious, streams are slower, but not that much, probably the readability wins. Also, Holger is right (but he rarely, if ever, isn't)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With