Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why my string manipulation is slow using lambda expression?

A method takes comma separated words as a String and returns a String of comma separated words with the words in natural sort order, not containing any 4 letter words, contain all words in UPPER case and no duplicates. The 1st approach is quite slow in comparison to the 2nd approach. Can you help me understand why and how can I improve my approach?

Approach 1:

public String stringProcessing(String s){
      Stream<String> tokens = Arrays.stream(s.split(","));
      return tokens.filter(t -> t.length() != 4) .distinct()
                   .sorted() 
                   .collect(Collectors.joining(",")).toUpperCase();
}

Approach 2:

public String processing(String s) {
    String[] tokens = s.split(",");
    Set<String> resultSet = new TreeSet<>();
    for(String t:tokens){
        if(t.length() !=  4)
            resultSet.add(t.toUpperCase());
    }        
    StringBuilder result = new StringBuilder();
    resultSet.forEach(key -> {
        result.append(key).append(","); 
    });
    result.deleteCharAt(result.length()-1);
    return result.toString();
}
like image 540
Kumar Manish Avatar asked Jan 30 '18 10:01

Kumar Manish


2 Answers

A performance comparison without documenting the used JRE version, input data sets nor benchmark methodology is not suitable to draw any conclusions.

Further, there are fundamental differences between your variants. You first variant processes the original strings when using distinct(), potentially keeping much more elements than the second variant, joins all of them to a string, before transforming the complete result string to upper case. In contrast, your second variant transforms individual elements before adding to the set, so only strings with a distinct upper case representation are processed further. So the second variant may need significantly less memory and process less elements when joining.

So when doing entirely different things, there is not much sense in comparing the performance of these operations. A better comparison would be between these two variants:

public String variant1(String s){
    Stream<String> tokens = Arrays.stream(s.split(","));
    return tokens.filter(t -> t.length() != 4)
                 .map(String::toUpperCase)
                 .sorted().distinct()
                 .collect(Collectors.joining(","));
}

public String variant2(String s) {
    String[] tokens = s.split(",");
    Set<String> resultSet = new TreeSet<>();
    for(String t:tokens){
        if(t.length() !=  4)
            resultSet.add(t.toUpperCase());
    }
    return String.join(",", resultSet);
}

Note that I changed the order of sorted() and distinct(); as discussed in this answer, applying distinct() directly after sorted() allows to exploit the sorted nature of the stream within the distinct operation.

You may also consider not creating a temporary array holding all substrings before streaming over them:

public String variant1(String s){
    return Pattern.compile(",").splitAsStream(s)
            .filter(t -> t.length() != 4)
            .map(String::toUpperCase)
            .sorted().distinct()
            .collect(Collectors.joining(","));
}

You may also add a third variant,

public String variant3(String s) {
    Set<String> resultSet = new TreeSet<>();
    int o = 0, p;
    for(p = s.indexOf(','); p>=0; p = s.indexOf(',', o=p+1)) {
        if(p-o == 4) continue;
        resultSet.add(s.substring(o, p).toUpperCase());
    }
    if(s.length()-o != 4) resultSet.add(s.substring(o).toUpperCase());
    return String.join(",", resultSet);
}

which doesn’t create an array of substrings and even skips the substring creation for those not matching the filter criteria. This isn’t meant to suggest to go such low level in production code, but that there always might be a faster variant, so it doesn’t matter whether the variant you’re using is the fastest, but rather whether it runs at reasonable speed while being maintainable.

like image 115
Holger Avatar answered Oct 15 '22 13:10

Holger


I guess it was only about time before some actually posted some JMH tests. I took Holger's methods and tested them:

@BenchmarkMode(value = { Mode.AverageTime })
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class StreamVsLoop {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(StreamVsLoop.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }

    @Param(value = {
            "a, b, c",
            "a, bb, ccc, dddd, eeeee, ffffff, ggggggg, hhhhhhhh",
            "a, bb, ccc, dddd, eeeee, ffffff, ggggggg, hhhhhhhh, ooooooooo, tttttttttttttt, mmmmmmmmmmmmmmmmmm" })
    String s;

    @Benchmark
    @Fork(1)
    public String stream() {
        Stream<String> tokens = Arrays.stream(s.split(","));
        return tokens.filter(t -> t.length() != 4)
                .map(String::toUpperCase)
                .sorted().distinct()
                .collect(Collectors.joining(","));
    }

    @Benchmark
    @Fork(1)
    public String loop() {
        String[] tokens = s.split(",");
        Set<String> resultSet = new TreeSet<>();
        for (String t : tokens) {
            if (t.length() != 4) {
                resultSet.add(t.toUpperCase());
            }
        }
        return String.join(",", resultSet);
    }

    @Benchmark
    @Fork(1)
    public String sortedDistinct() {
        return Pattern.compile(",").splitAsStream(s)
                .filter(t -> t.length() != 4)
                .map(String::toUpperCase)
                .sorted()
                .distinct()
                .collect(Collectors.joining(","));
    }

    @Benchmark
    @Fork(1)
    public String distinctSorted() {
        return Pattern.compile(",").splitAsStream(s)
                .filter(t -> t.length() != 4)
                .map(String::toUpperCase)
                .distinct()
                .sorted()
                .collect(Collectors.joining(","));
    }
}

And here are the results:

 stream              3 args         574.042
 loop                3 args         393.364
 sortedDistinct      3 args         829.077
 distinctSorted      3 args         836.558

 stream              8 args         1144.488
 loop                8 args         1014.756
 sortedDistinct      8 args         1533.968
 distinctSorted      8 args         1745.055

 stream             11 args         1829.571
 loop               11 args         1514.138
 sortedDistinct     11 args         1940.256
 distinctSorted     11 args         2591.715

The results are sort of obvious, streams are slower, but not that much, probably the readability wins. Also, Holger is right (but he rarely, if ever, isn't)

like image 25
Eugene Avatar answered Oct 15 '22 13:10

Eugene