Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

If a String is in the list (given at compile-time): Is HashSet the fastest solution?

Given a fixed list of strings at compile-time e.g.:

"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"

Utilizing HashSet we have a very fast way (O(1)) to tell if a String provided at runtime is in the list of strings.

For example:

Set<String> SET = new HashSet<>(Arrays.asList( "zero", "one", "two", "three",
        "four", "five", "six", "seven", "eight", "nine"));

boolean listed = SET.contains("some-text");

Are there any other faster ways to tell if a String provided at runtime is in the initial list of strings (given at compile-time) or is HashSet the fastest solution?

Let's Formalize the Question

Given the interface below:

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine" };

    boolean contains(String s);
}

Provide the fastest possible implementation given that the values listed in Checker.VALUES will not be changed (so for example you can use those literals in your code if you want to).

Demonstration: HashSetChecker Implementation

The implementation which uses a HashSet would look like this:

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Code to Test an Implementation

When testing, we want to test the Checker.contains() method with the initial interned strings, also with other interned strings (String literals) which will not be found, and also Strings that have the same values (are equals) but are not interned Strings. For this purpose the following CheckerTester.TESTS array will be used.

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three",
            "four", "five", "six", "seven", "eight", "nine", "ten", "eleven",
            "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"),
            new String("three"), new String("four"), new String("five"),
            new String("six"), new String("seven"), new String("eight"),
            new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"),
            new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"),
            new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(),
                (end - start) / 1_000_000);
    }
}
like image 591
icza Avatar asked Mar 19 '23 06:03

icza


1 Answers

Let's see some implementations:

First idea: HashSet with big capacity

Some might say that multiple Strings might end up in the same HashSet bucket, so let's use a big initial capacity:

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    { set.addAll(Arrays.asList(VALUES)); }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Using HashMap instead of HashSet

HashSet uses HashMap in its implementation, let's just do the same: get rid of the "shell" HashSet:

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }
    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

TreeSet

Some might say give TreeSet a try too (it's sorted so it might have a chance). I know it's O(log(n)), but n is small (10 in this case):

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Short-circuit OR checks (if-else logic):

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s)
                || "three".equals(s) || "four".equals(s) || "five".equals(s)
                || "six".equals(s) || "seven".equals(s) || "eight".equals(s)
                || "nine".equals(s);
    }
}

Reference equality then revert to short-circuit OR checks

Some might say that we should first check if by reference we have the String and if not then revert to short-circuit OR checks:

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s
                || "four" == s || "five" == s || "six" == s || "seven" == s
                || "eight" == s || "nine" == s || super.contains(s);
    }
}

Using switch: fastest I've found so far

Since we have a fixed list of Strings known at compile-time, we can take advantage of the possibility that we can use Strings in switch statements.

We can add a case for each String from the fixed list and return true, and add a default case to return false.

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
            case "zero":
            case "one":
            case "two":
            case "three":
            case "four":
            case "five":
            case "six":
            case "seven":
            case "eight":
            case "nine":
                return true;
            default:
                return false;
        }
    }
}

New finding: Embedded switches (2 switch blocks)

Maaartinus's answer about perfect hashing made me thinking. Even if we have a perfect hash, it still has to run on the whole content of the String provided at runtime which we want to check. So instead we should use something that is available right in the String: its length. Based on the length of the String we use a switch, and inside that switch we use an internal switch only listing the strings with the specified length. With this, we reduce the number of case statements inside a switch:

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
                case "one":
                case "two":
                case "six":
                    return true;
                default:
                    return false;
            }
        case 4:
            switch (s) {
                case "zero":
                case "four":
                case "five":
                case "nine":
                    return true;
                default:
                    return false;
            }
        case 5:
            switch (s) {
                case "three":
                case "seven":
                case "eight":
                    return true;
                default:
                    return false;
            }
        default:
            return false;
        }
    }
}

New finding: CharSwitchChecker: the Champion

This is basically the combination of an improved EmbeddedSwitchChecker and OldCurmudgeon's idea about a state machine: here we use the switch on the first character of the String (but first we check its length), and based on that we either narrowed down to one possible String or if not, we check the 2nd character too in which case the possible String can be only one (and we can decide by calling String.equals()):

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

Test Results

Here are the test results:

                         TIME        HOW FAST (compared to HashSetChecker)
-----------------------------------------------------------------------------
HashSetChecker:          929 ms       1.00x
HashSetChecker2:         892 ms       1.04x
HashMapChecker:          873 ms       1.06x
TreeSetChecker:         2265 ms       0.41x
OrChecker:              1815 ms       0.51x
RefOrChecker:           1708 ms       0.54x
SwitchChecker:           538 ms       1.73x
EmbeddedSwitchChecker:   467 ms       1.99x
CharSwitchChecker:       436 ms       2.13x

The SwitchChecker solution is about 1.7 times faster, the EmbeddedSwitchChecker is 2 times faster and the champion CharSwitchChecker is about 2.13 times faster than the HashSetChecker implementation. As expected the HashSet with big initial capacity and the HashMap solutions are slightly faster, and all other solutions fall behind.

Complete Runnable Test Program (+All Listed Solutions)

The Complete Runnalbe Test Program plus All listed solutions are here in one box for those who want to try it or experiment with new implementations.

Edit: Following Luiggi Mendoza's suggestion for rules for performing a micro benchmark, I changed the main() method for testing. I execute the whole test twice, and I only analyze the 2nd result. Also since the tests don't create new objects in the loop, I see no reason to call System.gc() whatsoever.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

    boolean contains(String s);
}

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    {
        set.addAll(Arrays.asList(VALUES));
    }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }

    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s) || "three".equals(s)
                || "four".equals(s) || "five".equals(s) || "six".equals(s) || "seven".equals(s)
                || "eight".equals(s) || "nine".equals(s);
    }
}

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s || "four" == s || "five" == s
                || "six" == s || "seven" == s || "eight" == s || "nine" == s || super.contains(s);
    }
}

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
        case "zero":
        case "one":
        case "two":
        case "three":
        case "four":
        case "five":
        case "six":
        case "seven":
        case "eight":
        case "nine":
            return true;
        default:
            return false;
        }
    }
}

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
            case "one":
            case "two":
            case "six":
                return true;
            default:
                return false;
            }
        case 4:
            switch (s) {
            case "zero":
            case "four":
            case "five":
            case "nine":
                return true;
            default:
                return false;
            }
        case 5:
            switch (s) {
            case "three":
            case "seven":
            case "eight":
                return true;
            default:
                return false;
            }
        default:
            return false;
        }
    }
}

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"), new String("three"),
            new String("four"), new String("five"), new String("six"), new String("seven"),
            new String("eight"), new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"), new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"), new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(), (end - start) / 1_000_000);
    }

    public static void main(String args[]) {
        for (int i = 1; i <= 2; i++) {
            System.out.println("---- Check #" + i);
            test(new HashSetChecker());
            test(new HashSetChecker2());
            test(new HashMapChecker());
            test(new TreeSetChecker());
            test(new OrChecker());
            test(new RefOrChecker());
            test(new SwitchChecker());
            test(new EmbeddedSwitchChecker());
            test(new CharSwitchChecker());
        }
    }

}
like image 137
icza Avatar answered Mar 20 '23 20:03

icza