Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mapping from String to integer - performance of various approaches

Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:

Hello -> 0
World -> 1
Foo   -> 2
Bar   -> 3
Spam  -> 4
Eggs  -> 5
etc.

There are at least two straightforward ways to do it. With a hashmap:

HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.

Or with a list:

List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.

Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just few elements (there must be some overhead in calculating the hash, right?).

Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?

like image 408
Joonas Pulakka Avatar asked Oct 21 '10 09:10

Joonas Pulakka


3 Answers

A third option and possibly my favorite would be to use a trie:

                    

I bet it beats the HashMap in performance (no collisions + the fact that computing the hash-code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as the indexOf would waste lot of time in the equals methods).

When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:

  • Readability

    The Map interface simply provides a more intuitive interface for this use case.

  • Optimization in the right place

    I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottle neck is.

A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.

A fifth option is to encapsulate the decision in a separate class all together. In this case you could even implement it to change strategy in runtime as the list grows.

like image 68
aioobe Avatar answered Nov 06 '22 23:11

aioobe


You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough so that the time to calculate the hash didn't swamp the List linear search.

I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.

like image 4
duffymo Avatar answered Nov 06 '22 23:11

duffymo


Your question is totally correct on all points:

  • HashMaps are better (they use a hash)
  • Benchmarking Java code is hard

But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases but the benchmarking will give you the answer if it is or not.

One more option, a TreeMap is another map data structure which uses a tree as opposed to a hash to access the entries. If you are doing benchmarking, you might as well benchmark that as well.

Regarding benchmarking, one of the main problems is the garbage collector. However if you do a test which doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then just write a loop to get N random elements, and then time it, that should be reasonably reproducable and therefore informative.

like image 4
Adrian Smith Avatar answered Nov 06 '22 23:11

Adrian Smith