Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java hashmap vs hashset performance

I have a file consisting of 7.6M lines. Each line is of the form: A,B,C,D where B,C,D are values that are used to calculate a level of importance for A which is a String identifier that is unique for each line. My approach:

private void read(String filename) throws Throwable {
        BufferedReader br  = new BufferedReader(new FileReader(filename));

        Map<String, Double> mmap = new HashMap<>(10000000,0.8f);
        String line;
        long t0 = System.currentTimeMillis();
        while ((line = br.readLine()) != null) {
            split(line);
            mmap.put(splitted[0], 0.0);
        }
        long t1 = System.currentTimeMillis();
        br.close();
        System.out.println("Completed in " + (t1 - t0)/1000.0 + " seconds");
}

private void split(String line) {
    int idxComma, idxToken = 0, fromIndex = 0;
    while ((idxComma = line.indexOf(delimiter, fromIndex)) != -1) {
        splitted[idxToken++] = line.substring(fromIndex, idxComma);
        fromIndex = idxComma + 1;
    }
    splitted[idxToken] = line.substring(fromIndex);
}

where the dummy value 0.0 is inserted for "profiling" purposes and splitted is a simple String array defined for the class. I initially worked with String's split() method, but found the above to be be faster.

When I run the above code, it takes 12 seconds to parse the file which is waaaay more than I think it should take. If I, e.g., replace the HashMap with a Vector of strings and just take the first entry from each line (i.e. I do not put an associated value with it as this should be amortized constant), the entire file can be read in less than 3 seconds.

This suggests to me that (i) there are a lot of collisions in the HashMap (I have tried to minimise the number of resizes by preallocating the size and setting the load factor accordingly) or (ii) the hashCode() function is somehow slow. I doubt its (ii) because if I use a HashSet the files can be read in under 4 seconds.

My question is: What could be the reason that the HashMap performs so slowly? Is the hashCode() insufficient for maps of this size, or is there something fundamentally that I have overlooked?

like image 277
user1938803 Avatar asked Mar 01 '17 11:03

user1938803


2 Answers

HashMap vs Vector: Inserting in HashMap is way more costlier than inserting in Vector. Although both are amortized constant time operations, but the HashMap performs a number of other operations internally (like generating hashCode, checking collissions, resolving collissions, etc), whereas the Vector just inserts the element at the end (increasing the size of the structure, if required).

HashMap vs HashSet: HashSet internally uses HashMap. So, there shouldn't be any performance difference whatsoever if you use them for the same purpose. Ideally, both of these have different purposes, so the discussion regarding which is better is useless.

Since, you need B,C,D as value for A as key, you should definitely stick to HashMap. If you really want to just compare the performance, put "null" instead of 0.0 as the value for all the keys (because that is what HashSet uses while putting the keys in its backed HashMap).

Update: HashSet uses a dummy constant value (static final) to insert in the HashMap, and not null. Sorry about that. You can replace your 0.0 with any constant and the performance should be similar to HashSet.

like image 200
iavanish Avatar answered Sep 29 '22 06:09

iavanish


You could use a more memory-efficient Collections library.

I suggest the Eclipse Collections ( https://www.eclipse.org/collections/ ), that has a ObjectDoubleMap ( https://www.eclipse.org/collections/javadoc/8.0.0/org/eclipse/collections/api/map/primitive/ObjectDoubleMap.html ), which is a map of object (String in your case) that has a double (yes, primitive double) as associated value. It is much better in handling memory and in performance.

You can get an empty instance of this by doing:

ObjectDoubleMaps.mutable.empty();
like image 45
Boschi Avatar answered Sep 29 '22 06:09

Boschi