 

Performance of hash tables: why is C++ the slowest?

I ran a simple performance test on hash tables, and it appears the C++ version is slower than both the Perl version and the Go version.

  • Perl version took about 200 milliseconds.
  • C++ version took 280 milliseconds.
  • Go version took 56 milliseconds.

This is on my PC with a Core(TM) i7-2670QM CPU @ 2.20GHz, running Ubuntu 14.04.3 LTS.

Any ideas?

Perl version

use strict;
use warnings;
use Time::HiRes qw( gettimeofday );

# Return the current time as floating-point seconds.
sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + $microseconds / 1000000.0;
}

my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrid";

my $x = "";
my $start = getTS();
for (my $i = 0; $i < 1000000; $i++) {
    $x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;

C++ version

#include <cstdio>
#include <string>
#include <unordered_map>
#include <sys/time.h>

// Return the current time as floating-point seconds.
double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

using namespace std;

int main() {
    unordered_map<string, string> mymap;

    // populate the container
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrid";

    double start = getTS();
    for (int i = 0; i < 1000000; i++) {
        mymap["China"];   // result discarded (see Update 1)
    }
    printf("took %f sec\n", getTS() - start);
    return 0;
}

Go version

package main

import (
    "fmt"
    "time"
)

func main() {
    var x string
    mymap := make(map[string]string)
    mymap["U.S."] = "Washington"
    mymap["U.K."] = "London"
    mymap["France"] = "Paris"
    mymap["Russia"] = "Moscow"
    mymap["China"] = "Beijing"
    mymap["Germany"] = "Berlin"
    mymap["Japan"] = "Tokyo"
    mymap["China"] = "Beijing"
    mymap["Italy"] = "Rome"
    mymap["Spain"] = "Madrid"
    t0 := time.Now()
    // Loop 1,000,000 times to match the other versions
    // (counting from 1 would give only 999,999 iterations).
    for i := 0; i < 1000000; i++ {
        x = mymap["China"]
    }
    t1 := time.Now()
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
    fmt.Println(x)
}

Update 1

To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but there is very little difference in performance.

Update 2

I got the original result when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With -O2 optimization it takes only about half the time (~150 ms).
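
For reference, the optimized build is the same command with the flag added:

$ g++ -std=c++11 -O2 unorderedMap.cc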

Update 3

To remove a possible char*-to-string constructor call, I created a string constant. The time came down to about 220 ms (compiled without optimization). Thanks to the suggestion from @neil-kirk: with optimization (the -O2 flag), the time is about 80 ms.

  double start = getTS();
  string x = "China";
  for (int i=0; i<1000000; i++) {
      mymap[x];
  }
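
As a further variation (not in the original post), a lookup-only loop is often written with find(), since operator[] will insert a default-constructed value when the key is missing. A minimal sketch, reusing the map above:

  // Pure lookup with a pre-built key: no string construction in the
  // loop, and find() never inserts into the map.
  const string key = "China";
  string x;
  for (int i = 0; i < 1000000; i++) {
      auto it = mymap.find(key);
      if (it != mymap.end())
          x = it->second;
  }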

Update 4

Thanks to @steffen-ullrich, who pointed out a syntax error in the Perl version. I changed it; the performance number is now about 150 ms.

Update 5

It appears the number of executed instructions matters. Profiling each version with the command valgrind --tool=cachegrind <cmd> gives:

For the Go version

$ valgrind --tool=cachegrind  ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103== 
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103== 
==2103== I   refs:      255,763,381
==2103== I1  misses:          3,709
==2103== LLi misses:          2,743
==2103== I1  miss rate:        0.00%
==2103== LLi miss rate:        0.00%
==2103== 
==2103== D   refs:      109,437,132  (77,838,331 rd   + 31,598,801 wr)
==2103== D1  misses:        352,474  (   254,714 rd   +     97,760 wr)
==2103== LLd misses:        149,260  (    96,250 rd   +     53,010 wr)
==2103== D1  miss rate:         0.3% (       0.3%     +        0.3%  )
==2103== LLd miss rate:         0.1% (       0.1%     +        0.1%  )
==2103== 
==2103== LL refs:           356,183  (   258,423 rd   +     97,760 wr)
==2103== LL misses:         152,003  (    98,993 rd   +     53,010 wr)
==2103== LL miss rate:          0.0% (       0.0%     +        0.1%  )

For the C++ version (no optimization flag)

$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180== 
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180== 
==2180== I   refs:      5,281,474,482
==2180== I1  misses:            1,710
==2180== LLi misses:            1,651
==2180== I1  miss rate:          0.00%
==2180== LLi miss rate:          0.00%
==2180== 
==2180== D   refs:      3,170,495,683  (1,840,363,429 rd   + 1,330,132,254 wr)
==2180== D1  misses:           12,055  (       10,374 rd   +         1,681 wr)
==2180== LLd misses:            7,383  (        6,132 rd   +         1,251 wr)
==2180== D1  miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== 
==2180== LL refs:              13,765  (       12,084 rd   +         1,681 wr)
==2180== LL misses:             9,034  (        7,783 rd   +         1,251 wr)
==2180== LL miss rate:            0.0% (          0.0%     +           0.0%  )

For the optimized C++ version (-O2)

$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157== 
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157== 
==2157== I   refs:      1,451,459,660
==2157== I1  misses:            1,599
==2157== LLi misses:            1,549
==2157== I1  miss rate:          0.00%
==2157== LLi miss rate:          0.00%
==2157== 
==2157== D   refs:        430,486,197  (340,358,108 rd   + 90,128,089 wr)
==2157== D1  misses:           12,008  (     10,337 rd   +      1,671 wr)
==2157== LLd misses:            7,372  (      6,120 rd   +      1,252 wr)
==2157== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== 
==2157== LL refs:              13,607  (     11,936 rd   +      1,671 wr)
==2157== LL misses:             8,921  (      7,669 rd   +      1,252 wr)
==2157== LL miss rate:            0.0% (        0.0%     +        0.0%  )
asked Nov 27 '15 by packetie


1 Answer

From your Perl code (before your attempts to fix it):

@mymap = ();
$mymap["U.S."] = "Washington";
$mymap["U.K."] = "London";

This is not a map but an array. The syntax for hash maps is:

  my %mymap;
  $mymap{"U.S."} = ....

Thus what you effectively did was create an array, not a hash map, and access element 0 all the time. Please use "use strict;" and "use warnings;" all the time with Perl; even a simple syntax check with warnings would have shown you the problem:

perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.

Apart from that, the main part of the benchmark effectively does nothing useful (it assigns a variable and never uses it), and some compilers can detect this and simply optimize it away.
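
One way to guard against this (a C++ sketch, not from the original post) is to make the loop's result observable, for example by accumulating a checksum and printing it:

// Keep the result observable so the optimizer cannot discard the
// lookups as dead code.
size_t checksum = 0;
for (int i = 0; i < 1000000; i++) {
    checksum += mymap["China"].size();   // uses each lookup's result
}
printf("checksum: %zu\n", checksum);     // observable side effect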

If you check the code generated from your Perl program, you will see:

$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
    $x = $mymap[0];
}

That is, it detects the problem at compile time and replaces each lookup with an access to array index 0.

Thus, whenever you do benchmarks, you need to:

  • Check that your program actually does what it is supposed to do.
  • Check that the compiler does not optimize things away, and does not do at compile time what other languages do at run time. Any statement with no result, or a predictable result, is prone to such optimization.
  • Verify that you actually measure what you intend to measure and not something else. Even small changes to the program can affect the run time, because a memory allocation may be needed where none was before, etc., and such changes might not be related to what you intend to measure. In your benchmark you measure access to the same hash element again and again, without any access to other elements in between. This is an activity that plays very nicely with processor caches but probably does not reflect real-world use; see the sketch after this list.
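
For the last point, a hypothetical variation that rotates through several of the keys from the original map, so that successive lookups do not all hit the same element:

// Cycle over several keys so successive lookups touch different
// buckets, which is closer to real-world access patterns.
const string keys[] = {"China", "France", "Japan", "Italy"};
string x;
for (int i = 0; i < 1000000; i++) {
    x = mymap[keys[i % 4]];
}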

And using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is cache thrashing... and with today's CPUs the result depends heavily on the load on the system, because the CPU may run one benchmark in a lower power mode than the others, i.e. with a different CPU clock. For example, multiple runs of the same "benchmark" vary in measured time between 100 ms and 150 ms on my system.
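
A common mitigation (again a sketch, not from the answer) is to repeat the whole measurement several times and report the fastest run, which filters out much of the scheduler and power-state noise:

// Run the timed loop several times and keep the fastest run,
// reusing getTS() from the question's C++ code.
double best = 1e9;
for (int run = 0; run < 10; run++) {
    double start = getTS();
    for (int i = 0; i < 1000000; i++)
        mymap["China"];
    double elapsed = getTS() - start;
    if (elapsed < best) best = elapsed;
}
printf("best of 10 runs: %f sec\n", best);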

Benchmarks are lies and micro benchmarks like yours doubly so.

answered Oct 15 '22 by Steffen Ullrich