Hash table implementation

Tags:

hashtable

I just bought a book "C Interfaces and Implementations". in chapter one , it has implemented a "Atom" structure, sample code as follow:

#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))
static struct atom {
    struct atom *link;
    int len;
    char *str;
} *buckets[2048];
static unsigned long scatter[] = {
2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,
1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,
813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,
1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,
980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,
471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,
1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,
744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,
1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,
704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,
269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,
1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,
907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,
259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,
348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,
1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,
871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,
109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,
706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,
958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,
1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,
299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,
1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,
618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,
2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,
1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,
316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,
1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,
946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,
2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,
523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,
2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,
1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,
1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,
281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,
1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,
275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,
2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,
48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,
365326415, 790369079, 264348929, 513183458, 536647531, 13672163,
313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,
2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,
1884137923, 53392249, 1735424165, 1602280572
};
const char *Atom_new(const char *str, int len) {
    unsigned long h;
    int i;
    struct atom *p;
    assert(str);
    assert(len >= 0);
    for (h = 0, i = 0; i < len; i++)
        h = (h<<1) + scatter[(unsigned char)str[i]];
    h &= NELEMS(buckets)-1;
    for (p = buckets[h]; p; p = p->link)
        if (len == p->len) {
            for (i = 0; i < len && p->str[i] == str[i]; )
                i++;
            if (i == len)
                return p->str;
        }
    p = ALLOC(sizeof (*p) + len + 1);
    p->len = len;
    p->str = (char *)(p + 1);
    if (len > 0)
        memcpy(p->str, str, len);
    p->str[len] = '\0';
    p->link = buckets[h];
    buckets[h] = p;//insert atom in front of list
    return p->str;
}

at end of chapter , in exercises 3.1, the book's author said "Most texts recommend using a prime number for the size of buckets. Using a prime and a good hash function usually gives a better distribution of the lengths of the lists hanging off of buckets. Atom uses a power of two, which is sometimes explicitly cited as a bad choice. Write a program to generate or read, say, 10,000 typical strings and measure Atom_new’s speed and the distribution of the lengths of the lists. Then change buckets so that it has 2,039 entries (the largest prime less than 2,048), and repeat the measurements. Does using a prime help? How much does your conclusion depend on your specific machine?"

so I did changed that hash table size to 2039,but it seems a prime number actually made a bad distribution of the lengths of the lists, I have tried 64, 61, 61 actually made a bad distribution too.

I am just want to know why a prime table size make a bad distribution, is this because the hash function used with Atom_new a bad hash function?

I am using this function to print out the lengths of the atom lists

#define B_SIZE 2048
void Atom_print(void)
{
    int i,t;
    struct atom *atom;
    for(i= 0;i<B_SIZE;i++) {
        t = 0;
        for(atom=buckets[i];atom;atom=atom->link) {
            ++t;
        }
        printf("%d ",t);
    }
}

855

asked Jun 15 '11 22:06

anru

2 Answers

Well, along time ago I had to implement a hash table (in driver development), and I about the same. Why the heck should I use a prime number? OTOH power of 2 is even better - instead of calculating the modulus in case of power of 2 you may use bitwise AND.

So I've implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Then, eventually I noticed that in my hash table only 1/4 of all the entries is filled. Because that hash function I used was identity function, and just in case it turned out that all the returned pointers are multiples of 4.

The idea of using the prime numbers for the hash table size is the following: real-world hash functions do not produce equally-distributed values. Usually there's (or at least there may be) some dependency. So, in order to diffuse this distribution it's recommended to use prime numbers.

BTW, theoretically there may happen that occasionally the hash function will produce the numbers that are multiples of your chosen prime number. But the probability of this is lower than if it was not a prime number.

answered Sep 19 '22 18:09

valdo

I think it's the code to select the bucket. In the code you pasted it says:

h &= NELEMS(buckets)-1;

That works fine for sizes which are powers of two, since its final effect is choosing the lower bits of h. For other sizes, NELEMS(buckets)-1 will have bits in 0 and the bit-wise & operator will discard those bits, effectively leaving "holes" in the bucket list.

The general formula for bucket selection is:

h = h % NELEMS(buckets);

answered Sep 17 '22 18:09

Gustavo Giráldez

Related questions
                            
                                What's the difference between -lcurses and -lncurses when compiling C using ncurses lib?
                            
                                Time into string with HH:MM:SS format (C-programming)
                            
                                How to call popen() with a pathname containing spaces under Windows in C/C++?
                            
                                Debugging unit test in C using check
                            
                                Can a thread be pre-empted in the midst of a system call to kernel?
                            
                                How to profile memory usage of a C program
                            
                                Different math rounding behaviour between Linux, Mac OS X and Windows
                            
                                What is the Effect of Declaring 'extern "C"' in the Header to a C++ Shared Library?
                            
                                Antialiasing algorithm?
                            
                                Linux, timerfd accuracy
                            
                                Read from excel file in C
                            
                                Name for a bitmask with 1-bit set [closed]
                            
                                Performance of fopen() on win32
                            
                                how does do{} while(0) work in macro?
                            
                                Malloc, free and segmentation fault
                            
                                sysinfo system call returns wrong load average values on linux
                            
                                How to generate Unique Order Id (just to show touser) with actual Order Id?
                            
                                how to pass on value to an integer pointer
                            
                                Allocating A Large (5000+) Array
                            
                                MAC Address printing

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With