What is string hashing?

String hashing is a way to convert a string into an integer, known as the hash of that string. An ideal hash function minimizes the chance of a collision (i.e. two different strings having the same hash). In polynomial hashing, the hash of a string s of length n is calculated as:

$$\text{hash}(s) = \left( \sum_{i=0}^{n-1} s[i] \cdot P^{i} \right) \bmod M$$

where P and M are some positive numbers.

What is the purpose of using hash functions?

Hash functions for algorithmic use usually have two goals: first, they have to be fast; second, they have to distribute the values evenly across the possible numbers. The hash function is also required to always give the same number for the same input value. If your values are strings, here are some examples of bad hash functions:

How do you use hashing in Python?

Hashing Strings with Python. A hash function is a function that takes a variable-length sequence of bytes as input and converts it to a fixed-length sequence. A cryptographic hash function is also a one-way function. This means that if f is the hashing function, calculating f(x) is fast and simple, but trying to recover x from f(x) would take years.

How to calculate the hash of a string of length n?

Calculation of the hash of a string. A good and widely used way to define the hash of a string s of length n is

$$\text{hash}(s) = \left( s[0] + s[1] \cdot p + s[2] \cdot p^2 + \dots + s[n-1] \cdot p^{n-1} \right) \bmod m = \left( \sum_{i=0}^{n-1} s[i] \cdot p^i \right) \bmod m,$$

where p and m are some chosen, positive numbers. It is called a polynomial rolling hash function. It is reasonable to make p a prime number roughly equal to the number...
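The polynomial rolling hash described above can be sketched in C roughly as follows. The function name `poly_hash` and the constants `POLY_P = 31` and `POLY_M = 1000000009` are illustrative assumptions, not values taken from the text; each character is used through its unsigned byte value.

```c
#include <stdint.h>

/* Illustrative constants (assumptions): a small prime base and a large prime modulus. */
#define POLY_P 31ULL
#define POLY_M 1000000009ULL

/* Computes (s[0] + s[1]*p + s[2]*p^2 + ... + s[n-1]*p^(n-1)) mod m. */
uint64_t poly_hash(const char *s)
{
    uint64_t hash = 0;
    uint64_t p_pow = 1;                      /* p^i mod m, starting at p^0 */

    for (; *s != '\0'; s++) {
        uint64_t value = (unsigned char)*s;  /* byte value of the character */
        hash = (hash + value * p_pow) % POLY_M;
        p_pow = (p_pow * POLY_P) % POLY_M;
    }
    return hash;
}
```

Because every intermediate product stays below 2^64 with these constants, no overflow occurs. With this sketch, "ab" and "ba" get different hashes, since each character is weighted by its position.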

hash function for string

People also ask

Can you hash a string?

The process of hashing in cryptography is to map any string of any given length, to a string with a fixed length. This smaller, fixed length string is known as a hash. To create a hash from a string, the string must be passed into a hash function.

How do I make a hash for a string?

To create a hash from a specific string, you can implement your own string-to-hash conversion function, which returns the hash of the string. Alternatively, a library named Crypto can be used to generate various types of hashes such as SHA1, MD5, SHA256 and many more.

What would the hash function return for the string?

A Hash function is a function that maps any kind of data of arbitrary size to fixed-size values. The values returned by the function are called Hash Values or digests.

How do I find the hash value of a string?

Getting the hash code of a string is simple in C#: use the GetHashCode() method. A hash code is a numeric value derived from the string. Note that strings that have the same value have the same hash code, although two different strings may also end up with the same hash code.

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    /* Stop at the terminating NUL byte. */
    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an exact multiple of the byte size either; otherwise bytes a few positions apart land on exactly the same bits of the result.

For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:

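#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 < n < 32). */
static uint32_t rol(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

uint32_t hash(char const *input) {
    uint32_t result = 0x55555555;

    while (*input) {
        result ^= (unsigned char)*input++;
        result = rol(result, 5);
    }
    return result;
}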

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: either a prime number as the size (required to ensure correctness with some types of collision resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
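As a rough illustration of the two sizing options just mentioned, here is a minimal sketch of how a hash value could be reduced to a bucket index in each case. The function names `index_by_modulo` and `index_by_mask` are hypothetical helpers introduced for this example.

```c
#include <stddef.h>

/* Works for any non-zero table size; a prime size is the usual choice here. */
size_t index_by_modulo(unsigned long hash, size_t table_size)
{
    return (size_t)(hash % table_size);
}

/* Only valid when table_size is a power of two (e.g. 16384):
 * table_size - 1 is then a mask of the low bits. */
size_t index_by_mask(unsigned long hash, size_t table_size)
{
    return (size_t)hash & (table_size - 1);
}
```

For a power-of-two size both functions return the same index, but the mask avoids a division. The mask only looks at the low bits, so it relies on the hash function mixing those bits well.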

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

#include <stddef.h>
#include <stdint.h>

uint32_t jenkins_one_at_a_time_hash(const char *key, size_t len)
{
    uint32_t hash = 0;
    size_t i;
    for (i = 0; i < len; ++i)
    {
        /* Cast so that bytes >= 0x80 are not sign-extended. */
        hash += (unsigned char)key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

There are a number of existing hashtable implementations for C, from the POSIX hcreate/hdestroy/hsearch functions (declared in <search.h>), to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
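For a feel of what the hcreate/hsearch/hdestroy interface looks like, here is a minimal sketch, assuming a POSIX system that provides `<search.h>`. The function name `hsearch_demo`, the keys, and the values are made up for illustration.

```c
#include <search.h>
#include <stddef.h>

/* Inserts three hypothetical key/value pairs, then looks up "beta".
 * Returns the stored value, or -1 on any failure. */
long hsearch_demo(void)
{
    static char *keys[] = { "alpha", "beta", "gamma" };
    static long values[] = { 1, 2, 3 };
    long result = -1;
    size_t i;

    if (hcreate(16) == 0)            /* room for at least 16 entries */
        return -1;

    for (i = 0; i < 3; i++) {
        ENTRY item;
        item.key = keys[i];
        item.data = &values[i];
        if (hsearch(item, ENTER) == NULL) {
            hdestroy();
            return -1;
        }
    }

    ENTRY query;
    query.key = keys[1];             /* "beta" */
    query.data = NULL;
    ENTRY *found = hsearch(query, FIND);
    if (found != NULL)
        result = *(long *)found->data;

    hdestroy();
    return result;
}
```

Note that this interface manages a single global table per process and stores only pointers to keys and data, so the strings and values must outlive the table.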

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.

djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if available; it is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend using Murmur's one-byte-at-a-time version:

static inline uint32_t MurmurOAAT32(const char *key)
{
  uint32_t h = 3323198485u;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
  uint64_t h = 525201411107845655ull;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e9955bd1e995ull;
    h ^= h >> 47;
  }
  return h;
}
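The figure of about 25 expected collisions quoted above can be checked with a birthday-style estimate. This is a back-of-the-envelope sketch that assumes the n = 466,000 hash values are independent and uniformly distributed over 2^b possibilities; the symbols n and b are introduced here only for illustration.

```latex
\[
E[\text{colliding pairs}] \approx \binom{n}{2} \cdot \frac{1}{2^{b}}
  = \frac{n(n-1)}{2 \cdot 2^{b}}
  \approx \frac{466000^{2}}{2 \cdot 2^{32}}
  \approx \frac{2.17 \times 10^{11}}{8.59 \times 10^{9}}
  \approx 25.3
  \qquad (b = 32)
\]
```

The same estimate with b = 64 gives roughly 6 x 10^-9 expected colliding pairs, which is consistent with the 64-bit hash showing no collisions on this dictionary.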

The optimal size of a hash table is - in short - as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster but at rapidly diminishing returns, making your hash table smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.
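One possible way to turn the "roughly 2x the expected number of elements" rule into code is sketched below. Rounding up to a power of two is an added assumption (it allows index reduction with a bit-mask), and the function name `table_capacity` is hypothetical.

```c
#include <stddef.h>

/* Returns the smallest power of two that is >= 2 * expected_elements.
 * Assumes expected_elements is small enough that doubling does not overflow. */
size_t table_capacity(size_t expected_elements)
{
    size_t target = expected_elements * 2;
    size_t capacity = 1;

    while (capacity < target)
        capacity <<= 1;
    return capacity;
}
```

For example, 1000 expected elements gives a target of 2000 slots, which rounds up to 2048. The resulting load factor is then at most 0.5, a little below the 0.58 figure mentioned above.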

`djb2` is good

Though djb2, as presented on Stack Overflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

One of the K&R hashes is terrible, one is probably pretty good:

Apparently a terrible hash algorithm, as presented in K&R 1st edition:

unsigned long hash(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    /* Sum the character codes; character order does not matter. */
    while ((c = *str++) != 0)
        hash += c;

    return hash;
}

Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long instead of the simple unsigned (int).
```
unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval % HASHSIZE;
}
```

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.
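The difference described above can be checked with a small sketch. The two functions below are written in the style of the two K&R hashes rather than copied from the book: the names `additive_hash` and `multiplicative_hash` are made up for this example, and the `% HASHSIZE` step is left out so the raw values can be compared.

```c
/* Additive hash in the style of the K&R 1st edition version. */
unsigned long additive_hash(const char *s)
{
    unsigned long h = 0;

    while (*s != '\0')
        h += (unsigned char)*s++;
    return h;
}

/* Multiplicative hash in the style of the K&R 2nd edition version,
 * without the final % HASHSIZE. */
unsigned long multiplicative_hash(const char *s)
{
    unsigned long h = 0;

    for (; *s != '\0'; s++)
        h = (unsigned char)*s + 31 * h;
    return h;
}
```

With ASCII input, `additive_hash` gives 195 for both "ab" and "ba", while `multiplicative_hash` gives 3105 for "ab" and 3135 for "ba".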

The GCC C++11 hashing function used by the `std::unordered_map<>` template container hash table is excellent.

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
The implementations are in the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc). Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

MurmurHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 `std::unordered_map<>` hash used above.

Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
