Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.

It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.

I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.

I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.

From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.

Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.

I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.

Thanks.

like image 689
San Jacinto Avatar asked Dec 07 '09 18:12

San Jacinto


People also ask

Is hashing a string constant time?

A hash function doesn't have to (and can't) return a unique value for every string. You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.

What is the time complexity of hashing a string?

One of the most common applications of hashing strings is to compare it. Comparing strings of length N takes O(N) time complexity but comparing integers take O(1) time complexity. Hence, comparing hash of strings take O(1) time complexity.

What will be the time complexity to find out the hash value for all the strings?

The complexity of a hashing function is never O(1). If the length of the string is n then the complexity is surely O(n). However, if you compute all hashes in a given array, you won't have to calculate for the second time and you can always compare two strings in O(1) time by comparing the precalculated hashes.

Can strings be hashed?

The process of hashing in cryptography is to map any string of any given length, to a string with a fixed length. This smaller, fixed length string is known as a hash. To create a hash from a string, the string must be passed into a hash function.


2 Answers

A hash function doesn't have to (and can't) return a unique value for every string.

You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.

You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.

like image 175
Mark Byers Avatar answered Oct 01 '22 14:10

Mark Byers


In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).

Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...

like image 25
Ron Warholic Avatar answered Oct 01 '22 13:10

Ron Warholic