In a hash table, data is stored in an array format, where each data value has its own unique index value. Access of data becomes very fast if we know the index of the desired data. Thus, it becomes a data structure in which insertion and search operations are very fast irrespective of the size of the data.
There are four main characteristics of a good hash function: 1) The hash value is fully determined by the data being hashed. 2) The hash function uses all the input data. 3) The hash function "uniformly" distributes the data across the entire set of possible hash values.
A hash table, also known as a hash map, is a data structure that maps keys to values. It is one part of a technique called hashing, the other of which is a hash function. A hash function is an algorithm that produces an index of where a value can be found or stored in the hash table.
A hash table is a data structure that you can use to store data in key-value format with direct access to its items in constant time. Hash tables are said to be associative, which means that for each key, data occurs at most once.
The answers so far have helped to define hash tables and explain some theory, but I think an example may help you get a better feeling for them.
What is the difference between a hash table and just a normal array?
A hash table and an array are both structures that allow you to store and retrieve data. Both allow you to specify an index and retrieve a value associated with it. The difference, as Daniel Spiewak noted, is that the indices of an array are sequential, while those of a hash table are based on the value of the data associated with them.
Why would I use a hash table?
A hash table can provide a very efficient way to search for items in large amounts of data, particularly data that is not otherwise easily searchable. ("Large" here means ginormous, in the sense that it would take a long time to perform a sequential search).
If I were to code a hash how would I even begin?
No problem. The simplest way is to invent an arbitrary mathematical operation that you can perform on the data, that returns a number N
(usually an integer). Then use that number as the index into an array of "buckets" and store your data in bucket #N
. The trick is in selecting an operation that tends to place values in different buckets in a way that makes it easy for your to find them later.
Example: A large mall keeps a database of its patrons' cars and parking locations, to help shoppers remember where they parked. The database stores make
, color
, license plate
, and parking location
. On leaving the store a shopper finds his car by entering the its make and color. The database returns a (relatively short) list of license plates and parking spaces. A quick scan locates the shopper's car.
You could implement this with an SQL query:
SELECT license, location FROM cars WHERE make="$(make)" AND color="$(color)"
If the data were stored in an array, which is essentially just a list, you can imagine implementing the query by scanning an array for all matching entries.
On the other hand, imagine a hash rule:
Add the ASCII character codes of all the letters in the make and color, divide by 100, and use the remainder as the hash value.
This rule will convert each item to a number between 0 and 99, essentially sorting the data into 100 buckets. Each time a customer needs to locate a car, you can hash the make and color to find the one bucket out of 100 that contains the information. You've immediately reduced the search by a factor of 100!
Now scale the example to huge amounts of data, say a database with millions of entries that is searched based on tens of criteria. A "good" hash function will distribute the data into buckets in a way that minimizes any additional searching, saving a significant amount of time.
First, you have to understand a what a hash function is. A hash function is a function that takes a key (for example, an string of arbritrary length) and returns a number as unique as possible. The same key must always return the same hash. A really simple string hashing function in java might look like
public int stringHash(String s) {
int h = s.length();
for(char c : s.toCharArray()) {
h ^= c;
}
return h;
}
You can study a good hash function at http://www.azillionmonkeys.com/qed/hash.html
Now, the hash map uses this hash value to place the value into an array. Simplistic java method:
public void put(String key, Object val) {
int hash = stringHash(s) % array.length;
if(array[hash] == null) {
array[hash] = new LinkedList<Entry<String, Object> >();
}
for(Entry e : array[hash]) {
if(e.key.equals(key)){
e.value = val;
return;
}
}
array[hash].add(new Entry<String, Object>(key, val));
}
(This map enforces unique keys. Not all maps do.)
It is possible for two different keys to hash to the same value, or two different hashes to map to the same array index. There exists many techniques for dealing with this. The simplest is to use a linked list (or binary tree) for each array index. If the hash function is good enough, you will never need a linear search.
Now to look up a key:
public Object get(String key) {
int hash = stringHash(key) % array.length;
if(array[hash] != null) {
for(Entry e : array[hash]) {
if(e.key.equals(key))
return e.value;
}
}
return null;
}
Hashtables are associative. This is a huge difference from arrays, which are just linear data structures. With an array, you might do something like this:
int[] arr = ...
for (int i = 0; i < arr.length; i++) {
System.out.println(arr[i] + 1);
}
Notice how you are getting an element out of the array by specifying an exact memory offset (i
). This contrasts with hashtables, which allow you to store key/value pairs, later retrieving the value based on the key:
Hashtable<String, Integer> table = new Hashtable<String, Integer>();
table.put("Daniel", 20);
table.put("Chris", 18);
table.put("Joseph", 16);
With the above table, we can make the following call:
int n = table.get("Chris");
...and be assured that n
will be valued at 18
.
I think this will probably answer most of your questions. The implementation of a hashtable is a fairly interesting topic, one which Wikipedia addresses passably well.
"I'm more interested in the way Hash Tables look up the key and how the key is generated."
Hashing transforms a key object to a number. This is called "hashing" -- it makes a hash out of the object. See Hash Function. Summing the bytes of a string, for example, is a standard hash technique. You compute the sum modulo 232 to keep the hash to a manageable size. Hash always gives the same answer. This is O(1).
The number gives you a "slot" in the HashTable. Given an arbitrary key object, the hash value computes a hash value. The hash value then gives you the slot in table. Usually mod( hash, table size )
. This is O(1), also.
That's the general solution. Two numeric calculations and you've gone from arbitrary object as key to arbitrary object as value. Few things can be as fast.
The transformation from object to hash value happens in one of these common ways.
If it's a "primitive" object of 4 bytes, then the object's native value is a number.
The object's address is 4 bytes, then the object's address can be used as a hash value.
A simple hash function (MD5, SHA1, whatever) accumulates the bytes of the object to create a 4-byte number. The advanced hashes aren't simple sums of bytes, a simple sum doesn't reflect all the original input bits fairly enough.
The slot in the hash table is mod( number, size of table ).
If that slot has the desired value, you're done. If that's not the desired value, you need to look somewhere else. There are several popular probing algorithms to look for a free spot in the table. Linear is a simple search for the next free spot. Quadratic is a non-linear hopping around looking for a free slot. A random number generator (with a fixed seed) can be used to generate a series of probes that will spread data evenly but arbitrarily.
The probing algorithms are not O(1). If the table's big enough, the odds of collision are low, and probes don't matter. If the table's too small, then collisions happen and probing happens. At that point, it becomes a matter of "tuning and tweaking" to balance probing and table size to optimize performance. Usually we just make the table bigger.
See Hash Table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With