Skip to main content
ICT
Lesson AB32 - Hash-Coded Data Storage
 
Main   Previous Next
 

B. Computing Hash Codes page 4 of 9
  1. These “key-to-address transformations” are called hashing functions or hashing algorithms. When the key is composed of character data, a numerical equivalent such as the ASCII code is used so that the mathematical processing can take place. To compute a hash code from a string, you need to combine the character values of the string to yield some integer. You could, for example, add up the character values:

    int h = 0;
    for ( int i = 0; i < s.length(); i++ )
      h = h + s.charAt( i );

    However, this would not be a good idea. It doesn't scramble the character values enough. Strings that are permutations of another (such as "eat" and "tea") all have the same hash code.

  2. Here is the method the standard library uses to compute the hash code for a string.

    final int HASH_MULTIPLIER = 31;
    int h = 0;
    for ( int i = 0; i < s.length(); i++ )
      h = HASH_MULTIPLIER * h + s.charAt( i );

    For example, the hash code of "eat" is

    31 * (31 * 'e' + 'a') + 't' = 100184

    The hash code of "tea" is quite different, namely

    31 * (31 * 't' + 'e') + 'a' = 114704

    (A Unicode table is used to look up the character values: 'a' is 97, 'e' is 101, and 't' is 116.)

  3. For your own classes, you should make up a hash code that combines the hash codes of the instance fields in a similar way. For example, let us define a hashCode method for a GroceryItem class. There are two instance fields: the grocery item name and the barcode value. First, compute their hash code. You know how to compute the hash code of a string. To compute the hash code of an integer, first wrap the integer number into an Integer object, and then compute the hash code.

    class GroceryItem
    {
      private String itemName;
      private int itemBarCode;
      . . .
      public int hashCode( )
      {
        int h1 = itemName.hashCode( );
        int h2 = new Integer( itemBarCode ).hashCode( );  // could just use itemBarCode
        . . .
      }
    }

    Then combine the two hash codes.

    final int HASH_MULTIPLIER = 29;
    int h = HASH_MULTIPLIER * h1 + h2;
    return h;

    Use a prime number as the hash multiplier—it scrambles the values better

    If you have more than two instance fields, then combine their hash codes as follows:

    int h = HASH_MULTIPLIER * h1 + h2;
    h = HASH_MULTIPLIER * h + h3;
    h = HASH_MULTIPLIER * h + h4;
    . . .
    return h;

    This process can be used for any primtive data type. Also note that if one of the instance fields is an integer, you could just use the field value as its hash code.

  4. When you add object of your class into a hash table, you need to double-check that the hashCode method is compatible with the equals method of you class. Two objects that are equal must yield the same hash code:

    •  if x.equals(y), then x.hashcode() == y.hashCode()
  5. It is important to develop a good hashing function that avoids collisions in the hash table. Even when using a prime number for the divisor, it is possible for two bar codes to result in the same address. To reduce the chances of such a "collision," research shows that the best size for a hash table is a prime number larger than the number of elements your table is expected to hold. An excess of approximately 30% is typical. This means that, if s is the number of slots in the table and e is the number of elements, then

    s = a prime number ≥ 4/3 * e

    For example, if the number of items in our database were 10,000 and a we chose a hash table size of 10,000 the likelihood of collisions would be increased. Using the above method, we would determine a hash table size as follows:

    s = a prime number ≥ 4/3 * 10000
    s = a prime number ≥ 13333
    s = 13337

    Try to balance the need for decreasing the number of collisions against memory limitations, hence the recommended sizing.

  6. This advance sizing of the hash table affects the mathematics of the hashing algorithm; therefore the programmer must have a very clear idea of the number of items to be stored. The number of items must be known in advance and this number must be fairly constant during the life of the program. This limits the use of hashing to certain situations. If the number of items is unknown or varies greatly, hashing is inappropriate.

 

Main   Previous Next
Contact
 © ICT 2006, All Rights Reserved.