These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else

 Page 6/24 Date 03.05.2017 Size 186.06 Kb. #18863

Hashing

• Insert
• i = h(x);
• T[i] = x;
• Find
• i = h(x);
• if (T[i] == x) return true;
• else return false;
• This is the simplistic idea of hashing
• Why simplistic?
• What are we ignoring here?
• Discuss

Collisions

• Simple hashing fails in the case of a collision:
• h(x1) == h(x2), where x1 != x2
• Can we avoid collisions (i.e. guarantee that they do not occur)?
• Yes, but only when size of the key space, K, is less than or equal to the table size, M
• When |K| <= M there is a technique called perfect hashing that can ensure no collisions
• It also works if N <= M, but the keys are known in advance, which in effect reduces the key space to N
• Ex: Hashing the keywords of a programming language during compilation of a program

Collisions

• When |K| > M, by the pigeonhole principle, collisions cannot be eliminated
• We have more pigeons (potential keys) than we have pigeonholes (table locations), so at least 2 pigeons must share a pigeonhole
• Unfortunately, this is usually the case
• For example, an employer using SSNs as the key
• Let M = 1000 and N = 500
• It seems like we should be able to avoid collisions, since our table will not be full
• However, |K| = 109 since we do not know what the 500 keys will be in advance (employees are hired and fired, so in fact the keys change)

Resolving Collisions

• So we must redesign our hashing operations to work despite collisions
• We call this collision resolution
• Two common approaches:
• If a collision occurs at index i in the table, try alternative index values until the collision is resolved
• Thus a key may not necessarily end up in the location that its hash function indicates
• We must choose alternative locations in a consistent, predictable way so that items can be located correctly
• Our table can store at most M keys

Resolving Collisions

• Each index i in the table represents a collection of keys
• Thus a collision at location i simply means that more than one key will be in or searched for within the collection at that location
• The number of keys that can be stored in the table depends upon the maximum size allowed for the collections

Reducing the number of collisions

• Before discussing resolution in detail
• Can we at least keep the number of collisions in check?
• Yes, with a good hash function
• The goal is to make collisions a "random" occurrence
• Collisions will occur, but due to chance, not due to similarities or patterns in the keys
• What is a good hash function?
• It should utilize the entire key (if possible) and exploit any differences between keys

Reducing the number of collisions

• Let's look at some examples
• Consider hash function for Pitt students based on phone numbers
• Bad: First 3 digits of number
• Discuss
• Better?
• See board
• Consider hash function for words

Good Hashing

• Generally speaking we should:
• Choose M to be a prime number
• Calculate our hash function as
• h(x) = f(x) mod M
• where f(x) is some function that converts x into a large "random" integer in an intelligent way
• It is not actually random, but the idea is that if keys are converted into very large integers (much bigger than the number of actual keys) collisions will occur because of pigeonhole principle, but they will be less frequent

Collision Resolution

• Back to Collision Resolution
• The simplest open addressing scheme is Linear Probing
• Idea: If a collision occurs at location i, try (in sequence) locations i+1, i+2, … (mod M) until the collision is resolved
• For Insert:
• Collision is resolved when an empty location is found
• For Find:
• Collision is resolved (found) when the item is found
• Collision is resolved (not found) when an empty location is found, or when index circles back to i
• Look at an example

Collision Resolution

• Performance
• Theta(1) for Insert, Search for normal use, subject to the issues discussed below
• In normal use at most a few probes will be required before a collision is resolved
• Linear probing issues
• What happens as table fills with keys?
• Define LOAD FACTOR,  = N/M
• How does  affect linear probing performance?
• Consider a hash table of size M that is empty, using a good hash function
• Given a random key, x, what is the probability that x will be inserted into any location i in the table?
• 1/M

Collision Resolution

• Consider now a hash table of size M that has a cluster of C consecutive locations that are filled
• Now given a random key, x, what is the probability that x will be inserted into the location immediately following the cluster?
• Discuss
• How can we "fix" this problem?
• Even AFTER a collision, we need to make all of the locations available to a key
• This way, the probability from filled locations will be redistributed throughout the empty locations in the table, rather than just being pushed down to the first empty location after the cluster
• Discuss
• (C+1)/M