Page 6/24 Date 03.05.2017 Size 186.06 Kb. #18863

Hashing Insert Find i = h(x); if (T[i] == x) return true ; else return false; This is the simplistic idea of hashing Why simplistic? What are we ignoring here? Simple hashing fails in the case of a collision: h(x1) == h(x2), where x1 != x2 Can we avoid collisions (i.e. guarantee that they do not occur)? Yes, but only when size of the key space, K , is less than or equal to the table size, M When |K| <= M there is a technique called perfect hashing that can ensure no collisions It also works if N <= M, but the keys are known in advance, which in effect reduces the key space to N Ex: Hashing the keywords of a programming language during compilation of a program Collisions When |K| > M, by the pigeonhole principle , collisions cannot be eliminated We have more pigeons (potential keys) than we have pigeonholes (table locations), so at least 2 pigeons must share a pigeonhole Unfortunately, this is usually the case For example, an employer using SSNs as the key Let M = 1000 and N = 500 It seems like we should be able to avoid collisions, since our table will not be full However, |K| = 109 since we do not know what the 500 keys will be in advance (employees are hired and fired, so in fact the keys change) Resolving Collisions So we must redesign our hashing operations to work despite collisions We call this collision resolution Two common approaches: Open addressing If a collision occurs at index i in the table, try alternative index values until the collision is resolved Thus a key may not necessarily end up in the location that its hash function indicates We must choose alternative locations in a consistent, predictable way so that items can be located correctly Our table can store at most M keys Resolving Collisions Closed addressing Each index i in the table represents a collection of keys Thus a collision at location i simply means that more than one key will be in or searched for within the collection at that location The number of keys that can be stored in the table depends upon the maximum size allowed for the collections Reducing the number of collisions Before discussing resolution in detail Can we at least keep the number of collisions in check? Yes, with a good hash function The goal is to make collisions a "random" occurrence Collisions will occur , but due to chance, not due to similarities or patterns in the keys What is a good hash function? It should utilize the entire key (if possible) and exploit any differences between keys Reducing the number of collisions Let's look at some examples Consider hash function for Pitt students based on phone numbers Bad: First 3 digits of number Better? Consider hash function for words Good Hashing Generally speaking we should: Choose M to be a prime number Calculate our hash function as h(x) = f(x) mod M where f(x) is some function that converts x into a large "random" integer in an intelligent way It is not actually random, but the idea is that if keys are converted into very large integers (much bigger than the number of actual keys) collisions will occur because of pigeonhole principle, but they will be less frequent Collision Resolution Back to Collision Resolution Open Addressing The simplest open addressing scheme is Linear Probing Idea: If a collision occurs at location i, try (in sequence) locations i+1, i+2, … (mod M) until the collision is resolved For Insert: Collision is resolved when an empty location is found For Find: Collision is resolved (found) when the item is found Collision is resolved (not found) when an empty location is found, or when index circles back to i Look at an example Collision Resolution Performance Theta(1) for Insert, Search for normal use, subject to the issues discussed below In normal use at most a few probes will be required before a collision is resolved Linear probing issues What happens as table fills with keys? Define LOAD FACTOR, = N/M How does affect linear probing performance? Consider a hash table of size M that is empty , using a good hash function Given a random key, x, what is the probability that x will be inserted into any location i in the table? Collision Resolution Consider now a hash table of size M that has a cluster of C consecutive locations that are filled Now given a random key, x, what is the probability that x will be inserted into the location immediately following the cluster? Discuss How can we "fix" this problem? Even AFTER a collision, we need to make all of the locations available to a key This way, the probability from filled locations will be redistributed throughout the empty locations in the table, rather than just being pushed down to the first empty location after the cluster Discuss Share with your friends:

The database is protected by copyright ©sckool.org 2022

send message