|
|
Page | 7/24 | Date | 03.05.2017 | Size | 186.06 Kb. |
| - Double Hashing
- Idea: When a collision occurs, increment the index, just as in linear probing. However, now do not automatically choose 1 as the increment value
- Instead use a second hash function to determine the increment
- Discuss
- Look at example
- We must be careful to ensure that double hashing always "works"
- Make sure increment is > 0
- Make sure no index is tried twice before all are tried once
Collision Resolution - As N increases, double hashing shows a definite improvement over linear probing
- However, as N approaches M, both schemes degrade to Theta(N) performance
- Since there are only M locations in the table, as it fills there become fewer empty locations remaining
- Multiple collisions will occur even with double hashing
- This is especially true for inserts and unsuccessful finds
- Both of these continue until an empty location is found, and few of these exist
- Thus it could take close to M probes before the collision is resolved
- Since the table is almost full Theta(M) = Theta(N)
Collision Resolution - Open Addressing Issues
- We have just seen that performance degrades as N approaches M
- Typically for open addressing we want to keep the table partially empty
- What about delete?
- Discuss problem
- Discuss pseudo-solution
- Can we use hashing without delete?
- Yes, in some cases (ex: compiler using language keywords)
Collision Resolution - Closed Addressing
- Most common form is separate chaining
- Use a simple linked-list at each location in the table
- Look at example
- Discuss performance
- Note performance is dependent upon chain length
- Chain length is determined by the load factor,
- As long as is a small constant, performance is still Theta(1)
- Graceful degradation
- However, a poor hash function may degrade this into Theta(N)
- Discuss
Collision Resolution - Would other collections improve over separate chaining?
- Sorted array?
- Space overhead if we make it large and copying overhead if we need to resize it
- Inserts require shifting
- BST?
- Could work
- But is it worth it?
- Not really – separate chaining is simpler and we want a good hash function anyway
String Matching - Basic Idea:
- Given a pattern string P, of length M
- Given a text string, A, of length N
- Do all characters in P match a substring of the characters in A, starting from some index i?
- Brute Force (Naïve) Algorithm:
- int brutesearch(char *p, char *a)
- {
- int i, j, M = strlen(p), N = strlen(a);
- for (i = 0, j = 0; j < M && i < N; i++, j++)
- if (a[i] != p[j]) { i -= j; j = -1; }
- if (j == M) return i-M; else return i;
- }
- Do example
String Matching - Performance of Naïve algorithm?
- Normal case?
- Perhaps a few char matches occur prior to a mismatch
- Worst case situation and run-time?
- Theta(N + M) = Theta(N) when N >> M
- A = XXXXXXXXXXXXXXXXXXXXXXXXXXY
- P = XXXXY
- P must be completely compared each time we move one index down A
- M(N-M+1) = Theta(NM) when N >> M
String Matching - Improvements?
- Two ideas
- Improve the worst case performance
- Good theoretically, but in reality the worst case does not occur very often for ASCII strings
- Perhaps for binary strings it may be more important
- Improve the normal case performance
- This will be very helpful, especially for searches in long files
KMP - KMP (Knuth Morris Pratt)
- Improves the worst case, but not the normal case
- Idea is to prevent index from ever going "backward" in the text string
- This will guarantee Theta(N) runtime in the worst case
- How is it done?
- Pattern is preprocessed to look for "sub" patterns
- As a result of the preprocessing that is done, we can create a "next" array that is used to determine the next character in the pattern to examine
KMP - We don't want to worry too much about the details here
- int kmpsearch(char *p, char *a)
- {
- int i, j, M = strlen(p), N = strlen(a);
- initnext(p);
- for (i = 0, j = 0; j < M && i < N; i++, j++)
- while ((j >= 0) && (a[i] != p[j])) j = next[j];
- if (j == M) return i-M; else return i;
- }
- Note that i never decreases and whenever i is not changing (in the while loop), j is increasing
- Run-time is clearly Theta(N+M) = Theta(N) in the worst case
- Useful if we are accessing the text as a continuous stream (it is not buffered)
Rabin Karp - Let's take a different approach:
- We just discussed hashing as a way of efficiently accessing data
- Can we also use it for string matching?
- Consider the hash function we discussed for strings:
- s[0]*Bn-1 + s[1]*Bn-2 + … + s[n-2]*B1 + s[n-1]
- where B is some integer (31 in JDK)
- Recall that we said that if B == number of characters in the character set, the result would be unique for all strings
- Thus, if the integer values match, so do the strings
Share with your friends: |
The database is protected by copyright ©sckool.org 2020
send message
|
|