These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else



Download 186.06 Kb.
Page7/24
Date03.05.2017
Size186.06 Kb.
1   2   3   4   5   6   7   8   9   10   ...   24

Collision Resolution

      • Double Hashing
        • Idea: When a collision occurs, increment the index, just as in linear probing. However, now do not automatically choose 1 as the increment value
          • Instead use a second hash function to determine the increment
        • Discuss
        • Look at example
      • We must be careful to ensure that double hashing always "works"
        • Make sure increment is > 0
        • Make sure no index is tried twice before all are tried once
          • Show example

Collision Resolution

      • As N increases, double hashing shows a definite improvement over linear probing
        • Discuss
      • However, as N approaches M, both schemes degrade to Theta(N) performance
        • Since there are only M locations in the table, as it fills there become fewer empty locations remaining
        • Multiple collisions will occur even with double hashing
        • This is especially true for inserts and unsuccessful finds
          • Both of these continue until an empty location is found, and few of these exist
          • Thus it could take close to M probes before the collision is resolved
          • Since the table is almost full Theta(M) = Theta(N)

Collision Resolution

    • Open Addressing Issues
      • We have just seen that performance degrades as N approaches M
        • Typically for open addressing we want to keep the table partially empty
      • What about delete?
        • Discuss problem
        • Discuss pseudo-solution
        • Can we use hashing without delete?
          • Yes, in some cases (ex: compiler using language keywords)

Collision Resolution

  • Closed Addressing
    • Most common form is separate chaining
      • Use a simple linked-list at each location in the table
        • Look at example
        • Discuss performance
      • Note performance is dependent upon chain length
        • Chain length is determined by the load factor, 
        • As long as  is a small constant, performance is still Theta(1)
          • Graceful degradation
          • However, a poor hash function may degrade this into Theta(N)
          • Discuss

Collision Resolution

    • Would other collections improve over separate chaining?
      • Sorted array?
        • Space overhead if we make it large and copying overhead if we need to resize it
        • Inserts require shifting
      • BST?
        • Could work
        • But is it worth it?
          • Not really – separate chaining is simpler and we want a good hash function anyway

String Matching

  • Basic Idea:
    • Given a pattern string P, of length M
    • Given a text string, A, of length N
      • Do all characters in P match a substring of the characters in A, starting from some index i?
    • Brute Force (Naïve) Algorithm:
      • int brutesearch(char *p, char *a)
      • {
      • int i, j, M = strlen(p), N = strlen(a);
      • for (i = 0, j = 0; j < M && i < N; i++, j++)
      • if (a[i] != p[j]) { i -= j; j = -1; }
      • if (j == M) return i-M; else return i;
      • }
      • Do example

String Matching

  • Performance of Naïve algorithm?
    • Normal case?
      • Perhaps a few char matches occur prior to a mismatch
    • Worst case situation and run-time?
  • Theta(N + M) = Theta(N) when N >> M
  • A = XXXXXXXXXXXXXXXXXXXXXXXXXXY
  • P = XXXXY
  • P must be completely compared each time we move one index down A
  • M(N-M+1) = Theta(NM) when N >> M

String Matching

  • Improvements?
    • Two ideas
      • Improve the worst case performance
        • Good theoretically, but in reality the worst case does not occur very often for ASCII strings
        • Perhaps for binary strings it may be more important
      • Improve the normal case performance
        • This will be very helpful, especially for searches in long files

KMP

  • KMP (Knuth Morris Pratt)
    • Improves the worst case, but not the normal case
    • Idea is to prevent index from ever going "backward" in the text string
      • This will guarantee Theta(N) runtime in the worst case
    • How is it done?
      • Pattern is preprocessed to look for "sub" patterns
      • As a result of the preprocessing that is done, we can create a "next" array that is used to determine the next character in the pattern to examine

KMP

      • We don't want to worry too much about the details here
      • int kmpsearch(char *p, char *a)
      • {
      • int i, j, M = strlen(p), N = strlen(a);
      • initnext(p);
      • for (i = 0, j = 0; j < M && i < N; i++, j++)
      • while ((j >= 0) && (a[i] != p[j])) j = next[j];
      • if (j == M) return i-M; else return i;
      • }
      • Note that i never decreases and whenever i is not changing (in the while loop), j is increasing
      • Run-time is clearly Theta(N+M) = Theta(N) in the worst case
      • Useful if we are accessing the text as a continuous stream (it is not buffered)

Rabin Karp

  • Let's take a different approach:
    • We just discussed hashing as a way of efficiently accessing data
    • Can we also use it for string matching?
    • Consider the hash function we discussed for strings:
      • s[0]*Bn-1 + s[1]*Bn-2 + … + s[n-2]*B1 + s[n-1]
        • where B is some integer (31 in JDK)
      • Recall that we said that if B == number of characters in the character set, the result would be unique for all strings
      • Thus, if the integer values match, so do the strings

Download 186.06 Kb.

Share with your friends:
1   2   3   4   5   6   7   8   9   10   ...   24




The database is protected by copyright ©sckool.org 2020
send message

    Main page