These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else



Download 186.06 Kb.
Page8/24
Date03.05.2017
Size186.06 Kb.
#18863
1   ...   4   5   6   7   8   9   10   11   ...   24

Rabin Karp

    • Ex: if B = 32
      • h("CAT") === 67*322 + 65*321 + 84 == 70772
      • To search for "CAT" we can thus "hash" all 3-char substrings of our text and test the values for equality
    • Let's modify this somewhat to make it more useful / appropriate
      • We need to keep the integer values of some reasonable size
        • Ex: No larger than an int or long value
      • We need to be able to incrementally update a value so that we can progress down a text string looking for a match

Rabin Karp

    • Both of these are taken care of in the Rabin Karp algorithm
      • The hash values are calculated "mod" a large integer, to guarantee that we won't get overflow
      • Due to properties of modulo arithmetic, characters can be "removed" from the beginning of a string almost as easily as they can be "added" to the end
        • Idea is with each mismatch we "remove" the leftmost character from the hash value and we add the next character from the text to the hash value
        • Show on board
      • Let's look at the code

Rabin Karp

  • const int q = 33554393;
  • const int d = 32;
  • int rksearch(char *p, char *a)
  • {
  • int i, dM = 1, h1 = 0, h2 = 0;
  • int M = strlen(p), N = strlen(a);
  • for (i = 1; i < M; i++) dM = (d*dM) % q;
  • for (i = 0; i < M; i++)
  • {
  • h1 = (h1*d+index(p[i])) % q; // hash pattern
  • h2 = (h2*d+index(a[i])) % q; // hash beg. of text
  • }
  • for (i = 0; h1 != h2; i++)
  • {
  • h2 = (h2+d*q-index(a[i])*dM) % q; // remove 1st
  • h2 = (h2*d+index(a[i+M])) % q; // add next
  • if (i > N-M) return N;
  • }
  • return i;
  • }

Rabin Karp

    • The algorithm as presented in the text is not quite correct – what is missing?
      • Does not handle collisions
      • It assumes that if the hash values match the strings match – this may not be the case
      • Although with such a large "table size" a collision is not likely, it is possible
      • How do we fix this?
        • If hash values match we then compare the character values
          • If they match, we have found the pattern
          • If they do not match, we have a collision and we must continue the search

Rabin Karp

    • Runtime?
      • Assuming no or few collisions, we must look at each character in the text at most two times
      • As long as are arithmetic can be done in constant time (which it can as long as we are using fixed-length integers) then our overall runtime should be Theta(N) in the average case
      • Note: In the worst case, the run-time is Theta(MN), just like the naïve algorithm
        • However, this case is highly unlikely
          • Why? Discuss
      • However, we still haven't really improved on the "normal case" runtime

Boyer Moore

  • What if we took yet another approach?
    • Look at the pattern from right to left instead of left to right
      • Now, if we mismatch a character early, we have the potential to skip many characters with only one comparison
      • Consider the following example:
      • A = ABCDVABCDWABCDXABCDYABCDZ
      • P = ABCDE
      • If we first compare E and V, we learn two things:
        • V does not match E
        • V does not appear anywhere in the pattern
      • How does that help us?

Boyer Moore

      • We can now skip the pattern over M positions, after only one comparison
        • Continuing in the same fashion gives us a very good search time
        • Show on board
      • Assuming our search progresses as shown, how many comparisons are required?
      • Will our search progress as shown?
        • Not always, but when searching text with a relatively large alphabet, we often encounter characters that do not appear in the pattern
        • This algorithm allows us to take advantage of this fact
  • N/M

Boyer Moore

  • Details
    • The technique we just saw is the mismatched character (MC) heuristic
      • It is one of two heuristics that make up the Boyer Moore algorithm
      • The second heuristic is similar to that of KMP, but processing from right to left
    • Does MC always work so nicely?
      • No – it depends on the text and pattern
      • Since we are processing right to left, there are some characters in the text that we don't even look at
      • We need to make sure we don't "miss" a potential match

Boyer Moore

    • Consider the following:
    • A = XYXYXXYXYYXYXYZXYXYXXYXYYXYXYX
    • P = XYXYZ
      • Discuss on board
      • Now the mismatched character DOES appear in the pattern
        • When "sliding" the pattern to the right, we must make sure not to go farther than where the mismatched character in A is first seen (from the right) in P
        • In the first comparison above, X does not match Z, but it does match an X two positions down (from the right)

Boyer Moore

    • How do we do it?
    • Preprocess the pattern to create a skip array
      • Array indexed on ALL characters in alphabet
      • Each value indicates how many positions we can skip given a mismatch on that character in the text
        • for all i skip[i] = M
        • for (int j = 0; j < M; j++)
        • skip[index(p[j])] = M - j - 1;
      • Idea is that initially all chars in the alphabet can give the maximum skip
      • Skip lessens as characters are found further to the right in the pattern

Download 186.06 Kb.

Share with your friends:
1   ...   4   5   6   7   8   9   10   11   ...   24




The database is protected by copyright ©sckool.org 2022
send message

    Main page