# These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else

 Page 8/24 Date 03.05.2017 Size 186.06 Kb. #18863

## Rabin Karp

• Ex: if B = 32
• h("CAT") === 67*322 + 65*321 + 84 == 70772
• To search for "CAT" we can thus "hash" all 3-char substrings of our text and test the values for equality
• Let's modify this somewhat to make it more useful / appropriate
• We need to keep the integer values of some reasonable size
• Ex: No larger than an int or long value
• We need to be able to incrementally update a value so that we can progress down a text string looking for a match

## Rabin Karp

• Both of these are taken care of in the Rabin Karp algorithm
• The hash values are calculated "mod" a large integer, to guarantee that we won't get overflow
• Due to properties of modulo arithmetic, characters can be "removed" from the beginning of a string almost as easily as they can be "added" to the end
• Idea is with each mismatch we "remove" the leftmost character from the hash value and we add the next character from the text to the hash value
• Show on board
• Let's look at the code

## Rabin Karp

• const int q = 33554393;
• const int d = 32;
• int rksearch(char *p, char *a)
• {
• int i, dM = 1, h1 = 0, h2 = 0;
• int M = strlen(p), N = strlen(a);
• for (i = 1; i < M; i++) dM = (d*dM) % q;
• for (i = 0; i < M; i++)
• {
• h1 = (h1*d+index(p[i])) % q; // hash pattern
• h2 = (h2*d+index(a[i])) % q; // hash beg. of text
• }
• for (i = 0; h1 != h2; i++)
• {
• h2 = (h2+d*q-index(a[i])*dM) % q; // remove 1st
• h2 = (h2*d+index(a[i+M])) % q; // add next
• if (i > N-M) return N;
• }
• return i;
• }

## Rabin Karp

• The algorithm as presented in the text is not quite correct – what is missing?
• Does not handle collisions
• It assumes that if the hash values match the strings match – this may not be the case
• Although with such a large "table size" a collision is not likely, it is possible
• How do we fix this?
• If hash values match we then compare the character values
• If they match, we have found the pattern
• If they do not match, we have a collision and we must continue the search

## Rabin Karp

• Runtime?
• Assuming no or few collisions, we must look at each character in the text at most two times
• As long as are arithmetic can be done in constant time (which it can as long as we are using fixed-length integers) then our overall runtime should be Theta(N) in the average case
• Note: In the worst case, the run-time is Theta(MN), just like the naïve algorithm
• However, this case is highly unlikely
• Why? Discuss
• However, we still haven't really improved on the "normal case" runtime

## Boyer Moore

• What if we took yet another approach?
• Look at the pattern from right to left instead of left to right
• Now, if we mismatch a character early, we have the potential to skip many characters with only one comparison
• Consider the following example:
• A = ABCDVABCDWABCDXABCDYABCDZ
• P = ABCDE
• If we first compare E and V, we learn two things:
• V does not match E
• V does not appear anywhere in the pattern
• How does that help us?

## Boyer Moore

• We can now skip the pattern over M positions, after only one comparison
• Continuing in the same fashion gives us a very good search time
• Show on board
• Assuming our search progresses as shown, how many comparisons are required?
• Will our search progress as shown?
• Not always, but when searching text with a relatively large alphabet, we often encounter characters that do not appear in the pattern
• This algorithm allows us to take advantage of this fact
• N/M

## Boyer Moore

• Details
• The technique we just saw is the mismatched character (MC) heuristic
• It is one of two heuristics that make up the Boyer Moore algorithm
• The second heuristic is similar to that of KMP, but processing from right to left
• Does MC always work so nicely?
• No – it depends on the text and pattern
• Since we are processing right to left, there are some characters in the text that we don't even look at
• We need to make sure we don't "miss" a potential match

## Boyer Moore

• Consider the following:
• A = XYXYXXYXYYXYXYZXYXYXXYXYYXYXYX
• P = XYXYZ
• Discuss on board
• Now the mismatched character DOES appear in the pattern
• When "sliding" the pattern to the right, we must make sure not to go farther than where the mismatched character in A is first seen (from the right) in P
• In the first comparison above, X does not match Z, but it does match an X two positions down (from the right)

## Boyer Moore

• How do we do it?
• Preprocess the pattern to create a skip array
• Array indexed on ALL characters in alphabet
• Each value indicates how many positions we can skip given a mismatch on that character in the text
• for all i skip[i] = M
• for (int j = 0; j < M; j++)
• skip[index(p[j])] = M - j - 1;
• Idea is that initially all chars in the alphabet can give the maximum skip
• Skip lessens as characters are found further to the right in the pattern