Rabin Karp Substring Matching
Just some background before getting into Rabin Karp
Substring Matching Problem:
Given string s of length N, pattern p of length M, find the index of p in s if possible, otherwise return -1
Naive Approach
For each position i in s, check if s[i:i+M] == p
Time complexity:
Best Case: O(N+M)
Worst Case: O(NM)
Rabin Karp
Time complexity:
Best Case: O(N+M)
Worst Case: O(NM)
Average Case: O(N+M), therefore RabinKarp wins
Intuition:
Just like the naive approach, for each position i in s, check if s[i:i+M] == p.
But instead of doing the check with letter-by-letter comparison, we do the check by comparing hash(s[i:i+M]) and hash(p), if they match, then we do a letter-by-letter comparison to avoid false positive due to hash collision.
If letter-by-letter comparison works, then we find the pattern, otherwise do the same trick at next index i+1, until end of s.
The key why RabinKarp is faster: if we know hash(s[i:i+M]), then it takes no time to calculate hash(s[i+1:i+1+M]), such hash function is called rolling hash.
So the above formula describes how we could calculate hash of M characters starting at position i. Here, R is the radix(e.g it’s 256 in our case). Q is any big prime number.
If we calculate hash using this formula, it would be slow since there are some exponential calculations. There is something called the Honor’s method that that does this in O(M):
How do we calculate hash(s[i+1:i+1+M]) given hash(s[i:i+M])?
Note that in the above picture, R^{M-1} is a constant that can be pre-computed in advance.
Here’s a RabinKarp implementation for copy-paster :)