The Rabin-Karp algorithm is an efficient string-searching algorithm that uses hashing to find all occurrences of a pattern string within a larger text.
Here is the problem we are trying to solve:
If we have text string S and pattern string P, we want to determine whether or not P is found in S, i.e. P is a substring of S.
The naive algorithm for this problem would involve sliding the sub-string over the string. If the length of the sub-string in \(m\) and the length of the string in \(n\), we will need \(O(nm)\) time because string comparision has time complexity in line with the length of the string.
Core Idea
It uses hashing instead of direct character-by-character comparison:
- Hash Calculation
- Compute hash of the pattern.
- Compute hash of the first window of text.
- Rolling Hash Update
- Efficiently update hash when sliding the window:
- Remove leading character’s contribution.
- Multiply by base (d).
- Add trailing character’s contribution.
- Efficiently update hash when sliding the window:
- Comparison
- If hash values match, check characters directly.
- Continue until all windows are checked.
Rolling Hash
We treat the string as a number in a chosen base (d) (often the alphabet size, e.g., 256 for ASCII).
We also choose a large prime (q) to reduce collisions by taking results modulo (q).
For a substring \(S = s_0 s_1 \dots s_{m-1}\), its hash is:
Here:
- \(s_i\) is the integer code of the character.
- \(d\) is the base.
- \(q\) is the modulus.
Suppose we have the hash of substring \(T[i \dots i+m-1]\), and we want the hash of the next window \(T[i+1 \dots i+m]\).
Let:
- \(h_i\) = hash of window starting at position \(i\).
- \(h_{i+1}\) = hash of window starting at position \(i+1\).
We can compute \(h_{i+1}\) from \(h_i\) in constant time:
For this, all we have the store is the first character in the segment.
Complexity Analysis
| Case | Time Complexity | Explanation |
|---|---|---|
| Best / Average Case | \(O(n + m)\) | Hash comparisons are fast; collisions are rare. Expected linear time. |
| Worst Case | \(O(n \cdot m)\) | If many hash collisions occur, each requires full character comparison. |
| Space Complexity | \(O(1)\) | Only a few integer variables for hashes and constants. |
- Expected performance: Linear in practice, because good hash functions minimize collisions.
- Worst-case performance: Quadratic, but rare in real-world applications.
Example problems
It is efficient for multiple pattern matching (can hash multiple patterns and check simultaneously).
As a result, it is useful in plagiarism detection, search engines, and DNA sequence analysis.