In computer science, string searching is a common problem where we need to find the occurrences of a pattern within a larger text. The Knuth-Morris-Pratt (KMP) algorithm is a powerful string searching algorithm that provides an efficient solution to this problem. It is particularly useful when the pattern has repetitive elements. In this blog post, we’ll explore the KMP algorithm and implement it in C++.
Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves the efficiency of string searching by utilizing the information gathered from previous character comparisons. It avoids unnecessary character comparisons by exploiting the knowledge of the pattern itself.
The main idea behind the KMP algorithm is to create a lookup table, often referred to as the “failure function” or “LPS” (Longest (proper) Prefix which is also a Suffix) table. This table helps in determining the next possible position to start the comparison when a mismatch occurs.
The KMP algorithm consists of two main steps:
- Preprocessing the pattern to compute a table of longest proper prefixes that are also suffixes (LPS table).
- Searching the text using the LPS table to skip comparisons.
Preprocessing
The preprocessing step takes pattern as the input and computes an array of integers called the LPS table. The LPS table stores the length of the longest proper prefix that is also a suffix for each position in the pattern. A proper prefix is a prefix that is not equal to the whole string.
For example, for the pattern = "ABA", the LPS table is [0, 0, 1].
Here, LPS[2] = 1, since the character at pattern[0] == pattern[2].
Pseudocode
ComputeLPS(pattern):
# Initialize LPS table with zeros
lps = {0}
# Initialize pointers
i = 1 # Current position in pattern
j = 0 # Length of current matching prefix
# Loop until end of pattern
for i = 1 to len(pattern):
# Case 1: Characters match
if pattern[i] == pattern[j]:
j++
# Case 2: Characters do not match
else:
# Decrement j to previous matching prefix length
while j > 0 && pattern[i] != pattern[j]:
j = lps[j-1]
lps[i] = j;
# Return LPS table
return lps
Explanation
The LPS table is computed by using a loop that iterates over the pattern and maintains two pointers: i and j. The pointer i tracks the current position in the pattern, while the pointer j tracks the length of the current matching prefix.
The loop works as follows:
- If characters at
iandjmatch, increment bothiandjand storejinLPStable ati. - If the characters at
iandjdo not match, decrementjto the value stored in theLPStable atLPS[j-1]untilpattern[i] == pattern[j]ORj == 0.
The loop ends when i reaches the end of the pattern.
Searching
The searching step takes the text and the pattern as input and uses the LPS table to find all the occurrences of the pattern in the text. It also maintains two pointers: i and j. The pointer i tracks the current position in the text, while the pointer j tracks the current position in the pattern. Initially, both i and j are set to 0.
Pseudocode
SearchKMP(text, pattern):
# Compute LPS table for pattern
lps = computeLPS(pattern)
# Initialize pointers
i = 0 # Current position in text
j = 0 # Current position in pattern
# Loop until end of text
while i < len(text):
# Case 1: Characters match
if text[i] == pattern[j]:
# Increment both pointers
i += 1
j += 1
# If end of pattern is reached, report a match
if j == len(pattern):
print("Pattern found at index", i-j)
# Reset j to previous matching prefix length
j = lps[j-1]
# Case 2: Characters do not match
else:
# Decrement j to previous matching prefix length
if j > 0:
j = lps[j-1]
# Or increment i if there's no prefix
else:
i += 1
Explanation
text = “BABABA”, pattern = “ABA”
LPS table = [0, 0, 1]
| Iteration | Value of i | Value of j | text[i] | pattern[j] | Pattern Found |
| 1 | 0 | 0 | B | A | No |
| 2 | 1 | 0 | A | A | No |
| 3 | 2 | 1 | B | B | No |
| 4 | 3 | 2 | A | A | Yes |
| 5 | 4 | 1 | B | B | No |
| 6 | 5 | 2 | A | A | Yes |
The loop works as follows:
- If i is equal to the length of the
text, end the loop. - If the characters at
iandjmatch, increment bothiandj. Ifjis equal to the length of thepattern, report a match ati-jand reset j tolps[j-1]. - If the characters at
iandjdo not match, decrementjtolps[j-1], or do nothing ifjis already0. Incrementiin this case.
This implies that if we’re searching for the in pattern = "ABA"text = "BABABA", and we reach and text[3]pattern[2], we get a match. However now, we would set the iterator j (traversing pattern) to 1 ie j = lps[2] = 1 (instead of 0) and we could continue to search the remaining . pattern "BA"
This saves us the extra iterations by not resetting the iterator j (traversing pattern) to 0 and iterator i (traversing text) to (i - j + 1) or i = 2.
Code Implementation
//
// main.cpp
// KMP Algorithm
//
// Created by Himanshu on 31/05/23.
//
#include <iostream>
#include <vector>
#include <string>
using namespace std;
vector<int> constructLPS (string pattern) {
int n = (int) pattern.length();
vector<int> lps(n, 0);
int j = 0;
for (int i = 1; i < n; i++) {
if (pattern[i] == pattern[j]) {
j++;
} else {
while (j > 0 && pattern[i] != pattern[j]) {
j = lps[j-1];
}
}
lps[i] = j;
}
return lps;
}
void kmpStringSearch (string text, string pattern) {
int n = (int) text.length();
int m = (int) pattern.length();
vector<int> lps = constructLPS(pattern);
int i = 0;
int j = 0;
while (i < n) {
if (text[i] == pattern[j]) {
i++;
j++;
}
//Since, j = m (size of pattern), we set j to lps[j-1]
if (j == m) {
cout<<"Pattern found at index "<< i - j<<endl;
j = lps[j-1];
} else if (i < n && text[i] != pattern[j]) {
if (j != 0) {
j = lps[j-1];
} else {
i++;
}
}
}
}
int main() {
string text = "BABABA";
string pattern = "ABA";
kmpStringSearch(text, pattern);
return 0;
}
Output
Pattern found at index 1 Pattern found at index 3
Time complexity: O(m + n)
where m is the length of the pattern and n is the length of the text.
Here’s a working code with step-by-step code run: KMP Algorithm