In computer science, string searching is a common problem where we need to find the occurrences of a pattern within a larger text. The Knuth-Morris-Pratt (KMP) algorithm is a powerful string searching algorithm that provides an efficient solution to this problem. It is particularly useful when the pattern has repetitive elements. In this blog post, we’ll explore the KMP algorithm and implement it in C++.
Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves the efficiency of string searching by utilizing the information gathered from previous character comparisons. It avoids unnecessary character comparisons by exploiting the knowledge of the pattern itself.
The main idea behind the KMP algorithm is to create a lookup table, often referred to as the “failure function” or “LPS” (Longest (proper) Prefix which is also a Suffix) table. This table helps in determining the next possible position to start the comparison when a mismatch occurs.
The KMP algorithm consists of two main steps:
- Preprocessing the pattern to compute a table of longest proper prefixes that are also suffixes (LPS table).
- Searching the text using the LPS table to skip comparisons.
Preprocessing
The preprocessing step takes pattern
as the input and computes an array of integers called the LPS table
. The LPS table
stores the length of the longest proper prefix that is also a suffix for each position in the pattern. A proper prefix is a prefix that is not equal to the whole string.
For example, for the pattern = "ABA"
, the LPS table
is [0, 0, 1]
.
Here, LPS[2] = 1
, since the character at pattern[0] == pattern[2]
.
Pseudocode
ComputeLPS(pattern): # Initialize LPS table with zeros lps = {0} # Initialize pointers i = 1 # Current position in pattern j = 0 # Length of current matching prefix # Loop until end of pattern for i = 1 to len(pattern): # Case 1: Characters match if pattern[i] == pattern[j]: j++ # Case 2: Characters do not match else: # Decrement j to previous matching prefix length while j > 0 && pattern[i] != pattern[j]: j = lps[j-1] lps[i] = j; # Return LPS table return lps
Explanation
The LPS table is computed by using a loop that iterates over the pattern
and maintains two pointers: i and j
. The pointer i
tracks the current position in the pattern
, while the pointer j
tracks the length of the current matching prefix.
The loop works as follows:
- If characters at
i
andj
match, increment bothi
andj
and storej
inLPS
table ati
. - If the characters at
i
andj
do not match, decrementj
to the value stored in theLPS
table atLPS[j-1]
untilpattern[i] == pattern[j]
ORj == 0
.
The loop ends when i
reaches the end of the pattern
.
Searching
The searching step takes the text
and the pattern
as input and uses the LPS table
to find all the occurrences of the pattern
in the text
. It also maintains two pointers: i and j
. The pointer i
tracks the current position in the text
, while the pointer j
tracks the current position in the pattern
. Initially, both i
and j
are set to 0
.
Pseudocode
SearchKMP(text, pattern): # Compute LPS table for pattern lps = computeLPS(pattern) # Initialize pointers i = 0 # Current position in text j = 0 # Current position in pattern # Loop until end of text while i < len(text): # Case 1: Characters match if text[i] == pattern[j]: # Increment both pointers i += 1 j += 1 # If end of pattern is reached, report a match if j == len(pattern): print("Pattern found at index", i-j) # Reset j to previous matching prefix length j = lps[j-1] # Case 2: Characters do not match else: # Decrement j to previous matching prefix length if j > 0: j = lps[j-1] # Or increment i if there's no prefix else: i += 1
Explanation
text = “BABABA”, pattern = “ABA”
LPS table = [0, 0, 1]
Iteration | Value of i | Value of j | text[i] | pattern[j] | Pattern Found |
1 | 0 | 0 | B | A | No |
2 | 1 | 0 | A | A | No |
3 | 2 | 1 | B | B | No |
4 | 3 | 2 | A | A | Yes |
5 | 4 | 1 | B | B | No |
6 | 5 | 2 | A | A | Yes |
The loop works as follows:
- If i is equal to the length of the
text
, end the loop. - If the characters at
i
andj
match, increment bothi
andj
. Ifj
is equal to the length of thepattern
, report a match ati-j
and reset j tolps[j-1]
. - If the characters at
i
andj
do not match, decrementj
tolps[j-1]
, or do nothing ifj
is already0
. Incrementi
in this case.
This implies that if we’re searching for the
in pattern
= "ABA"text = "BABABA"
, and we reach
and text
[3]pattern[2]
, we get a match. However now, we would set the iterator j (traversing pattern)
to 1
ie j = lps[2] = 1
(instead of 0
) and we could continue to search the remaining
. pattern
"BA"
This saves us the extra iterations by not resetting the iterator j (traversing pattern)
to 0
and iterator i (traversing text)
to (i - j + 1) or i = 2
.
Code Implementation
//
// main.cpp
// KMP Algorithm
//
// Created by Himanshu on 31/05/23.
//
#include <iostream>
#include <vector>
#include <string>
using namespace std;
vector<int> constructLPS (string pattern) {
int n = (int) pattern.length();
vector<int> lps(n, 0);
int j = 0;
for (int i = 1; i < n; i++) {
if (pattern[i] == pattern[j]) {
j++;
} else {
while (j > 0 && pattern[i] != pattern[j]) {
j = lps[j-1];
}
}
lps[i] = j;
}
return lps;
}
void kmpStringSearch (string text, string pattern) {
int n = (int) text.length();
int m = (int) pattern.length();
vector<int> lps = constructLPS(pattern);
int i = 0;
int j = 0;
while (i < n) {
if (text[i] == pattern[j]) {
i++;
j++;
}
//Since, j = m (size of pattern), we set j to lps[j-1]
if (j == m) {
cout<<"Pattern found at index "<< i - j<<endl;
j = lps[j-1];
} else if (i < n && text[i] != pattern[j]) {
if (j != 0) {
j = lps[j-1];
} else {
i++;
}
}
}
}
int main() {
string text = "BABABA";
string pattern = "ABA";
kmpStringSearch(text, pattern);
return 0;
}
Output
Pattern found at index 1 Pattern found at index 3
Time complexity: O(m + n)
where m is the length of the pattern and n is the length of the text.
Here’s a working code with step-by-step code run: KMP Algorithm