Knuth-Morris-Pratt (KMP) | String Searching Algorithm

In computer science, string searching is a common problem where we need to find the occurrences of a pattern within a larger text. The Knuth-Morris-Pratt (KMP) algorithm is a powerful string searching algorithm that provides an efficient solution to this problem. It is particularly useful when the pattern has repetitive elements. In this blog post, we’ll explore the KMP algorithm and implement it in C++.

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm improves the efficiency of string searching by utilizing the information gathered from previous character comparisons. It avoids unnecessary character comparisons by exploiting the knowledge of the pattern itself.
The main idea behind the KMP algorithm is to create a lookup table, often referred to as the “failure function” or “LPS” (Longest (proper) Prefix which is also a Suffix) table. This table helps in determining the next possible position to start the comparison when a mismatch occurs.

The KMP algorithm consists of two main steps:

1. Preprocessing the pattern to compute a table of longest proper prefixes that are also suffixes (LPS table).
2. Searching the text using the LPS table to skip comparisons.
Preprocessing

The preprocessing step takes pattern as the input and computes an array of integers called the LPS table. The LPS table stores the length of the longest proper prefix that is also a suffix for each position in the pattern. A proper prefix is a prefix that is not equal to the whole string.
For example, for the pattern = "ABA", the LPS table is [0, 0, 1].
Here, LPS[2] = 1, since the character at pattern[0] == pattern[2].

Pseudocode

ComputeLPS(pattern):
# Initialize LPS table with zeros
lps = {0}

# Initialize pointers
i = 1 # Current position in pattern
j = 0 # Length of current matching prefix

# Loop until end of pattern
for i = 1 to len(pattern):
# Case 1: Characters match
if pattern[i] == pattern[j]:
j++
# Case 2: Characters do not match
else:
# Decrement j to previous matching prefix length
while j > 0 && pattern[i] != pattern[j]:
j = lps[j-1]
lps[i] = j;

# Return LPS table
return lps

Explanation

The LPS table is computed by using a loop that iterates over the pattern and maintains two pointers: i and j. The pointer i tracks the current position in the pattern, while the pointer j tracks the length of the current matching prefix.

The loop works as follows:

• If characters at i and j match, increment both i and j and store j in LPS table at i.
• If the characters at i and j do not match, decrement j to the value stored in the LPS table at LPS[j-1] until pattern[i] == pattern[j] OR j == 0.

The loop ends when i reaches the end of the pattern.

Searching

The searching step takes the text and the pattern as input and uses the LPS table to find all the occurrences of the pattern in the text. It also maintains two pointers: i and j. The pointer i tracks the current position in the text, while the pointer j tracks the current position in the pattern. Initially, both i and j are set to 0.

Pseudocode

SearchKMP(text, pattern):
# Compute LPS table for pattern
lps = computeLPS(pattern)

# Initialize pointers
i = 0 # Current position in text
j = 0 # Current position in pattern

# Loop until end of text
while i < len(text):
# Case 1: Characters match
if text[i] == pattern[j]:
# Increment both pointers
i += 1
j += 1
# If end of pattern is reached, report a match
if j == len(pattern):
print("Pattern found at index", i-j)
# Reset j to previous matching prefix length
j = lps[j-1]
# Case 2: Characters do not match
else:
# Decrement j to previous matching prefix length
if j > 0:
j = lps[j-1]
# Or increment i if there's no prefix
else:
i += 1

Explanation

text = “BABABA”, pattern = “ABA”
LPS table = [0, 0, 1]

The loop works as follows:

• If i is equal to the length of the text, end the loop.
• If the characters at i and j match, increment both i and j. If j is equal to the length of the pattern, report a match at i-j and reset j to lps[j-1].
• If the characters at i and j do not match, decrement j to lps[j-1], or do nothing if j is already 0. Increment i in this case.

This implies that if we’re searching for the pattern = "ABA" in text = "BABABA", and we reach text[3] and pattern[2], we get a match. However now, we would set the iterator j (traversing pattern) to 1 ie j = lps[2] = 1 (instead of 0) and we could continue to search the remaining pattern "BA".
This saves us the extra iterations by not resetting the iterator j (traversing pattern) to 0 and iterator i (traversing text) to (i - j + 1) or i = 2.

Code Implementation

//
//  main.cpp
//  KMP Algorithm
//
//  Created by Himanshu on 31/05/23.
//

#include <iostream>
#include <vector>
#include <string>
using namespace std;

vector<int> constructLPS (string pattern) {
int n = (int) pattern.length();
vector<int> lps(n, 0);
int j = 0;

for (int i = 1; i < n; i++) {

if (pattern[i] == pattern[j]) {
j++;
} else {
while (j > 0 && pattern[i] != pattern[j]) {
j = lps[j-1];
}
}

lps[i] = j;
}

return lps;
}

void kmpStringSearch (string text, string pattern) {
int n = (int) text.length();
int m = (int) pattern.length();
vector<int> lps = constructLPS(pattern);

int i = 0;
int j = 0;

while (i < n) {

if (text[i] == pattern[j]) {
i++;
j++;
}

//Since, j = m (size of pattern), we set j to lps[j-1]
if (j == m) {
cout<<"Pattern found at index "<< i - j<<endl;
j = lps[j-1];
} else if (i < n && text[i] != pattern[j]) {

if (j != 0) {
j = lps[j-1];
} else {
i++;
}
}

}

}

int main() {
string text = "BABABA";
string pattern = "ABA";

kmpStringSearch(text, pattern);

return 0;
}


Output

Pattern found at index 1
Pattern found at index 3

Time complexity: O(m + n)
where m is the length of the pattern and n is the length of the text.

Here’s a working code with step-by-step code run: KMP Algorithm