Skip to main content

The Boyer-Moore Algorithm

· 3 min read
Arteii
Student

Pattern matching is a fundamental problem in computer science with applications in text processing, search engines, DNA sequencing, and more. The Boyer-Moore algorithm, introduced by Robert S. Boyer and J Strother Moore in 1977, remains one of the most efficient algorithms for string matching in real-world applications. It takes advantage of information gleaned during the mismatching process to skip portions of the text, often achieving sublinear time complexity on average.

Problem Statement

Given a text string TT of length nn and a pattern string PP of length mm, the goal of pattern matching is to find all occurrences of PP as a substring of TT. The brute-force approach requires checking each possible alignment of PP with TT, which results in a time complexity of O(nm)O(n \cdot m). The Boyer-Moore algorithm improves upon this by skipping unnecessary comparisons.

The Boyer-Moore Algorithm

The Boyer-Moore algorithm matches the pattern PP from right to left, starting at the end of the pattern. If a mismatch occurs, the algorithm shifts the pattern over the text based on two preprocessed heuristics: the Bad Character rule and the Good Suffix rule.

Bad Character Rule

The Bad Character rule works by identifying mismatches and using the information to shift the pattern efficiently. Given a mismatch between the text character T[i]T[i] and the corresponding pattern character P[j]P[j], we look for the rightmost occurrence of T[i]T[i] in PP.

  • If T[i]T[i] exists in PP, we shift the pattern such that this occurrence aligns with T[i]T[i].
  • If T[i]T[i] does not exist in PP, we shift the entire pattern past the mismatched character.

Mathematically, let:

shiftbad_char(i,j)=max(1,jrightmost(T[i]))\text{shift}_{\text{bad\_char}}(i, j) = \max(1, j - \text{rightmost}(T[i]))

where rightmost(T[i])\text{rightmost}(T[i]) is the rightmost occurrence of T[i]T[i] in PP, or -1 if it does not appear.

Example:

Consider the pattern P=ABCAP = \text{ABCA} and text T=CBABCAT = \text{CBABCA}. Suppose a mismatch occurs at position i=2i = 2, where T[2]=BT[2] = \text{B} and P[1]=AP[1] = \text{A}. Since B\text{B} exists in PP at position 1, we shift the pattern such that this occurrence aligns with T[2]T[2].

Good Suffix Rule

The Good Suffix rule leverages suffixes in the pattern to skip unnecessary shifts. When a mismatch occurs after a suffix of the pattern has been matched, the algorithm shifts the pattern such that the next occurrence of this suffix aligns with the text.

Mathematically, let:

shiftgood_suffix(j)=mnext_suffix(j)\text{shift}_{\text{good\_suffix}}(j) = m - \text{next\_suffix}(j)

where next_suffix(j)\text{next\_suffix}(j) is the position of the next occurrence of the matched suffix in the pattern.

There are two cases for shifting based on good suffixes:

  • Case 1: The matched suffix appears elsewhere in the pattern. In this case, we shift the pattern to align the next occurrence of the suffix.
  • Case 2: The matched suffix does not appear in the pattern, but some prefix of the pattern matches a suffix of the text. In this case, we shift the pattern such that the prefix aligns with the suffix.

Example:

For P=ABCAP = \text{ABCA} and T=CBABCAT = \text{CBABCA}, if a mismatch occurs after matching the suffix "BCA", the pattern is shifted to the next occurrence of "BCA" in PP.

Proof of Efficiency

The Boyer-Moore algorithm operates in sublinear time on average due to the efficiency of its heuristics. Let mm be the length of the pattern and nn the length of the text. The algorithm's worst-case time complexity is O(nm)O(n \cdot m), but in practice, the average time complexity is much closer to O(n/m)O(n / m), particularly when the alphabet size is large.

This sublinear performance arises because the pattern shifts more than one character in many cases, especially when there are frequent mismatches or when the pattern is long relative to the text.

Performance Comparison Between Bad Character and Good Suffix Heuristics

The Bad Character rule tends to perform well when the alphabet is large and when mismatches are common because it leverages the presence of mismatched characters to skip unnecessary comparisons. The Good Suffix rule is more effective when the pattern has repeated structures or when mismatches occur near the end of the pattern, allowing it to skip larger sections of the text.

In practical applications, the combination of these two heuristics provides superior performance over either heuristic used in isolation.

  1. Bad Character: More efficient with larger alphabets and random text.
  2. Good Suffix: More efficient with repetitive patterns and long suffix matches.

In most cases, the Boyer-Moore algorithm uses both heuristics together, allowing the pattern to shift as far as the larger of the two heuristics suggests.

Conclusion

The Boyer-Moore algorithm is one of the most efficient string-matching algorithms for practical use due to its ability to skip portions of the text rather than performing a brute-force search. By employing both the Bad Character and Good Suffix heuristics, the algorithm achieves sublinear time complexity on average. Its efficiency is particularly notable in scenarios where patterns are long or mismatches occur frequently.

References

  • Robert S. Boyer - Wikipedia
  • J Strother Moore - Wikipedia
  • Boyer–Moore string-search algorithm - Wikipedia
  • Boyer-Moore Algorithm - GeeksforGeeks
  • A. V. Aho, and M. J. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM, 1975.
  • T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2009.