I've been working on web based projects built mainly with PHP and JavaScript, where I mostly use Zend Framework and jQuery. I am interested in any webpage optimizations techniques - for a faster web! Stoimen is a DZone MVB and is not an employee of DZone and has posted 96 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Boyer-Moore String Searching

04.17.2012
| 28890 views |
  • submit to reddit

Have you ever asked yourself which algorithm is used to find a word after clicking Ctrl+F and typing something? Well I guess you know the answer from the title, but in this article you’ll find out how exactly this is done.

As we saw from the Morris-Pratt string searching we don’t need to compare the text and the pattern character by character. Some comparisons can be skipped in order to improve the performance of the string searching. Indeed the brute force string searching and the Rabin-Karp algorithm are quite slow only because they compare the pattern and the text character by character.

On the other hand, the Morris-Pratt algorithm is a very good improvement of the brute force string searching, but the question remains: Is there any algorithm that is faster than Morris-Pratt – is there any way to skip more comparisons and to move the pattern faster?

It’s clear that if we have to find whether a single character is contained into a text we need at least “n” steps, where n is the length of the text. Once we have to find whether a pattern with the length of “m” is contained into a text with length of “n” the case is getting a little more complex.

However, the answer is that there is such an algorithm that is faster and more suitable than Morris-Pratt. This is the Boyer-Moore string searching.

Overview

Boyer-Moore is an algorithm that improves the performance of pattern searching into a text by considering some observations. It was defined in 1977 by Robert S. Boyer and J Strother Moore and it consists of some specific features.

First of all this algorithm starts comparing the pattern from the leftmost part of the text and moves to the right, as shown in the picture below.



Unlike other string searching algorithms, though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.



The main idea of Boyer-Moore, in order to improve the performance, are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see in the following examples what they are standing for.

Good-suffix Shifts

Just like the Morris-Pratt algorithm, we start to compare the pattern against some portion of the text where a possible match will occur. In Boyer-Moore, as I said, this is done from the rightmost letter of the pattern. After some characters have matched we find a mismatch.


So how can we move the pattern to the right in order to skip unusual comparisons. To answer this question we need to explore the pattern. Let’s say there is a portion of the pattern that is repeated inside the pattern itself, like it is shown in the picture below.



In this case we must move the pattern, so the repeated portion must now align with its first occurrence in the pattern.

A variation of this case is when the portion from the pattern A overlaps with another portion that consists of the same characters.



Yet again, the shift must align the second portion with its first occurrence.

Finally, only a portion of A, let’s say “B”, can happen to occur in the very beginning of the pattern, as on the diagram below.



Now we must align the left end of the pattern with the rightmost occurrence of “B”.

Bad Character Shifts

Beside the good-suffix shifts, the Boyer-Moore algorithm makes use of the so called "bad-character shifts". In the case of a mismatch, we can skip comparisons in case the character in the text doesn’t happen to appear in the pattern. To become clearer let’s see the following examples.


In the picture above we see that the mismatched character “B” from the text appears only in the beginning of the pattern. Thus, we can simply shift the pattern to the right and align both characters B, skipping comparisons. An even better case is described by the following diagram where the mismatched letter isn’t contained in the pattern at all. Then we can shift forward the whole pattern.



Maximum of Good-suffix and Bad-Character shifts

Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch, the maximum of both is considered in order to move the pattern to the right.

Complexity

It’s clear that Boyer-Moore is faster than Morris-Pratt, but actually its worst-case complexity is O(n+m). The thing is that in natural language search, Boyer-Moore does pretty well.




Implementation

Finally let’s see the implementation in PHP, which can be easily “transcribed” into any other programming language. The only thing we need is the structures for bad-character shifts and good-suffixes shifts.

<?php
 
/**
 * Pattern we're searching for
 *
 * @var string
 */
$pattern = 'gloria';
 
/**
 * The text we're searching in
 *
 * @var string
 */
$text = 'Sic transit gloria mundi, non transit gloria Gundi!';
 
/**
 * Calculates the suffixes for a given pattern
 *
 * @param string $pattern
 * @param array  $suffixes
 */
function suffixes($pattern, &$suffixes)
{
   $m = strlen($pattern);
 
   $suffixes[$m - 1] = $m;
   $g = $m - 1;
 
   for ($i = $m - 2; $i >= 0; --$i) {
      if ($i > $g && $suffixes[$i + $m - 1 - $f] < $i - $g) {
         $suffixes[$i] = $suffixes[$i + $m - 1 - $f];
      } else {
         if ($i < $g) {
            $g = $i;
         }
         $f = $i;
 
         while ($g >= 0 && $pattern[$g] == $pattern[$g + $m - 1 - $f]) {
            $g--;
         }
         $suffixes[$i] = $f - $g;
      }
   }
}
 
/**
 * Fills in the array of bad characters.
 *
 * @param string $pattern
 * @param array  $badChars
 */
function badCharacters($pattern, &$badChars)
{
   $m = strlen($pattern);
 
   for ($i = 0; $i < $m - 1; ++$i) {
      $badChars[$pattern{$i}] = $m - $i - 1;
   }
}
 
/**
 * Fills in the array of good suffixes
 *
 * @param string $pattern
 * @param array  $goodSuffixes
 */
function goodSuffixes($pattern, &$goodSuffixes)
{
   $m 		= strlen($pattern);
   $suff 	= array();
 
   suffixes($pattern, $suff);
 
   for ($i = 0; $i < $m; $i++) {
      $goodSuffixes[$i] = $m;
   }
 
   for ($i = $m - 1; $i >= 0; $i--) {
      if ($suff[$i] == $i + 1) {
         for ($j = 0; $j < $m - $i - 1; $j++) {
            if ($goodSuffixes[$j] == $m) {
               $goodSuffixes[$j] = $m - $i - 1;
            }
         }
      }
   }
 
   for ($i = 0; $i < $m - 2; $i++) {
      $goodSuffixes[$m - 1 - $suff[$i]] = $m - $i - 1;
   }
}
 
/**
 * Performs a search of the pattern into a given text
 *
 * @param string $pattern
 * @param string $text
 */
function boyer_moore($pattern, $text)
{
   $n = strlen($text);
   $m = strlen($pattern);
 
   $goodSuffixes 	= array();
   $badCharacters 	= array();
 
   goodSuffixes($pattern, &$goodSuffixes);
   badCharacters($pattern, &$badCharacters);
 
   $j = 0;
   while ($j < $n - $m) {
      for ($i = $m - 1; $i >= 0 && $pattern[$i] == $text[$i + $j]; $i--);
      if ($i < 0) {
         // note that if the substring occurs more
         // than once into the text, the algorithm will
         // print out each position of the substring
         echo $j;
         $j += $goodSuffixes[0];
      } else {
         $j += max($goodSuffixes[$i], $badCharacters[$text[$i + $j]] - $m + $i + 1);
      }
   }
}
 
// search using Boyer-Moore
// will return 12 and 38
boyer_moore($pattern, $text);


Application

Boyer-Moore is one of the most used string searching algorithm in practice. It is intuitively clear where it can be useful, but yet again I’ll say only that this algorithm is considered as the mostly used in practice for search and replace operations in text editors.



Published at DZone with permission of Stoimen Popov, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)