I've been working on web based projects built mainly with PHP and JavaScript, where I mostly use Zend Framework and jQuery. I am interested in any webpage optimizations techniques - for a faster web! Stoimen is a DZone MVB and is not an employee of DZone and has posted 96 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Morris-Pratt String Searching

04.11.2012
| 25265 views |
  • submit to reddit

We saw that neither brute force string searching nor Rabin-Karp string searching are effective. However in order to improve some algorithm, first we need to understand its principles in detail. We know already that brute force string matching is slow and we tried to improve it somehow by using a hash function in the Rabin-Karp algorithm. The problem is that Rabin-Karp has the same complexity as brute force string matching, which is O(mn).

Obviously we need a different approach, but to come with a different approach let’s see what’s wrong with brute force string searching. By taking a closer look at its principles we can answer the question.

In brute force matching we checked each character of the text with the first character of the pattern. In case of a match we shifted the comparison between the second character of the pattern and the next character of the text. The problem is that in case of a mismatch we must go several positions back in the text. Well in fact this technique can’t be optimized.


As you can see in the picture above, the problem is that once there is a mismatch we must rollback and start comparing from a position in the text that has been explored already. In our case we have checked the first, second, third and fourth letters, where there is a mismatch between the pattern and the text and then … we go back and start comparing from the second letter of the text.

This is completely useless, because we already know that the pattern begins with the letter “a” and no such letter happens to be between positions 1 and 3. So how can we improve this redundancy?

Overview

The answer to that question came to James H. Morris and Vaughan Pratt in 1977 when they described their algorithm, which, by skipping lots of useless comparisons, is more effective than brute force string matching. Let’s see it in detail. The only thing is to use the information gathered during the comparisons of the pattern and a possible match, as in the picture below.


To do that, first we have to preprocess the pattern in order to get possible positions for next matches. Thus, after we start to find a possible match in case of a mismatch we’ll know exactly where we should jump in order to skip unusual comparisons.

Generating the Table of Next Positions

This is the tricky part in Morris-Pratt and that is how this algorithm overcomes the disadvantages of brute force string searching. Let’s see some pictures.


However, in case of a repeating character in the pattern, if we have a mismatch after that character a possible match must begin from this repeating character, as in the picture bellow.


Finally, if there are more than one repeating character in the text, the “next” table will show their position.


After we have this table of possible “next” positions we can start exploring the text for our pattern.


Implementation

Implementing Morris-Pratt isn’t difficult. First we have to preprocess the pattern and then perform the search. The following PHP code shows you how to do that.

/**
 * Pattern
 * 
 * @var string
 */
$pattern = 'mollis';
 
/**
 * Text to search
 * 
 * @var string
 */
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque eleifend nisi viverra ipsum elementum porttitor quis at justo. Aliquam ligula felis, dignissim sit amet lobortis eget, lacinia ac augue. Quisque nec est elit, nec ultricies magna. Ut mi libero, dictum sit amet mollis non, aliquam et augue';
 
/**
 * Preprocess the pattern and return the "next" table
 * 
 * @param string $pattern
 */
function preprocessMorrisPratt($pattern, &$nextTable)
{
	$i = 0;
	$j = $nextTable[0] = -1;
	$len = strlen($pattern);
 
	while ($i < $len) {
		while ($j > -1 && $pattern[$i] != $pattern[$j]) {
			$j = $nextTable[$j];
		}
 
		$nextTable[++$i] = ++$j;
	}
}
 
/**
 * Performs a string search with the Morris-Pratt algorithm
 * 
 * @param string $text
 * @param string $pattern
 */
function MorrisPratt($text, $pattern)
{
	// get the text and pattern lengths
	$n = strlen($text);
	$m = strlen($pattern);
	$nextTable = array();
 
	// calculate the next table
	preprocessMorrisPratt($pattern, $nextTable);
 
	$i = $j = 0;
	while ($j < $n) {
		while ($i > -1 && $pattern[$i] != $text[$j]) {
			$i = $nextTable[$i];
		}
		$i++;
		$j++;
		if ($i >= $m) {
			return $j - $i;
		}
	}
	return -1;
}
 
// 275
echo MorrisPratt($text, $pattern);

Complexity

This algorithm needs some time and space for preprocessing. Thus the preprocess of the pattern can be done in O(m), where m is the length of the pattern, while the search itself needs O(m+n). The good news is that you can do the preprocess only once and then perform the search as many times as you wish!

The following chart shows the complexity O(n+m) compared with O(nm) for 5 letter patterns.


Application


Why it’s cool

  1. It's searching complexity is O(m+n) which is faster than brute force and Rabin-Karp
  2. It’s fairly easy to implement


Why it isn’t cool

  1. It needs additional space and time – O(m) for pre-processing
  2. It can be optimized a bit (Knuth-Morris-Pratt)


Final Words

Obviously this algorithm is quite useful because it improves in some very elegant manner the brute force matching. On the other hand you must know that there are faster string searching algorithms like the Boyer-Moore algorithm. However the Morris-Pratt algorithm can be quite useful in many cases, so understanding its principles can be very handy.

 

Published at DZone with permission of Stoimen Popov, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)