I've been working on web based projects built mainly with PHP and JavaScript, where I mostly use Zend Framework and jQuery. I am interested in any webpage optimizations techniques - for a faster web! Stoimen is a DZone MVB and is not an employee of DZone and has posted 96 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Rabin-Karp String Searching

04.03.2012
| 20962 views |
  • submit to reddit

Brute force string matching is a very basic sub-string matching algorithm, but it’s good for some reasons. For example it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow. That is why in many cases brute force matching can’t be very useful. For pattern matching we need something faster, but to understand other sub-string matching algorithms let’s take a look once again at brute force matching.

In brute force sub-string matching we checked every single character from the text with the first character of the pattern. Once we have a match between them we shift the comparison between the second character of the pattern with the next character of the text, as shown on the picture below.




This algorithm is slow for mainly two reasons. First, we have to check every single character from the text. On the other hand even if we find a match between a text character and the first character of the pattern we continue to check step by step (character by character) every single symbol of the pattern in order to find whether it is in the text. So is there any other approach to find whether the text contains the pattern?

In fact there is a “faster” approach. In this case, in order to avoid the comparison between the pattern and the text character by character, we’ll try to compare them all at once, so we need a good hash function. With its help we can hash the pattern and check against hashed sub-strings of the text. We must be sure that the hash function is returning “small” hash codes for larger sub-strings. Another problem is that for larger patterns we can’t expect to have short hashes. But besides this the approach should be quite effective compared to the brute force string matching.

This approach is known as Rabin-Karp algorithm.

Overview

Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and to check it against a hashed sub-string from the text in 1987. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. Said hash function, for instance, may use the ASCII codes for every character, but we must be careful for multi-lingual support.




The hash function may vary depending on many things, so it may consist of ASCII char to number converting, but it can also be anything else. The only thing we need is to convert a string (pattern) into some hash that is faster to compare. Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”.  So in every step, we take from the text a sub-string with the length of m, where m is the pattern length. Thus we hash this sub-string and we can directly compare it to the hashed pattern, as in the picture above.

Implementation

So far we saw some diagrams explaining the Rabin-Karp algorithm, but let’s take a look at its implementation here, in this very basic example where a simple hash table is used in order to convert the characters into integers. The code is PHP and it’s used only to illustrate the principles of this algorithm.

function hash_string($str, $len)
{
	$hash = '';
 
	$hash_table = array(
		'h' => 1,
		'e' => 2,
		'l' => 3,
		'o' => 4,
		'w' => 5,
		'r' => 6,
		'd' => 7,
	);
 
	for ($i = 0; $i < $len; $i++) {
		$hash .= $hash_table[$str{$i}];
	}
 
	return (int)$hash;
}
 
function rabin_karp($text, $pattern)
{
	$n = strlen($text);
	$m = strlen($pattern);
 
	$text_hash = hash_string(substr($text, 0, $m), $m);
	$pattern_hash = hash_string($pattern, $m);
 
	for ($i = 0; $i < $n-$m+1; $i++) {
		if ($text_hash == $pattern_hash) {
			return $i;
		}
 
		$text_hash = hash_string(substr($text, $i, $m), $m);
	}
 
	return -1;
}
 
// 2
echo rabin_karp('hello world', 'ello');

Multiple Pattern Match

It’s great to say that the Rabin-Karp algorithm is great for multiple pattern match. Indeed its nature is supposed to support such functionality, which is its advantage in comparison to other string searching algorithms.

Complexity

The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. So where is it compared to brute-force matching? Well, brute force matching complexity is O(nm), so as it seems there’s not much of a gain in performance. However, it’s considered that Rabin-Karp’s complexity is O(n+m) in practice, and that makes it a bit faster, as shown on the chart below.




Note that the Rabin-Karp algorithm also needs O(m) preprocessing time.

Application

As we saw Rabin-Karp is not much faster than brute force matching. So where we should use it?

3 Reasons Why Rabin-Karp is Cool

1. Good for plagiarism, because it can deal with multiple pattern matching!





2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)!

3. With a good hashing function it can be quite effective and it’s easy to implement!

2 Reasons Why Rabin-Karp is Not Cool

1. There are lots of string matching algorithms that are faster than O(n+m)

2. It’s practically as slow as brute force matching and it requires additional space

Final Words

Rabin-Karp is a great algorithm for one simple reason – it can be used to match against multiple patterns. This makes it perfect to detect plagiarism even for larger phrases.

Published at DZone with permission of Stoimen Popov, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Yaron Levy replied on Sun, 2012/06/10 - 10:10am

This article is incorrect in two important ways:

(1) The hash function provided is O(m), where m is the length of the pattern, and therefore the algorithm, as presented, is O(mn).

(2) The algorithm returns any string with the same hash as the pattern.

Students who might otherwise be confused by this article are invited to check out the Wikipedia article on the technique; its coverage is much more accurate.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.