Thursday Code Puzzler: Needles In the Haystack
By now you should be familiar with our Thursday Code Puzzler slot. The idea is simple: solve the coding problem as efficiently as you can, in any language or framework that you think is suitable.
Note: Even though there really is nothing stopping you from finding a solution to this on the internet, try to keep honest, and come up with your own answer. It's all about the participation!
Find The Needles in the Haystack
This week it's another classic string based question - to find the number of occurances of a given String (needle) in another (haystack). So the method signature would look like:
public int findNeedleOccurances(String needle, String haystack)
Try to make the solution as efficient as possible, and if you do get to run it post up your timings, or even better, the Big O notation for your solution.






Comments
Chirag Visavadia replied on Thu, 2012/05/17 - 6:03am
public static final int findNeedleOccurances(String needle, String haystack) { final int needleLength = needle.length() - 1; final int haystackLength = haystack.length() - 1; int needleCount = 0; /** * Iterate through entire haystack string */ for (int i = 0; i <= haystackLength; i++) { /** * If not enough char to compare then break : reduce unnecessary * looping */ if (haystackLength - i < needleLength) break; /** * Iterate through entire needle string */ for (int j = 0; j <= needleLength; j++) { /** * If first char and last char not match then break : reduce * unnecessary looping */ if (haystack.charAt(i + j) != needle.charAt(j) || haystack.charAt(i + needleLength) != needle.charAt(needleLength)) break; /** * Avoid to compare last char as we already compare in above * condition */ if (needleLength - 1 == j) { needleCount++; break; } } } return needleCount; }Ian Mcintosh replied on Thu, 2012/05/17 - 8:59am
def findNeedles(needle: String, haystack: String) = {
haystack.sliding(needle.size).filter(_==needle).size
}
Mike P(Okidoky) replied on Thu, 2012/05/17 - 12:05pm
(I see that using Chromium on Linux I now have access to a wysiwyg editor. Although it has bugs, I could work around it.)
For this sort of thing, I prefer operating on arrays directly, because it produces shorter easier faster performing code. Coming from an assmebler background way back when, I know that --a >= 0 is faster than ++a < end, so I made everything go backwards.
I had some other ideas on this that I didn't pursue. Some sort of hashing technique might be possible. If the search is performed repeatedly and the haystack stays the same, it should be possible to create a series of hash codes for each offset, at either all lenghths, or a set of lenghts. Then on a search the hash would be calculated on the needle and then all the candidates could be looked up more quickly, after which point you'd only be comparing char regions on those candidates. Or (and/or) perhaps it might be possible to break things down into some sort of tree/hierarchy.
A repeated search isn't a part of the requirement, so I'm leaving it with a brute force method with a focus on performance. The code would translate to assembler quite nicely as well, if performance is a requirement. In that case I would not want to compare 16 bit characters one by one, but using MMX instructions or larger 64 bit chunks, padded with the left overs.
Another idea just popped up. I could represent the strings differently. Instead of comparing character by character, I could take bit 0 (LSB) from 64 characters and put it in a long. The next long would represent bit 1 of those 64 characters. This would be "clipped" to the limit of what characters are available. Using an "and" instruction I could make it compare up to 64 characters with one significant bit. The chances that all bits 0 of all characters to compare match, almost always only happens when there is a match, probably. So two "ands" and one compare of two longs would probably eliminate an unmatched string. If matched, it proceeds with either a conventional string compare character by character, or do more of these bit-ized comparisons. The code below doesn't do any of this, but I'd like to share my thoughts about this, because I might not get around implementing this.
public static int findNeedleOccurances(String needle, String haystack) { char[] n = needle.toCharArray(), h = haystack.toCharArray(); int nl = needle.length(), count = 0; char ne = n[--nl]; outer: for (int hi = haystack.length(); hi > 0; ) { if (h[--hi] != ne) continue; for (int hc = hi, nc = nl; nc > 0; ) { if (h[--hc] != n[--nc]) continue outer; } count++; } return count; }Vijay Nathani replied on Thu, 2012/05/17 - 11:07am
Groovy is given below
def findNeedleOccurances(String needle, String haystack) {return haystack.count(needle)
}
println findNeedleOccurances("a","baca")
Basically, it is a builtin function in string class itself. So no code is needed.
Mike P(Okidoky) replied on Thu, 2012/05/17 - 11:42am
in response to:
Vijay Nathani
Vijay Nathani replied on Thu, 2012/05/17 - 10:40pm
in response to:
Mike P(Okidoky)
Groovy 1.8.4 in the file "src\main\org\codehaus\groovy\runtime\DefaultGroovyMethods.java" codes this function as:
public static int count(String self, String text) { int answer = 0; for (int idx = 0; true; idx++) { idx = self.indexOf(text, idx); if (idx >= 0) { ++answer; } else { break; } } return answer; }Mike P(Okidoky) replied on Sun, 2012/05/20 - 12:46am
in response to:
Vijay Nathani
That Groovy code has a bug, because haystack="abbabba", needle="abba", results in 1, while in fact it should result in 2. "abba" appears in "abbabba" twice, not once.
Vijay Nathani replied on Sun, 2012/05/20 - 3:31pm
in response to:
Mike P(Okidoky)
The groovy code does not have the bug. I ran it. The following code prints 2, not 1.
#!/usr/bin/env groovy def findNeedleOccurances(String needle, String haystack) { return haystack.count(needle) } println findNeedleOccurances("abba","abbabba")