Software developer and frequent open-source contributor. Writing mostly for .NET, but also Java and C/C++. Really likes fiddling with data, texts especially, so he frequently finds himself working on databases or search engines, usually combining both. Itamar is a DZone MVB and is not an employee of DZone.

FastVectorHighlighter Issues Revisited

06.28.2012

In a previous post I described how to use FVH (FastVectorHighlighter) to highlight content that passes through filters or readers such as HTMLStripCharFilter during analysis. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew that any CharFilter or Tokenizer implementation is supposed to store term positions and offsets that account for whatever it skips in the content, but since that didn't work for me I didn't bother to look any deeper, came up with that workaround, and ran to tell everyone.

So, don't use that. Instead, rely on your analyzer to store positions and offsets, and on FVH to use them correctly when highlighting. As it happens, the custom analyzers I used suffered from a nasty bug that kept them from taking skipped content into account. Now that I've fixed that, it all works like a charm.

However, two issues remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases a fragment that ends up on your webpage will ruin the page layout because a stubborn, misplaced </div> tag found its way into it. Escaping all <'s and >'s is not a good solution either: you don't want your fragments to show ugly-looking HTML tags.

The second issue was duplicate content. I wanted to process the content more than once, indexing it with two or more analyzers, but didn't want to store it more than once since it was exactly the same content. To still be able to highlight on those other fields, I needed FVH to let me specify the name of the field to pull the stored content from.

Solving the first problem was quite easy and required nothing more than a simple extension method, called on the fragment string after receiving it from FVH. To be on the safe side, I ask FVH for a larger fragment than I actually need, so even if a lot of HTML noise is present, some context will remain in the fragment:

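using System.Text;

// Extension methods must live in a static class; the class name here is arbitrary.
public static class FragmentExtensions
{
    public static string HtmlStripFragment(this string fragment)
    {
        if (string.IsNullOrEmpty(fragment)) return string.Empty;

        var sb = new StringBuilder(fragment.Length);
        // "first" stays true until the first '<' or '>' is seen
        bool withinHtml = false, first = true;
        foreach (var c in fragment)
        {
            if (c == '>')
            {
                // A '>' with no preceding '<' means the fragment started inside a tag,
                // so everything collected so far is tag residue and is discarded
                if (first) sb.Length = 0;
                withinHtml = false;
                first = false;
                continue;
            }
            if (withinHtml)
                continue; // skip everything inside a tag
            if (c == '<')
            {
                first = false;
                withinHtml = true;
                continue;
            }
            sb.Append(c);
        }

        // FVH was instantiated with "[b]" and "[/b]" as pre- and post-tags for highlighting,
        // so they won't get lost in translation
        return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
    }
}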
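
As a quick sanity check, here is roughly what the method does with a fragment that both starts and ends in the middle of a tag. The input string and the demo class name are made up for illustration, and the snippet assumes HtmlStripFragment from above is compiled into the same project:

```csharp
using System;

public static class HtmlStripFragmentDemo
{
    public static void Main()
    {
        // Hypothetical fragment: it begins inside an opening tag and ends inside another one
        var raw = "class=\"x\">Some [b]text[/b] here</div><p cl";

        // The leading tag residue, the </div> and the trailing partial tag are all dropped,
        // and the [b] placeholders are turned into real <b> tags
        Console.WriteLine(raw.HtmlStripFragment());
        // prints: Some <b>text</b> here...
    }
}
```

Note that a literal '<' or '>' in the visible text is treated as tag markup too, so this is only suitable for content where those characters are entity-encoded.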

The second issue was solved by subclassing BaseFragmentsBuilder (FVH's base FragmentsBuilder implementation), only this time it was a bit less intrusive:

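using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search.Vectorhighlight;
using WeightedFragInfo = Lucene.Net.Search.Vectorhighlight.FieldFragList.WeightedFragInfo;

public class CustomFragmentsBuilder : BaseFragmentsBuilder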
{
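    /// <summary>
    /// Name of the stored field to read content from; when null, the highlighted field itself is used.
    /// </summary>
    public string ContentFieldName { get; protected set; }

    /// <summary>
    /// Creates a builder with the default tags that reads stored content from the highlighted field.
    /// </summary>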
    public CustomFragmentsBuilder()
    {
    }
 
    public CustomFragmentsBuilder(string contentFieldName)
    {
        ContentFieldName = contentFieldName;
    }
 
    /// <summary>
    /// Creates a builder with custom tags that reads stored content from the highlighted field.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public CustomFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }
 
    public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
        ContentFieldName = contentFieldName;
    }
 
    /// <summary>
    /// Keeps the fragments in the order they were found: returns the source list unchanged.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }
 
    protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
    {
        var field = ContentFieldName ?? fieldName;
        var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
        return doc.GetFields(field); // according to Document class javadoc, this never returns null
    }
}
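
To show how the two pieces might fit together, here is a rough sketch rather than the exact code I run. It assumes the Lucene.Net contrib FastVectorHighlighter API (the Lucene.Net.Search.Vectorhighlight namespace, as in 3.0.3). The field names "Content" and "ContentAlt", the class name and the fragment size of 300 are made up for illustration. Giving each field its own analyzer is left out; that would be configured on the IndexWriter, for example with a PerFieldAnalyzerWrapper:

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Vectorhighlight;

public static class HighlightingSketch
{
    // Index time: both fields receive the same HTML, but only "Content" stores it.
    // FVH needs term vectors with positions and offsets on every field it highlights.
    public static Document BuildDocument(string html)
    {
        var doc = new Document();
        doc.Add(new Field("Content", html, Field.Store.YES, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.Add(new Field("ContentAlt", html, Field.Store.NO, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        return doc;
    }

    // Search time: highlight a hit on "ContentAlt" using the text stored in "Content".
    public static string Highlight(IndexReader reader, Query query, int docId)
    {
        var fvh = new FastVectorHighlighter(
            true,  // phraseHighlight
            true,  // fieldMatch
            new SimpleFragListBuilder(),
            new CustomFragmentsBuilder("Content", new[] { "[b]" }, new[] { "[/b]" }));

        var fieldQuery = fvh.GetFieldQuery(query);

        // Ask for more characters than strictly needed; the HTML noise is stripped afterwards
        var fragment = fvh.GetBestFragment(fieldQuery, reader, docId, "ContentAlt", 300);

        // Safe even when no fragment was found: the extension method maps null to an empty string
        return fragment.HtmlStripFragment();
    }
}
```

This only works because both fields are built from exactly the same text, so the offsets recorded for "ContentAlt" still line up with the content stored under "Content".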

And as always, the usual disclaimer applies: this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve it, if any exist.

Published at DZone with permission of Itamar Syn-hershko, author and DZone MVB. (source)
