Mitch Pronschinske is a Senior Content Analyst at DZone. That means he writes and searches for the finest developer content in the land so that you don't have to. He often eats peanut butter and bananas, likes to make his own ringtones, enjoys card and board games, and is married to an underwear model. Mitch is a DZone Zone Leader and has posted 2573 posts at DZone.

Lucene 'MoreLikeThis' Example Code

10.18.2011

We've all seen how important it is for sites like StackOverflow and DZone Links to prevent duplicate questions or links.  Lucene is well equipped to handle this sort of problem with its 'MoreLikeThis' feature.  The following post from Mark Shead's blog gives a good use case and an example of where this feature can be used.

I was recently working on a simple application where the user will enter famous quotations.  Obviously we want to avoid duplicates so I needed a way to check for quotations that were substantially similar before a new quote was added to the database.

The idea was to show the top 5 most similar quotes before letting the user save the new quotation to the db. I used Lucene for this which allowed me to punt on the more difficult task of figuring out if two quotes were similar or not. I left that up to Lucene and only had to worry about how to get my information in and out of Lucene in a usable manner.
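
To make "substantially similar" a little more concrete, here is a minimal sketch of the shape of the problem in plain Java, with no Lucene involved. It ranks stored quotes by simple word overlap (Jaccard similarity), which is a stand-in technique and not how Lucene scores documents. The class and method names (NaiveQuoteSimilarity, tokens, jaccard, mostSimilar) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the job handed to Lucene: rank quotes by word overlap.
final class NaiveQuoteSimilarity {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<String>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: number of shared words divided by number of distinct words.
    static double jaccard(String a, String b) {
        Set<String> wordsA = tokens(a);
        Set<String> wordsB = tokens(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<String>(wordsA);
        shared.retainAll(wordsB);
        Set<String> union = new HashSet<String>(wordsA);
        union.addAll(wordsB);
        return (double) shared.size() / union.size();
    }

    // Return up to 'limit' existing quotes, most similar first, skipping zero overlap.
    static List<String> mostSimilar(String candidate, List<String> existing, int limit) {
        List<String> ranked = new ArrayList<String>();
        for (String quote : existing) {
            if (jaccard(candidate, quote) > 0.0) {
                ranked.add(quote);
            }
        }
        ranked.sort(Comparator.comparingDouble((String q) -> jaccard(candidate, q)).reversed());
        return new ArrayList<String>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList(
                "To be or not to be, that is the question.",
                "I think, therefore I am.",
                "The only thing we have to fear is fear itself.");
        // The Hamlet quote shares the most words with the candidate, so it ranks first.
        System.out.println(mostSimilar("To be, or not to be", existing, 5));
    }
}
```

A naive measure like this treats every word as equally important and has no notion of rare versus common terms, which is a good part of why it is attractive to leave the comparison to Lucene instead.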

Below is the interesting method, which uses Lucene to build an index of all the quotes in the system and then returns the five quotes that are most similar to the new quote text.  Obviously, creating a new index each time a quote is added isn’t particularly efficient, but it makes the example easier to follow, and processor efficiency isn’t much of an issue for this particular task.

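import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

//This method lives in a class that already has the fields
//quote (the new Quote), session (a Hibernate Session) and logger.
public List<Quote> getSimilarQuotes() throws CorruptIndexException, IOException {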
 
    String quoteText = quote.getText();
    logger.info("creating RAMDirectory");
    RAMDirectory idx = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    IndexWriter writer = new IndexWriter(idx, indexWriterConfig);
 
    List<Quote> quotes =  session.createCriteria(Quote.class).list();
 
    //Create a Lucene document for each quote and add it to the
    //RAMDirectory index.  We store the db id so we can retrieve the
    //matching Quote objects before returning them to the client.
    //The loop variable is named existingQuote so it does not shadow
    //the quote field that holds the new quotation.
    for (Quote existingQuote : quotes) {
        Document doc = new Document();
        doc.add(new Field("contents", existingQuote.getText(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("id", existingQuote.getId().toString(), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
 
    //We are done writing documents to the index at this point
    writer.close();
 
    //Open the index and search through the same reader that
    //MoreLikeThis uses, instead of opening a second one
    IndexReader ir = IndexReader.open(idx);
    logger.info("ir has " + ir.numDocs() + " docs in it");
    IndexSearcher is = new IndexSearcher(ir);
 
    MoreLikeThis mlt = new MoreLikeThis(ir);
 
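    //Compare only the quote text, and analyze the new quote the same
    //way the indexed quotes were analyzed
    mlt.setFieldNames(new String[] {"contents"});
    mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_31));

    //Lower some settings so MoreLikeThis will work with very short
    //quotations
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);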
 
    //We need a Reader to create the Query so we'll create one
    //using the string quoteText.
    Reader reader = new StringReader(quoteText);
 
    //Create the query that we can then use to search the index
    Query query = mlt.like(reader);
 
    //Search the index using the query and get the top 5 results
    TopDocs topDocs = is.search(query,5);
    logger.info("found " + topDocs.totalHits + " topDocs");
 
    //Create a list to hold the quotes we are going to
    //pass back to the client
    List<Quote> foundQuotes = new ArrayList<Quote>();
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        //This retrieves the actual Document from the index using
        //the document number (scoreDoc.doc is an int that is
        //Lucene's internal id for the doc, not our db id)
        Document doc = is.doc(scoreDoc.doc);
 
        //Get the db id that we previously stored in the document
        //and parse it back to a long.
        String idField = doc.get("id");
        long id = Long.parseLong(idField);
 
        //Retrieve the quote from Hibernate so we can pass
        //back a list of actual Quote objects.
        Quote thisQuote = (Quote) session.get(Quote.class, id);
 
        //Add the quote to the list we'll pass back to the client
        foundQuotes.add(thisQuote);
    }
 
    //Release the searcher and reader now that we have our results
    is.close();
    ir.close();

    return foundQuotes;
}
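
The two lowered settings in the method are easy to skim past, so here is a rough sketch of why they matter for short quotes. It is a simplified model in plain Java, not Lucene's actual code: a term is kept for the query only if it occurs at least minTermFreq times in the new text and appears in at least minDocFreq indexed documents. The class name, method names and the document-frequency numbers are all made up for illustration. As far as I know, Lucene 3.x defaults to a minimum term frequency of 2 and a minimum document frequency of 5, which is the strict case shown first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model of MoreLikeThis-style term filtering; not Lucene code.
final class TermThresholdSketch {

    // Count how often each lowercase word occurs in the text (sorted by word).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Integer previous = counts.get(term);
            counts.put(term, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    // Keep only the terms that meet both the term-frequency and doc-frequency thresholds.
    static List<String> queryTerms(String text, Map<String, Integer> docFreq,
                                   int minTermFreq, int minDocFreq) {
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : termFrequencies(text).entrySet()) {
            Integer found = docFreq.get(entry.getKey());
            int documentsWithTerm = found == null ? 0 : found;
            if (entry.getValue() >= minTermFreq && documentsWithTerm >= minDocFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for a small quote index.
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        docFreq.put("fear", 2);
        docFreq.put("itself", 1);
        docFreq.put("nothing", 3);

        String newQuote = "Nothing to fear but fear itself";
        System.out.println(queryTerms(newQuote, docFreq, 2, 5)); // prints []
        System.out.println(queryTerms(newQuote, docFreq, 1, 1)); // prints [fear, itself, nothing]
    }
}
```

With the strict thresholds nothing survives, so the resulting query would match nothing; with both set to 1, every word that exists in the index can contribute, which is what a one-sentence quotation needs.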

Comments

Matthew Schmidt replied on Tue, 2011/10/18 - 8:35am

Great article. Does anyone know how this handles documents that have a lot of text? We found that if you try to pass in say the body of a question as the query, Lucene has some serious issues when just searching. Does MoreLikeThis do something different?

Robert Craft replied on Thu, 2012/01/26 - 5:56am

Hi Matthew, in my experience Lucene has no such issue. It is based on indexing and will build an index for all documents. Building the index might take some time, but it is a one-time process, and searching the index is really fast.


Pradyumna Dandwate replied on Wed, 2012/02/15 - 3:10am

Hey Mitch,

Awesome post!

Just a small suggestion: when adding the "id" field to the document, you don't need to mark it as ANALYZED. Leaving it unanalyzed may help improve performance.

Thanks for writing!

Pradyumna
