Mitch Pronschinske is the Lead Research Analyst at DZone. Researching and compiling content for DZone's research guides is his primary job. He likes to make his own ringtones, watches cartoons/anime, enjoys card and board games, and plays the accordion. Mitch is a DZone Zone Leader and has posted 2576 posts at DZone.

Apache Mahout Tackles A.I.

Artificial intelligence is a term frequently associated with science fiction, not software development. However, A.I. is becoming increasingly viable as a business tool. In development, A.I. is more commonly referred to as "machine learning". Writing a machine learning system can be very profitable if it's done well. In September, a team of developers won $1 million for building a better movie recommendation engine for Netflix. FlightCaster received strong praise for its flight delay prediction system. These types of machine learning systems are not easy to build, but this year, Apache started working on a new project that would provide the tools needed for building a scalable machine learning system. The project is named Apache Mahout, and it recently released version 0.2, which is the first usable release.

Mahout was started by the developers of the Apache Lucene project. The name Mahout comes from the Hindi word for an elephant driver. The term was chosen because of Mahout's association with Apache Hadoop, which has an elephant logo. Earlier this year, the Lucene developers decided to create machine learning libraries and algorithms on top of Apache's data systems, such as Hadoop. The goal of Mahout is to create a scalable machine learning solution with a commercially friendly license and an active community.

Mahout currently supports four use cases:
  • Clustering - takes text documents and groups them by similar topic.
  • Classification - assigns categories to an unlabeled document by learning from existing categorized documents.
  • Frequent itemset mining - identifies items that frequently appear together in item groups such as shopping cart contents.
  • Recommendation mining - learns from user behavior and recommends related items; this is the same use case as the Netflix movie recommendation engine.
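
To make the frequent itemset use case more concrete, here is a minimal Python sketch of the idea. It is not Mahout code and does not use Mahout's API; it simply counts how often pairs of items show up in the same cart, which is a brute-force stand-in for the parallel FP-Growth approach Mahout uses at scale. The function name `frequent_pairs`, the `min_support` parameter, and the sample carts are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(carts, min_support):
    """Return item pairs that occur together in at least min_support carts."""
    counts = Counter()
    for cart in carts:
        # Deduplicate and sort so each pair is counted once per cart
        # and always appears in the same order.
        for pair in combinations(sorted(set(cart)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping carts, invented for this example.
carts = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

print(frequent_pairs(carts, min_support=2))
# prints {('bread', 'butter'): 3}
```

Counting every pair like this gets expensive quickly as carts and catalogs grow, which is the kind of scaling problem a Hadoop-based library like Mahout is meant to address.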

Version 0.2 of Mahout includes updates for Hadoop 0.20.x and cleaner code. The contributors have made API changes and performance enhancements to the collaborative filtering engine. Other highlights include k-nearest-neighbor and SVD recommenders, Latent Dirichlet Allocation, random forests, and frequent pattern mining using parallel FP-Growth.

Mahout is still in its early stages, but the 0.2 version is a first step toward easier creation of machine learning systems. When Mahout's machine learning libraries and algorithms become more mature, developers may not have to start from scratch the way the Netflix prize winner did.


Mark Unknown replied on Mon, 2009/11/30 - 2:45pm

I am looking into a potential project where the client wants to take documents and "automatically" extract "concepts" from them.  Previously they have been doing "key word" searches.  I know this is not much detail (I am not really sure what the client really expects) but do you think Mahout could be used for this?  I am looking at a few other Java AI type APIs too.

Mitch Pronschinske replied on Mon, 2009/11/30 - 11:27pm in response to: Mark Unknown

I would contact Grant Ingersoll, the Lucene PMC chair.  He's a very friendly guy and would probably know the capabilities of Mahout at this stage.

Mark Unknown replied on Tue, 2009/12/01 - 7:00pm in response to:

Thx mitch!
