
Mitch Pronschinske is a Senior Content Analyst at DZone. That means he writes and searches for the finest developer content in the land so that you don't have to. He often eats peanut butter and bananas, likes to make his own ringtones, enjoys card and board games, and is married to an underwear model. Mitch is a DZone Zone Leader and has written 2575 posts at DZone.

Configuring Mahout Clustering Jobs


Configuring Mahout Clustering Jobs, Frank Scholten, JTeam, Eurocon 2011 from Lucene Revolution on Vimeo.

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific and instead want a summary of a large document collection, or want to be surprised? One solution is document clustering: clustering algorithms group documents that have similar content. Real-life examples include the clustered search results of Google News and tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows, step by step, how to configure Mahout clustering jobs to create a tag cloud from a document collection. It is suitable for people who have some experience with Hadoop and perhaps Mahout; knowledge of clustering is not required.

Topics include

  • Clustering introduction
  • Clustering in Mahout
  • Text pre-processing & analysis
  • Tag cloud demo
  • Tips & tricks
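
To make the idea of grouping documents by similar content concrete, here is a toy sketch in plain Python. It is not Mahout's API and not the configuration shown in the talk: the function names (`tf_vector`, `cosine`, `kmeans`, `top_terms`), the sample documents, and the choice of k-means with cosine similarity on raw term counts are all assumptions made for illustration. The highest-weighted terms of each cluster centroid stand in for the shared labels a tag cloud would display.

```python
import math
from collections import Counter


def tf_vector(text):
    """Turn a document into a sparse bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors into one centroid."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts term by term
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, seed_indices, iterations=10):
    """Toy k-means: the documents at seed_indices are the initial centroids."""
    vectors = [tf_vector(d) for d in docs]
    centroids = [dict(vectors[i]) for i in seed_indices]
    assignment = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment, centroids


def top_terms(centroid, n=3):
    """Highest-weighted terms of a centroid, ties broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical sample collection, invented for this sketch.
DOCS = [
    "hadoop cluster runs mapreduce jobs",
    "mapreduce jobs scale on a hadoop cluster",
    "search engines rank documents for a query",
    "a query returns ranked documents from search engines",
]

if __name__ == "__main__":
    assignment, centroids = kmeans(DOCS, seed_indices=[0, 2])
    for cluster_id, centroid in enumerate(centroids):
        print(cluster_id, top_terms(centroid))
```

A real pipeline would add the text pre-processing the talk covers (tokenization, stop-word removal, term weighting) before clustering; this sketch skips all of that and splits on whitespace only.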

Download session slides.


Ioan Eugen Stan replied on Fri, 2012/01/20 - 7:56am

Nice talk, you have some interesting ideas there.
