
Mitch Pronschinske is the Lead Research Analyst at DZone. Researching and compiling content for DZone's research guides is his primary job. He likes to make his own ringtones, watches cartoons/anime, enjoys card and board games, and plays the accordion. Mitch is a DZone Zone Leader and has posted 2578 posts at DZone.

Configuring Mahout Clustering Jobs

Configuring Mahout Clustering Jobs, Frank Scholten, JTeam, Eurocon 2011 from Lucene Revolution on Vimeo.

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific, and instead want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering include the clustered search results of Google News and tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. It is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
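To make the idea of grouping documents by similar content concrete, here is a minimal sketch in plain Python. It is not Mahout's API and not necessarily the method shown in the talk: the sample documents, the function names (`tokenize`, `cosine`, `kmeans`, `top_terms`), the fixed seed choice, and the use of raw term counts with cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Very naive pre-processing: lowercase, split on whitespace, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    # Cosine similarity between two sparse term -> weight mappings.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    # Average the term weights of all vectors in a cluster.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def kmeans(vectors, seeds, iterations=10):
    # seeds: indices of the documents used as initial centroids (hypothetical choice).
    centroids = [dict(vectors[i]) for i in seeds]
    assignment = None
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new_assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    return assignment, centroids


def top_terms(center, n=3):
    # Heaviest terms of a centroid, usable as a crude tag-cloud label.
    ranked = sorted(center.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


# Made-up sample documents, for illustration only.
docs = [
    "hadoop cluster runs mapreduce jobs",
    "mapreduce jobs scale on hadoop",
    "accordion music for folk dance",
    "folk dance music with accordion",
]
vectors = [Counter(tokenize(d)) for d in docs]
assignment, centroids = kmeans(vectors, seeds=[0, 2])
labels = [top_terms(c) for c in centroids]
```

On this toy input the first two documents end up in one cluster and the last two in the other, and the heaviest centroid terms serve as a rough label for each group. A real pipeline would add proper text analysis (stop words, stemming, TF-IDF weighting) before clustering.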

Topics include:

  • Clustering introduction
  • Clustering in Mahout
  • Text pre-processing & analysis
  • Tag cloud demo
  • Tips & tricks

Download session slides.


Ioan Eugen Stan replied on Fri, 2012/01/20 - 7:56am

Nice talk, you have some interesting ideas there.
