The amount of data available to anyone with a computer and an internet connection is difficult to comprehend, and it is growing daily. At the same time, the cost of the hardware needed to store large quantities of that data keeps dropping. But while the availability of data and the means to store it may be fading concerns, the ability to sort it, find what's relevant, and manipulate it effectively is proving to be a formidable challenge.
The Apache Mahout project, a sub-project of Lucene, is designed to tackle that challenge head-on. Mahout is a toolset that provides scalable machine-learning algorithms for developers building applications that digest massive amounts of data. If you'd like to know more, there are a couple of helpful talks worth checking out by Isabel Drost, software engineer at Nokia Berlin and co-founder of Mahout.
First, you might want to watch Drost's presentation at the 2010 Devoxx conference. It runs about half an hour and provides a helpful introduction to the problems that Mahout is trying to solve. Drost covers the basics of machine learning and the kinds of applications where it works well. Turning to Mahout itself, she again offers a useful overview and then introduces each of the algorithms that were available at the time.
More recently, Drost presented on Mahout at ApacheCon NA 2011. This time she follows up her introduction to Mahout with a discussion of integrating it into your own applications. You'll also hear about changes recently implemented in Mahout, including the addition of several algorithms, performance improvements, and better APIs for integration.
Check it out and see what you think!