Mahout, HBase Among Six Apache Graduates

By Mitch Pronschinske, Senior Content Analyst and Zone Leader at DZone
DZone NoSQL Zone, May 4, 2010

Today the Apache Software Foundation graduated five sub-projects and one incubator project en masse to Top-Level Projects. Mahout, Nutch, Tika, Avro, and HBase were the five graduated sub-projects, and the recently donated Apache Traffic Server was promoted from the Apache Incubator. Each of the sub-projects has now been given autonomy as a fully endorsed, standalone project because it has proven that it can be self-governed.

Traffic Server
Yahoo donated Apache Traffic Server, a high-performance HTTP/1.1 caching proxy server, around five months ago, in November 2009. It's no surprise that Traffic Server made it to the top level in only five months. Yahoo has used the server since 2002 to serve about 400 TB of data per day! The developers say that the software can handle more than 75,000 requests per second per server. The project committers plan to build native IPv6 support, full 64-bit support, and support for non-Linux Unix systems.

Mahout
The Mahout project is still below version 1.0, but it has already gained a good deal of interest from developers of data-driven applications. The project supplies a collection of scalable machine-learning (AI) algorithm implementations, including clustering, collaborative filtering, classification, feature reduction, and data mining algorithms. These implementations are built on top of Hadoop, Apache's MapReduce framework. Mahout had been a sub-project of Apache Lucene since 2008.
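
To give a rough feel for what building on MapReduce means for an algorithm like clustering, here is a minimal Python sketch of one k-means iteration written as a map step and a reduce step. It is not Mahout code and it calls no Hadoop API: `map_phase`, `reduce_phase`, `kmeans_step`, and the sample points are all invented for this illustration, and k-means is used only as a familiar example of a clustering algorithm.

```python
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for each input point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reduce_phase(pairs):
    """Reduce: average all points that share a key to get that cluster's new centroid.

    The grouping by key that a MapReduce framework performs between the two
    phases is done by hand here. A centroid that attracts no points is simply
    dropped in this toy version.
    """
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    return [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]

def kmeans_step(points, centroids):
    """One k-means iteration expressed as a map step followed by a reduce step."""
    return reduce_phase(map_phase(points, centroids))

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical one-dimensional data
step1 = kmeans_step(points, [0.0, 5.0])      # [1.5, 9.0]
step2 = kmeans_step(points, step1)           # [2.0, 11.0]
step3 = kmeans_step(points, step2)           # [2.0, 11.0] again: converged
```

On a cluster, the map calls would run in parallel across the input and the framework would group the emitted pairs by key before the reduce step, which is what lets this shape of algorithm scale past a single machine.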

Tika
Tika is a lightweight, embeddable toolkit for advanced language detection and analysis. Tika uses MIME standards and provides rapid unification of existing parser libraries. It had been a Lucene sub-project since 2008 and is used in many Lucene projects, including Nutch, Mahout, and Solr. Tika is also used by NASA, Day Software, and the Internet Archive.
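
The idea of using MIME types to route documents to existing parsers can be sketched in a few lines. The Python below is only an illustration of that pattern, not Tika's API (Tika itself is a Java toolkit): it leans on Python's standard `mimetypes` and `html.parser` modules as stand-ins, and `PARSERS`, `parse_html`, `parse_plain`, and `extract_text` are hypothetical names. Detection here looks only at the file name, which is a simplification.

```python
import mimetypes
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def parse_html(raw):
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8"))
    collector.close()
    return " ".join(chunk.strip() for chunk in collector.chunks if chunk.strip())

def parse_plain(raw):
    return raw.decode("utf-8").strip()

# Registry of "existing parsers", keyed by MIME type (assumed for this sketch).
PARSERS = {"text/html": parse_html, "text/plain": parse_plain}

def extract_text(filename, raw):
    """Single entry point: guess the MIME type, then delegate to the matching parser."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    parser = PARSERS.get(mime_type)
    if parser is None:
        raise ValueError(f"no parser registered for {mime_type!r}")
    return mime_type, parser(raw)

html_result = extract_text("news.html", b"<h1>Tika</h1><p>is now a top-level project</p>")
text_result = extract_text("notes.txt", b"Tika is now a top-level project\n")
```

Both calls go through the same function and come back with the same extracted text, even though the inputs are in different formats. Supporting another format would mean registering one more parser under its MIME type.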

Nutch
Nutch is a modular web search engine with web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Nutch lets developers create plugins for things like querying, clustering, data retrieval, and media-type parsing. After a 100-million-page demo system was created with Nutch, the project graduated from the incubator to the Lucene project in 2005.
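
As a loose illustration of how a crawler, an HTML parser, and a link graph fit together, the Python sketch below crawls a tiny hard-coded "web" and records which pages link to which. It is not Nutch code: the `PAGES` dictionary stands in for real HTTP fetches, and `LinkParser`, `crawl`, and `inlinks` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page "web" so the sketch needs no network access.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed):
    """Breadth-first crawl from seed; returns {url: [outlinks]} for each fetched page."""
    graph, frontier = {}, [seed]
    while frontier:
        url = frontier.pop(0)
        if url in graph or url not in PAGES:
            continue  # already fetched, or not part of the toy web
        parser = LinkParser(url)
        parser.feed(PAGES[url])
        graph[url] = parser.links
        frontier.extend(parser.links)
    return graph

def inlinks(graph):
    """Invert the graph: {url: [pages that link to it]}."""
    inverted = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted

graph = crawl("http://example.org/")
incoming = inlinks(graph)
# incoming["http://example.org/b.html"] lists the home page and a.html
```

A real crawler would also need to fetch over the network, respect robots.txt, and store the graph durably. The sketch leaves all of that out.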

Avro
Avro is a fast data serialization system that uses rich, dynamic schemas in its processing. It has a compact binary data format with features for persistence, remote procedure calls, and simple integration with dynamic languages. The Avro project was formerly a sub-project of Apache Hadoop.
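
To show why a schema makes a compact binary format possible, here is a minimal Python sketch based on the binary encoding rules in the Avro specification: integers are written as variable-length zig-zag values, strings as a length followed by UTF-8 bytes, and a record as its field values in schema order, with no field names. It is hand-rolled for illustration and does not use the Avro library. `SCHEMA`, `encode_long`, `encode_string`, and `encode_record` are hypothetical names, and only the `string` and `long` types are handled.

```python
import json

# Hypothetical record schema for this sketch, laid out the way Avro schemas are written in JSON.
SCHEMA = {
    "type": "record",
    "name": "Project",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "since", "type": "long"},
    ],
}

def encode_long(n):
    """Zig-zag encode a signed 64-bit integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)               # maps 0, -1, 1, -2, 2 to 0, 1, 2, 3, 4
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)      # low 7 bits, with the "more bytes follow" bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Byte length (encoded as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

ENCODERS = {"long": encode_long, "string": encode_string}

def encode_record(schema, record):
    """Concatenate the encoded field values in schema order; field names are not written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]]) for field in schema["fields"]
    )

record = {"name": "HBase", "since": 2007}
payload = encode_record(SCHEMA, record)    # 8 bytes: 0a 48 42 61 73 65 ae 1f
json_text = json.dumps(record)             # the same record as JSON text, for comparison
```

The JSON text is several times longer than the binary payload, mostly because it repeats the field names. The trade-off is that the eight bytes are meaningless without the schema, so a reader has to have the writer's schema on hand to decode them.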

HBase
HBase is a NoSQL data store based on Google's Bigtable. The data store adds random read/write access to the Hadoop stack, extending offline processing capabilities and enabling real-time serving of very large datasets. The project's goal is the hosting of big tables (billions of rows by millions of columns) running atop commodity hardware. HBase had been a sub-project of Hadoop since 2007.
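
Google's Bigtable paper describes a table as a sparse, sorted map from (row key, column, timestamp) to a value. Assuming that model, the toy Python class below shows what random writes, random reads, and ordered range scans look like against such a map. It is an in-memory sketch, not HBase's client API: `TinyTable` and its methods are invented here, the sample rows are made up, and nothing is distributed or persisted.

```python
class TinyTable:
    """Toy in-memory table: (row, column) -> {timestamp: value}."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value, ts):
        """Random write: store one version of one cell."""
        self._cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column):
        """Random read: newest version of one cell, or None if it was never written."""
        versions = self._cells.get((row, column))
        return versions[max(versions)] if versions else None

    def scan(self, start_row, stop_row):
        """Range read in sorted key order; start is inclusive, stop is exclusive.

        Sorting on every call keeps the sketch short; a real store would keep
        its data in key order instead.
        """
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)

# Hypothetical rows; column names use a "family:qualifier" style, years serve as timestamps.
table = TinyTable()
table.put("hbase", "info:status", "sub-project", ts=2007)
table.put("hbase", "info:status", "top-level", ts=2010)
table.put("avro", "info:status", "top-level", ts=2010)
table.put("mahout", "info:status", "top-level", ts=2010)

latest = table.get("hbase", "info:status")   # "top-level": the newest version wins
a_to_i = list(table.scan("a", "i"))          # the avro and hbase rows, in key order
```

Only cells that were actually written take up space, which is how a table can have millions of possible columns without storing millions of values per row. Keeping rows in key order is what makes range scans natural in this model.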

It's already been a busy year for Apache, and an extremely successful one. Along with today's graduations, five other projects have been promoted to the top level this year. Apache UIMA, an analysis system for unstructured data, and Apache Shindig, an OpenSocial container, were two significant promotions. Apache Click, a JEE web app framework that also graduated to top-level status, has gotten some Java developers excited. Two of Apache's most high-profile projects graduated this year as well: Apache Cassandra and Apache Subversion. Subversion is, of course, one of the most popular version control systems available, and Cassandra is the red-hot NoSQL data store that everyone's talking about.

The six graduates today and the five earlier this year make eleven new Top-Level Apache projects, and the year's not even halfway finished! Imagine what could happen in the next seven months.