
David has worked at companies including HP and Sun Microsystems as a Product Marketer and Director. He is now an executive at Lucid Imagination, a leading company in Solr- and Lucene-based enterprise search solutions.

Hadoop lets you store everything; with Lucene/Solr and more

10.28.2011

This month’s Wired Magazine features a story on the roots of Hadoop at Yahoo and the three companies vying to drive its commercial frontiers farther forward faster: Hortonworks (Apache Lucene Eurocon Barcelona Keynote Video now available, see below), MapR, and Cloudera. MapR CEO John Schroeder sums it up:

"If I can get a terabyte drive for $100 — or less if I buy in bulk — and I can get cheap processing power and network bandwidth to get to that drive, why wouldn't I just keep everything?" he says. "Hadoop lets you keep all your raw data and ask questions of it in the future."


Yahoo, while otherwise lamented in the press for its business-model woes, has done this with an array of applications, from spam hunting (retraining the model every few hours) to auto-categorization and user content mapping, running 5 million jobs a month across over 40 thousand servers and 170 petabytes of storage (a mere $17M worth of disk, enough to keep at most maybe a half-dozen enterprise storage sales guys busy; multi-billion-dollar enterprise storage companies are in a tizzy). With the leverage this affords, it's no surprise that Ebay has increased its Hadoop footprint 5x to over 2,500 servers in the last year. Nor is it surprising that Eric Baldeschwieler, keynote speaker at Apache Lucene Eurocon 2011 in Barcelona last week, predicts that 50% of the world's data will be stored on Hadoop within 5 years:

KEYNOTE: Architecting the Future of Big Data & Search, Eric Baldeschwieler, Hortonworks CEO|Apache Lucene Eurocon Barcelona 2011 from Lucene Revolution on Vimeo.


So step one: store it all, and map/reduce to your heart’s content, cranking through key-value abstractions that produce insights you just couldn’t get running it in and out of a relational database (though with HDFS and Hive, the constructs of filesystem and query retrieval from the conventional data world are not out of reach). At Lucid, we’ve helped streamline that process, for example, with built-in HDFS connectors from LucidWorks.
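The map/reduce pattern described above can be sketched in plain Python, without a Hadoop cluster: a mapper emits key-value pairs from raw records, a shuffle step groups the pairs by key, and a reducer aggregates each group. The log format and the counting job here are hypothetical illustrations, not anything from the workloads mentioned in the article.

```python
from collections import defaultdict

# Hypothetical raw records: one "user action" event per line.
raw_logs = [
    "alice click", "bob search", "alice search",
    "carol click", "alice click", "bob click",
]

def mapper(line):
    # Emit (key, value) pairs: here, one count per action type.
    user, action = line.split()
    yield (action, 1)

def shuffle(pairs):
    # Group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Aggregate each group: here, a simple sum of counts.
    return (key, sum(values))

mapped = (pair for line in raw_logs for pair in mapper(line))
results = dict(reducer(k, vs) for k, vs in shuffle(mapped))
print(results)  # {'click': 4, 'search': 2}
```

The same three-phase shape scales from this toy list to petabytes on HDFS, which is the point of "store it all first, ask questions later."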

But that doesn’t answer the question of how to animate the virtuous cycle of insights available once you get all that data stored. Here’s where the search equation gets interesting. If you know exactly what you are looking for every time, it’s one thing to write some jobs that extract a particular trend or insight. But when you keep everything, can you know everything a priori? Of course not. Grant Ingersoll’s talk sets forth a powerful portfolio of tools centered on Lucene/Solr.
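The data structure that makes those ad-hoc questions cheap is the inverted index at the heart of Lucene/Solr: a map from each term to the documents containing it, so a query you never anticipated still resolves without rescanning the raw data. A minimal sketch in plain Python, with made-up documents for illustration:

```python
from collections import defaultdict

# Hypothetical documents, keyed by id.
docs = {
    1: "hadoop stores raw data",
    2: "solr searches indexed data",
    3: "lucene powers solr search",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    # AND query: return ids of docs containing every term.
    sets = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(search("data"))          # [1, 2]
print(search("solr", "data"))  # [2]
```

Lucene adds tokenization, scoring, and on-disk segment storage on top of this idea, but the lookup shape is the same: queries touch the index, not every stored record.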

Together, these two talks give you a solid foundation for why applying search to big data matters to end users and businesses alike: better awareness driven by search backed by real data, combined with developers enabled to fine-tune access and retrieval, and the agility to fill the white spaces of relationships between available information (what you didn’t know you didn’t know).

More talks from Barcelona are here. We’ll touch on the talk from Michael Busch of Twitter soon.


Source: http://www.lucidimagination.com/blog/2011/10/27/hadoop-lets-you-store-everything-with-lucenesolr-and-more-you-can-find-what-youre-looking-for/
Published at DZone with permission of its author, David Fishman.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)