Fascinated by the "craft" of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past five years. He is a member of the Apache Software Foundation, and lately has been mulling over how we move from the read/write web to the read/write/share web. In biotech, financial services, and defence IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker, he has advocated the advantages of Agile practices in software development. Eric became involved in Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Office formats, which became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the Free/Open Source Model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. Eric co-authored "Solr 1.4 Enterprise Search Server", the first book on Solr. He blogs at http://www.opensourceconnections.com/blog/. Eric is a DZone MVB, is not an employee of DZone, and has posted 8 posts at DZone.

Search is the Dominant Metaphor for Working with Big Data

I went to LuceneRevolution to test out my assertion that

Search is the dominant metaphor for working with Big Data

and based on the conversations that I had, that assertion holds water.

As Grant Ingersoll pointed out in his keynote, the basic plumbing requirements for Big Data (storage, distributed processing, and a cheap price tag) have been met. What we are missing is the actual ability to make decisions based on the information contained in our Big Data sets. We are still caught up in the navel-gazing activity of “how much raw data have I collected?”, and aren’t focusing on “Should I make decision X or Y based on my data?” There is a huge gap between those who write MapReduce jobs and those who need access to the results of those jobs. Processed results aren’t enough, and we shouldn’t need to file the equivalent of a FOIA request with our IT department to gain access to the raw data. Search-based applications, also known as Search, Discovery, and Analytics (SDA), fill the gap between the developers and data scientists working with the raw data and the business users attempting to make data-driven decisions.

Search engines were the original “Big Data” ten years ago. Then the rise of Google led to the search market bifurcating into internal Enterprise search and e-commerce search. The importance of Search seemed to dwindle; witness the declining attendance at conferences like Enterprise Search Summit. But with the accelerating growth in data, aka “Big Data,” search has in the last few years moved from a basic input box to the feature that can make or break your application.

Other thoughts:

  • Met a number of ex-Endeca folks. I’m hoping that the Lucene community takes advantage of these people, who’ve done cool things with other search engines like Endeca, and brings some of their great ideas into Lucene and Solr. New blood is good.
  • This continues to be the “Year of Big Data”. I’m looking forward to tighter integration between the search and the big data communities.
  • Lots of folks are building custom QueryParsers to solve specific problems. Be interesting to see how much of this becomes generalized and open sourced.
  • Microsoft seems to have dropped its knee-jerk reaction against Java and is working to make it easy to run the Big Data ecosystem of projects on its cloud platform, Azure.
  • People are anxious to use Lucene 4. A strength of the Solr open source project is its incredible level of unit testing. Go ahead and use it! If your IT manager doesn’t like to use unreleased code, tell ’em to come talk to me!
  • ElasticSearch continues to have some great mind share, but suffers from the much smaller committer community of 1! The competition is keeping Solr honest.
  • Mark Miller gave a big pitch for RandomizedTesting, which is an extraction of Solr/Lucene’s awesome unit testing framework into something generic. Anything that makes testing complex systems simpler is good.

It was a great conference: very thought-provoking, with great people and conversations. LuceneRevolution 2012 continues to set the bar for hardcore technical conferences. Attendance is a no-brainer if you are working with Lucene!

Published at DZone with permission of Eric Pugh, author and DZone MVB. (source)


Fahmeed Nawaz replied on Tue, 2012/06/12 - 11:11am

Could anyone, please, add here info on how to ensure getLastError() is called on each connection when pool is destructed? TBH, I'd expect very much such a flush/sync to be done by default during connection closing!
