Mitch Pronschinske is a Senior Content Analyst at DZone and a DZone Zone Leader.

Indexing Big Data on AWS with Solr

06.03.2012


Scott Stults of OpenSource Connections shows how you can build a scalable search platform capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done.

Amazon Web Services offers a quick and easy way to build a scalable search platform. That flexibility is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation covers one such approach, capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case 150 worker nodes generated 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.
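The core pattern the abstract describes is a shared work queue that worker nodes drain and then release themselves from. The talk itself uses EC2 nodes indexing into Solr; the sketch below only illustrates the queue-draining pattern with in-process threads, and `index_batch` is a hypothetical stand-in for a POST to Solr's update handler, not code from the session.

```python
import queue
import threading

def index_batch(batch):
    # Stand-in for posting a batch of patent documents to Solr's
    # /update handler; here we just report how many were "indexed".
    return len(batch)

def worker(work_queue, results):
    # Each worker pulls batches until the queue is drained, then
    # exits -- mirroring how worker nodes are relinquished back to
    # the cloud once the job is done.
    while True:
        try:
            batch = work_queue.get_nowait()
        except queue.Empty:
            return
        results.append(index_batch(batch))
        work_queue.task_done()

def run_indexing(batches, n_workers=4):
    work_queue = queue.Queue()
    for b in batches:
        work_queue.put(b)
    results = []
    threads = [threading.Thread(target=worker, args=(work_queue, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

# e.g. 10 batches of 100 patent documents each
total = run_indexing([["doc"] * 100 for _ in range(10)])
```

In the real deployment described in the session, the queue would live outside any single machine (for example in a message service) so that hundreds of EC2 nodes can pull from it and progress can be tracked centrally.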

Download Session Slides.

Comments

Daniel Slazer replied on Tue, 2012/06/12 - 12:11pm

The sending is done via a GPS device. When there is connectivity, the device sends the data and there is no corruption. The problem starts when there is no connectivity (GPRS) to the server: the device then buffers the data, and the moment it sees connectivity it tries to send each message in the buffer. It won't send another message until it receives confirmation of the previous one. We notice the corruption normally occurs in the first few characters of the message string. Is it possible to confirm that the flush is sending it properly? We have checked the device log, and the messages sent from the devices look perfectly fine. Only when read by the buffered reader do they show corruption.
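The store-and-forward loop the comment describes (buffer while offline, send one message at a time, drop it only after an acknowledgement) can be sketched as follows. This is purely illustrative: `transport`, its `send`, and `wait_ack` are hypothetical names, not part of the device or any library mentioned above, and the point of interest is that `send` must flush the complete frame before the ack wait begins.

```python
from collections import deque

class StoreAndForward:
    """Sketch of the send loop described above: buffer messages while
    offline, send one at a time, and remove a message from the buffer
    only after its acknowledgement arrives."""

    def __init__(self, transport):
        self.transport = transport   # hypothetical object with send()/wait_ack()
        self.pending = deque()

    def queue_message(self, msg):
        # Called while connectivity is down: just buffer the message.
        self.pending.append(msg)

    def drain(self):
        # Called when connectivity returns. send() must flush the whole
        # frame; truncated leading bytes here would look exactly like
        # corruption in the first few characters on the reader side.
        sent = []
        while self.pending:
            msg = self.pending[0]
            self.transport.send(msg)
            if not self.transport.wait_ack():
                break                # leave the message queued for retry
            sent.append(self.pending.popleft())
        return sent
```

If the device's writer is buffered, an explicit flush after each frame (before waiting for the ack) is the first thing to verify.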
