Five Criteria of Next Generation Data Warehouse

A few months ago, Jeff Kelly published a comprehensive article that captured the current state of Big Data and Hadoop nicely and laid out the blueprint of next-generation cloud computing.

For cloud computing nerds like us, the whole article was a joy to read, but the section called “Next Generation Data Warehousing” especially caught our eye. In that section, Kelly listed five characteristics of next-generation data warehouses. Here, we want to see how our platform measures up against Kelly’s criteria.

1 Massively parallel processing, or MPP, capabilities: Next Generation Data Warehouses employ massively parallel processing, or MPP, that allow for the ingest, processing and querying of data on multiple machines simultaneously. The result is significantly faster performance than traditional data warehouses that run on a single, large box and are constrained by a single choke point for data ingest.

Check. We implemented a job queue (Perfect Queue) that sends our customers’ jobs across hundreds of machines on Amazon Web Services.
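To make the idea concrete, here is a minimal sketch of a job queue fanning work out to a pool of workers. It is not Perfect Queue's API: the function `run_jobs`, the thread-based workers, and the "sum the payload" job are all assumptions made for illustration, with threads standing in for separate machines.

```python
import queue
import threading


def run_jobs(jobs, num_workers=4):
    """Run every (job_id, payload) pair on a pool of workers fed from one queue."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # the queue is drained, so this worker exits
            outcome = sum(payload)  # hypothetical stand-in for a real query
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Eight small jobs; any worker may pick up any job.
jobs = [(i, list(range(i + 1))) for i in range(8)]
results = run_jobs(jobs)
```

Because workers pull from the queue rather than being assigned work up front, adding more workers spreads the same jobs over more machines without changing how jobs are submitted.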

2 Shared-nothing architectures: A shared-nothing architecture ensures there is no single point of failure in Next Generation Data Warehousing environments. Each node operates independently of the others so if one machine fails, the others keep running. This is particularly important in MPP environments, in which, with sometimes hundreds of machines processing data in parallel, the occasional failure of one or more machines is inevitable.

Check. We take advantage of Hadoop MapReduce running on EC2 to process our customers’ jobs.
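As a rough illustration of why shared-nothing matters, the sketch below processes each partition using only that partition's data and reruns a partition whose node "fails" without touching the others. This is a sequential simulation, not Hadoop's API: `map_partition`, `run_shared_nothing`, and the `failing_nodes` parameter are hypothetical names chosen for this example.

```python
from collections import Counter


def map_partition(records):
    """Map step: count events by type using only this partition's records."""
    return Counter(r["event"] for r in records)


def run_shared_nothing(partitions, failing_nodes=frozenset(), max_attempts=2):
    """Process partitions independently, retrying only the ones that fail."""
    totals = Counter()
    for node_id, records in enumerate(partitions):
        for attempt in range(max_attempts):
            # Simulate a node that crashes on its first attempt only.
            if node_id in failing_nodes and attempt == 0:
                continue  # rerun just this partition; the others are unaffected
            totals += map_partition(records)  # reduce step: merge partial counts
            break
    return totals


partitions = [
    [{"event": "click"}, {"event": "view"}],
    [{"event": "click"}],
    [{"event": "view"}, {"event": "view"}],
]
totals = run_shared_nothing(partitions, failing_nodes={1})
```

The final counts are the same whether or not node 1 fails on its first attempt, because no partition depends on state held by another node.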

3 Columnar architectures: Rather than storing and processing data in rows, as is typical with most relational databases, most Next Generation Data Warehouses employ columnar architectures. In columnar environments, only columns that contain the necessary data to determine the “answer” to a given query are processed, rather than entire rows of data, resulting in split-second query results. This also means data does not need to be structured into neat tables as with traditional relational databases.

Check. We designed and implemented a columnar database sitting on top of Amazon S3.
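The difference between row and column layouts can be sketched in a few lines. The example below is only an in-memory illustration, not our storage format: the sample records and the helpers `to_columns` and `total_amount_for` are invented for this sketch, and it assumes every row has the same fields.

```python
rows = [
    {"user": "a", "country": "JP", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "JP", "amount": 5},
]


def to_columns(rows):
    """Pivot row records into one list per column (assumes uniform fields)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_for(columns, country):
    """Sum 'amount' for one country, reading only the two columns involved."""
    # The 'user' column is never touched by this query.
    return sum(
        amount
        for c, amount in zip(columns["country"], columns["amount"])
        if c == country
    )


columns = to_columns(rows)
jp_total = total_amount_for(columns, "JP")
```

In a row layout the same query would have to read every field of every record; in the column layout, unused columns such as `user` can stay on disk.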

4 Advanced data compression capabilities: Advanced data compression capabilities allow Next Generation Data Warehouses to ingest and store larger volumes of data than otherwise possible and to do so with significantly fewer hardware resources than traditional databases. A warehouse with 10-to-1 compression capabilities, for example, can compress 10 terabytes of data down to 1 terabyte. Data compression, and a related technique called data encoding, are critical to scaling to massive volumes of data efficiently.

Check. We achieve a 5-10x compression ratio. Columnar data storage helps with compression considerably, but our secret sauce is a binary serializer called MessagePack. MessagePack is space-efficient and incredibly fast to serialize and deserialize. One of our co-founders is the original author of MessagePack, and we use it extensively throughout our stack.
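For readers who want to measure a compression ratio themselves, here is a small sketch using only Python's standard library. It deliberately uses JSON and zlib rather than MessagePack (which needs a third-party package), so the numbers it produces say nothing about our actual ratios; the sample data and the helpers `compression_ratio` and `stored_size` are assumptions made for illustration.

```python
import json
import zlib


def compression_ratio(raw: bytes) -> float:
    """Return the original size divided by the zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw))


def stored_size(raw_terabytes: float, ratio: float) -> float:
    """Terabytes left on disk after compressing at the given ratio."""
    return raw_terabytes / ratio


# Synthetic, highly repetitive log-like records.
rows = [
    {"status": "ok" if i % 10 else "error", "region": "us-east-1"}
    for i in range(10_000)
]

# The same data serialized row by row and column by column.
row_bytes = json.dumps(rows).encode("utf-8")
columns = {
    "status": [r["status"] for r in rows],
    "region": [r["region"] for r in rows],
}
col_bytes = json.dumps(columns).encode("utf-8")

row_ratio = compression_ratio(row_bytes)
col_ratio = compression_ratio(col_bytes)

# Kelly's example: 10 TB at 10-to-1 compression occupies 1 TB.
example_tb = stored_size(10, 10)
```

Printing `row_ratio` and `col_ratio` for your own data is a quick way to see how much the layout and the serializer contribute before the compressor even runs.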

5 Commodity hardware: Like Hadoop clusters, most Next Generation Data Warehouses run on off-the-shelf commodity hardware (there are some exceptions to this rule, however) from Dell, IBM and others, so they can scale-out in a cost effective manner.

Check. Since our data warehouse sits on top of Amazon S3, this is certainly the case.

In conclusion, Treasure Data’s Cloud Data Warehouse meets all five of Kelly’s criteria for a Next Generation Data Warehouse. We know these criteria are no silver bullet, but meeting them is definitely a vote of confidence in our product =)

Published at DZone with permission of Sadayuki Furuhashi, author and DZone MVB. (source)
