Big Data/Analytics Zone is brought to you in partnership with:

Victor works on the Angular team at Google. He is interested in functional programming, the Web platform, and client-side applications. Being a language nerd he spends a lot of my time playing with Smalltalk, JS, Dart, Scala, Haskell, Clojure, Ruby, and Ioke. Victor is a DZone MVB and is not an employee of DZone and has posted 45 posts at DZone. You can read more from them at their website. View Full User Profile

How to Build a Data-Warehouse in 4 weeks, Part 1

08.23.2011
| 9676 views |
  • submit to reddit

First of all, there is no reason to be afraid. Not only big corporations having terabytes of data can use data-warehouses. If you have a complex domain model and a few databases with millions of records here and there it might be useful to create a separate database instance for performing analytical queries and generating reports. Also, don’t believe people who say that you will need years to build a simple DW. You don’t need more than 4 weeks to develop DW that will bring some value to your business.

Architecture

It’s very important to think about the architecture of your DW as it will define tools and procedures that will be used. To make it simple: ‘Inmon VS Kimball‘ is the question you need to answer:

  • In Inmon’s paradigm all information is stored in 3rd normal form. Basically, the schema of your DW will be similar (more or less) to the schema of your transactional database. Inmon’s paradigm is good for creating large centralized data warehouses that will serve the whole enterprise. It’s a top down approach that requires more upfront design. The performance in this model isn’t as good as in Kimball’s model because data is normalized.

  • In Kimball’s paradigm you start building several databases (known as data marts) that will serve individual departments of your business. In this model DW is a combination of all data marts. The dimensional model, that is used to store data, simplifies the schema significantly and, as a result, gives you much better performance. As for me I prefer Kimball’s paradigm as you don’t have to design the whole DW upfront. You can start with creating a single data mart and it can be done by one person in a month.

Read about Kimball’s model

The best thing you can do is to buy this book “The Data Warehouse Toolkit, 2nd Edition: The Complete Guide to Dimensional Modeling”. It’s a good overview of the dimension model, which will help you if you don’t have any experience in building a DW. It helped me a lot. Highly recommend it to everyone interested in the topic.

 

Database

Choosing a DBMS for your DW is a very important step. And it’s not an easy decision to make. There are two options you can choose from. If your dataset is rather small you can choose a familiar row-oriented database (such as Postgres). Just having data denormalized will improve performance. But if your database is huge it’s time to put on big boy’s pants and look at column-oriented storages:

  • Row-oriented database are stable, reliable and familiar to everyone. They are perfect for transactional operations but not so good for analytical queries.
  • Column-oriented databases are not very mature. As a result, they are hard to administrate, hard to find any documentation about, and they are not as reliable as row-oriented databases. It isn’t surprising taking into account the fact that these databases have been developing only for a few years. But they can give you something that MySQL can’t – performance. From my experience, they are AT LEAST 10 TIMES FASTER for complex analytical queries.

My advice would be to load some data into your favorite row-oriented database and check the performance. And ONLY if you are not happy with it go and try MonetDB, LucidDB, Infobright or some other column-oriented DBMS.

 

Column-Oriented DBMS

If you decided to go with a column-oriented database I can give you some advice about two databases I had experience with:

Infobright

The biggest benefit you’ll get if you choose to use Infobright is that fact that it’s a mysql engine. You can use your favorite driver to connect to it (so you can use any language, any platform) and you won’t have to learn a new dialect of SQL. It’s a very solid product but the free version has severe limitations - you can use only bulk upload. No inserts, updates or deletes.

LucidDB

The distribution includes it’s own jdbc driver. If you are a ruby or a python developer you are out of luck. In addition, their site doesn’t contain any decent documentation. It’s just a collection of reference-like wiki page, most of which contain only a brief description of some crazy parameters such as processorMaxBytes or databaseShadowLogIncrementSize. I am 100% sure that most of all luciddb users will never change them. On the other hand, the free version of LucidDB doesn’t have any limitations.

 

From http://victorsavkin.com/post/9209987806/how-to-build-a-data-warehouse-in-4-weeks-part-1

Published at DZone with permission of Victor Savkin, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Nicolas Bousquet replied on Mon, 2011/08/29 - 3:11am

It seems to me that MySQL is ofen not the best database or the one used for building a datawarehouse. Personnaly I have worked with Oracle RDBMS and indeed the performance was really really good. And if it's not enough you have many many options to optimize with partitionning, or in memory tables for temporary data.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.