Big Data/Analytics Zone is brought to you in partnership with:

Enterprise Architect in HCL Technologies a $7Billion IT services organization. My role is to work as a Technology Partner for large enterprise customers providing them low cost opensource solutions around Java, Spring and vFabric stack. I am also working on various projects involving, Cloud base solution, Mobile application and Business Analytics around Spring and vFabric space. Over 23 yrs, I have build repository of technologies and tools I liked and used extensively in my day to day work. In this blog, I am putting all these best practices and tools so that it will help the people who visit my website. Krishna is a DZone MVB and is not an employee of DZone and has posted 64 posts at DZone. You can read more from them at their website. View Full User Profile

Sharding, Scaling, Data Storage Methodologies, and More: Insights on Big Data

01.16.2014
| 7397 views |
  • submit to reddit

OLTP: Refers to Online Transaction Processing

It’s a bunch of programs that used to help and manage transactions (insert, update, delete and get) oriented applications. Most of the OLTP applications are faster because the database is designed using 3NF. OLTP systems are vertically scalable.

OLAP:  Refers to Online Analytical Processing

The OLAP is used for Business Analytics, Data Warehousing kind of transactions. The data processing is pretty slow because it enables users to analyze multidimensional data interactively from multiple data sources. The database schema used in OLAP applications are STAR (Facts and Dimensions, normalization is given a pass, redundancy is the order of the day).  OLAP systems are horizontally scaling.

OLTP vs. OLAP: There are some major difference between OLTP and OLAP.

Data Sharding:

Traditional way of database architecture implements vertical scaling that means splitting the table into number of columns and keeping them separately in physical or logically grouping (tree structure). This will lead into performance trouble when the data is growing. Need to increase the memory, CPU and disk space each and every time when we hit the performance problem.

To eliminate the above problem Data Sharding or Shared nothing concept is evolved, in which the database are scaled horizontally instead of vertically using the master/slave architecture by breaking the database into shards and spreading those into a number of vertically scalable servers.

The Data Sharding concept is discussed in detail in this link.

MPP (Massive Parallel Processing systems): Refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel.

MPP is also known as cluster computing or shared nothing architecture discussed above.

The examples for MPP are TeraData, GreenPlum.

Vertical and Horizontal Scaling:

Horizontal scaling means that you scale by adding more machines into your pool of resources where vertical scaling means that you scale by adding more power (CPU, RAM) to your existing machine.

In a database world horizontal-scaling is often based on partitioning of the data i.e. each node contains only part of the data, in vertical-scaling the data resides on a single node and scaling is done through multi-core i.e. spreading the load between the CPU and RAM resources of that machine.

With horizontal-scaling it is often easier to scale dynamically by adding more machines into the existing pool – Vertical-scaling is often limited to the capacity of a single machine.

Below are the differences between vertical and horizontal scaling.

CAP Theorem: The description for the CAP Theorem is discussed in this link.

Some insight on CAP: http://www.slideshare.net/ssachin7/scalability-design-principles-internal-session

Greenplum:  Greenplum Database is a massively parallel processing (MPP) database server based on PostgreSQL open-source technology. MPP (also known as shared nothing architecture) refers to systems with two or more processors which cooperate to carry out an operation – each processor with its own memory, operating system and disks.

The high-level overview of GreenPlum is discussed in this link.

Some Interesting Links:

http://www.youtube.com/watch?v=ph4bFhzqBKU,

http://www.youtube.com/watch?v=-7CpIrGUQjo

Hbase:  HBase is the Hadoop database. It is distributed, scalable, big data storage. It is used to provide real-time read and write access to large database which uses cluster (master/slave) architecture to store/retrieve data.

  • Hadoop is open source software developed by Apache to store large data.
    • MapReduce is used for distributed processing of large data sets on clusters (master/slave). MapReduce takes care of scheduling tasks, monitoring them and re-executing any failed tasks. The primary objective of Map/Reduce is to split the input data set into independent chunks and send to the cluster. The MapReduce sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
    • The Hadoop Distributed File System(HDFS) primary objective of HDFS is to store data consistently even in the presence of failures. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). The HDFS cluster consists of one or more slaves who actually contain the file system and a master server manages the file system namespace and regulates access to files.

Hbase documentation is provided in this link.

Some Interesting Links:

http://www.youtube.com/watch?v=UkbULonrP2o

http://www.youtube.com/watch?v=1Qx3uvmjIyU

http://www-01.ibm.com/software/data/infosphere/hadoop/hbase/

Published at DZone with permission of Krishna Prasad, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)