
Istvan Szegedi is an IT Technical Architect at Vodafone UK. He has worked at Hewlett-Packard, Nokia Networks, Google, Morgan Stanley and Vodafone. He holds certifications including Sun Certified System Administrator, Sun Certified Java Programmer, Sun Certified Web Component Developer, Salesforce.com Certified Force.com Developer and TOGAF Certified Enterprise Architect. As a big fan of mobile and cloud computing, he likes to believe that these technologies will eventually push aside the desktop/client-server architecture. Istvan is a DZone MVB, is not an employee of DZone, and has posted 38 posts at DZone.

Integrating R with Cloudera Impala for Real-Time Queries on Hadoop

11.26.2013

Introduction

Cloudera Impala supports low-latency, interactive queries on Hadoop data sets stored either in the Hadoop Distributed File System (HDFS) or in HBase, the distributed NoSQL database for Hadoop. Impala’s approach is to use Hadoop as a storage engine but move away from MapReduce. Instead, Impala uses distributed queries, a concept inherited from massively parallel processing (MPP) databases. As a result, Impala supports an SQL-like query language (in the same way as Apache Hive) but can execute queries 10-100 times faster than Hive, which converts them into MapReduce jobs. You can find more details on Impala in one of the previous posts.

R is one of the most popular open source environments for statistical computing and graphics. It can work with various data sources, from comma-separated files and web content referenced by URLs to relational databases, NoSQL stores (e.g. MongoDB or Cassandra) and Hadoop.

Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. This solution provides fast, interactive queries on top of Hadoop data sets, and the results can then be further processed or visualized within R.

Cloudera Impala ODBC drivers

As the diagram below shows, Impala runs on top of data sets stored in HDFS or HBase, and users can interact with it in multiple ways.

[Figure: Impala architecture]

One option is to use impala-shell, which is part of the Impala package and provides a command line interface. Another option is to use Hue (Cloudera’s Hadoop User Experience product), a web browser based UI that offers, among other functions, a query editor capable of running queries against Pig, Hive or Impala. The third option is to use an ODBC driver and connect one of the well-known BI tools to Impala.
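As a rough illustration of the first option, the sketch below wraps a one-off impala-shell call. The run_impala_query helper is hypothetical, and the localhost:21000 address is an assumption (the usual impalad port for impala-shell, which differs from the ODBC port); adjust both to your cluster.

```shell
#!/bin/sh
# Sketch: run a single statement through impala-shell.
# run_impala_query is a hypothetical helper, not part of the Impala package.
run_impala_query() {
    # $1 = impalad host:port, $2 = SQL statement
    if ! command -v impala-shell >/dev/null 2>&1; then
        echo "impala-shell not found on PATH" >&2
        return 127
    fi
    # -i selects the impalad to connect to, -q runs one query and exits.
    impala-shell -i "$1" -q "$2"
}

# Assumed address: a local impalad listening on port 21000.
run_impala_query localhost:21000 "SHOW TABLES" || echo "query was not run"
```

The guard keeps the script harmless on machines where the Impala client is not installed.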

Cloudera provides connectors for some of the leading analytics and data visualization tools such as Tableau, QlikView or MicroStrategy. It also offers a generic ODBC driver that can be used to connect various other tools. This is the software component that we will use in this post to demonstrate how to integrate R with Cloudera Impala.

Install R, RStudio Server, Impala ODBC and RODBC

Impala installation was covered in this post. To install R in a Linux environment (Fedora 19 is used here), we need to execute the following commands:

# Install the EPEL repository package - EPEL stands for Extra Packages for Enterprise Linux
$ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

$ sudo yum install R
================================================================================
 Package                Arch           Version               Repository    Size
================================================================================
Updating:
 R                      x86_64         3.0.2-1.el6           epel          20 k
Updating for dependencies:
 R-core                 x86_64         3.0.2-1.el6           epel          46 M
 R-core-devel           x86_64         3.0.2-1.el6           epel          90 k
 R-devel                x86_64         3.0.2-1.el6           epel          19 k
 R-java                 x86_64         3.0.2-1.el6           epel          20 k
 R-java-devel           x86_64         3.0.2-1.el6           epel          20 k
 libRmath               x86_64         3.0.2-1.el6           epel         116 k
 libRmath-devel         x86_64         3.0.2-1.el6           epel          24 k

Transaction Summary
================================================================================
Upgrade       8 Package(s)

R comes with a command line interpreter, but if you want a more convenient development environment, you may prefer RStudio. RStudio has a desktop version as well as a web browser based alternative called RStudio Server. Both can be downloaded for free from the RStudio website. We will use RStudio Server in this post.

To install RStudio Server, you need to execute the following command:

$ sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm

================================================================================
 Package           Arch   Version         Repository                       Size
================================================================================
Installing:
 rstudio-server    x86_64 0.97.551-1      /rstudio-server-0.97.551-x86_64  96 M
...

Transaction Summary
===================================================================
Install       3 Package(s)

To ensure that the Impala ODBC driver works and the RODBC package can be installed within R (as shown later in this post), you also need to install the unixODBC and unixODBC-devel packages:

$ sudo yum install unixODBC
$ sudo yum install unixODBC-devel

Finally, you have to install the Cloudera Impala ODBC driver. You can download it from the Cloudera website; at the time of writing, the latest version is 2.5 (the driver file name is ClouderaImpalaODBC-2.5.5.1005-1.el6.x86_64.rpm). To install the Impala ODBC driver, run the following command after downloading it:

$ sudo yum --nogpgcheck localinstall ClouderaImpalaODBC-2.5.5.1005-1.el6.x86_64.rpm

The Impala ODBC driver requires a couple of properly configured files (the driver package includes template files that need to be edited and copied to the correct directory). The two key configuration files are odbc.ini and cloudera.impalaodbc.ini.

odbc.ini should look something like this:

[Impala]
# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Cloudera ODBC Driver for Impala (64-bit) DSN

# Driver: The location where the ODBC driver is installed to.
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=localhost
PORT=21050
Database=default

In the cloudera.impalaodbc.ini configuration file we have the following settings:

# SimbaDN / unixODBC
ODBCInstLib=libodbcinst.so

In addition, we need to define the environment variables as follows:

$ export LD_LIBRARY_PATH=/usr/local/lib:/opt/cloudera/impalaodbc/lib/64
$ export ODBCINI=/etc/odbc.ini
$ export SIMBADN=/etc/cloudera.impalaodbc.ini
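
Before starting R, it can be worth checking that these settings are visible in the current shell. The sketch below does that with a hypothetical check_odbc_env helper; the variable names and the driver directory are taken from the exports above, everything else is an assumption.

```shell
#!/bin/sh
# Sketch: sanity-check the ODBC-related environment before starting R.
# check_odbc_env is a hypothetical helper, not part of unixODBC or the driver.
check_odbc_env() {
    # $1 = driver library directory expected on LD_LIBRARY_PATH (optional)
    lib_dir=${1:-/opt/cloudera/impalaodbc/lib/64}
    status=0

    # Each variable (names as exported in this post) must name a readable file.
    for var in ODBCINI SIMBADN; do
        eval "path=\${$var:-}"
        if [ -z "$path" ]; then
            echo "$var is not set"
            status=1
        elif [ ! -r "$path" ]; then
            echo "$var points to an unreadable file: $path"
            status=1
        else
            echo "$var ok: $path"
        fi
    done

    # The driver directory must be one of the colon-separated entries.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$lib_dir:"*) echo "LD_LIBRARY_PATH ok" ;;
        *) echo "LD_LIBRARY_PATH does not contain $lib_dir"; status=1 ;;
    esac

    return "$status"
}

check_odbc_env || echo "fix the settings listed above before starting R"
```

unixODBC also ships the isql tool; once the environment looks right, `isql -v Impala` attempts a connection to the DSN and is a quick way to test the driver outside R.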

The final step is to install the RODBC package for R. You can do it from the R command line:

$ R
> install.packages("RODBC")
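
Once RODBC is installed, a quick smoke test could look like the sketch below. It writes a short R script that opens the Impala DSN defined in odbc.ini and lists the tables of the default database, then runs it only if Rscript is available. The choice of SHOW TABLES as the test query and the use of a temporary script file are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: smoke-test the R to Impala connection through RODBC.
# The R script is written to a temporary file; its contents are illustrative.
r_script=$(mktemp)
cat > "$r_script" <<'EOF'
library(RODBC)
conn <- odbcConnect("Impala")          # DSN name from the [Impala] section of odbc.ini
print(sqlQuery(conn, "SHOW TABLES"))   # the result comes back as a data frame
odbcClose(conn)
EOF

# Run the script only when Rscript is installed; otherwise report where it was saved.
if command -v Rscript >/dev/null 2>&1; then
    Rscript "$r_script" || echo "query failed; check the DSN and the environment variables"
else
    echo "Rscript not found; script saved to $r_script"
fi
```

The same three calls (odbcConnect, sqlQuery, odbcClose) can be typed directly into an RStudio Server session.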
Published at DZone with permission of Istvan Szegedi, author and DZone MVB. (source)
