Cloudera Impala supports low-latency, interactive queries on Hadoop data sets stored either in the Hadoop Distributed File System (HDFS) or in HBase, the distributed NoSQL database for Hadoop. Impala's approach is to use Hadoop as a storage engine but to move away from MapReduce. Instead, Impala uses distributed queries, a concept inherited from massively parallel processing (MPP) databases. As a result, Impala supports a SQL-like query language (in the same way as Apache Hive) but can execute queries 10-100 times faster than Hive, which converts them into MapReduce jobs. You can find more details on Impala in one of the previous posts.
R is one of the most popular open source environments for statistical computing and graphics. It can work with various data sources, from comma-separated files and web content referenced by URLs to relational databases, NoSQL stores (e.g. MongoDB or Cassandra) and Hadoop.
Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. The solution provides fast, interactive queries running on top of Hadoop data sets, and the results can then be further processed or visualized within R.
Cloudera Impala ODBC drivers
As we can see in the diagram below, Impala runs on top of data sets stored in HDFS or HBase, and users can interact with it in multiple ways.
One option is to use impala-shell, which is part of the Impala package and provides a command line interface. Another option is Hue (Cloudera's Hadoop User Experience product), a web browser based UI that offers, among other functions, a query editor capable of running queries against Pig, Hive or Impala. The third option is to use an ODBC driver and connect one of the well-known BI tools to Impala.
Cloudera provides connectors for some of the leading analytics and data visualization tools such as Tableau, QlikView or MicroStrategy. It also offers a generic ODBC driver that can be used to connect various other tools. This is the software component that we will use in this post to demonstrate how to integrate R with Cloudera Impala.
Install R, RStudio Server, Impala ODBC and RODBC
Impala installation was covered in this post. To install R in a Linux environment (Fedora 19 is used here), we need to execute the following commands:
    # Install the EPEL repository package (EPEL stands for Extra Packages for Enterprise Linux)
    $ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
    $ sudo yum install R
    ================================================================================
     Package              Arch         Version             Repository    Size
    ================================================================================
    Updating:
     R                    x86_64       3.0.2-1.el6         epel           20 k
    Updating for dependencies:
     R-core               x86_64       3.0.2-1.el6         epel           46 M
     R-core-devel         x86_64       3.0.2-1.el6         epel           90 k
     R-devel              x86_64       3.0.2-1.el6         epel           19 k
     R-java               x86_64       3.0.2-1.el6         epel           20 k
     R-java-devel         x86_64       3.0.2-1.el6         epel           20 k
     libRmath             x86_64       3.0.2-1.el6         epel          116 k
     libRmath-devel       x86_64       3.0.2-1.el6         epel           24 k

    Transaction Summary
    ================================================================================
    Upgrade       8 Package(s)
R comes with a command line interpreter, but if you want a more convenient development environment, you may prefer RStudio. RStudio has a desktop version as well as a web browser based alternative called RStudio Server. Both can be downloaded for free from the RStudio website. We will use RStudio Server in this post.
To install RStudio Server, you need to execute the following command:
    $ sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm
    ================================================================================
     Package          Arch     Version      Repository                        Size
    ================================================================================
    Installing:
     rstudio-server   x86_64   0.97.551-1   /rstudio-server-0.97.551-x86_64   96 M
    ...
    Transaction Summary
    ================================================================================
    Install       3 Package(s)
To ensure that the Impala ODBC driver works and that the RODBC package can be installed within R (as shown later in this post), you also need to install the unixODBC and unixODBC-devel packages:
    $ sudo yum install unixODBC
    $ sudo yum install unixODBC-devel
Finally, you have to install the Cloudera Impala ODBC driver. You can download it from the Cloudera website; at the time of writing, the latest version is 2.5 (the driver file name is ClouderaImpalaODBC-<version>-1.el6.x86_64.rpm, where <version> is the full 2.5 build number). To install the Impala ODBC driver, run the following command after downloading it:
    $ sudo yum --nogpgcheck localinstall ClouderaImpalaODBC-<version>-1.el6.x86_64.rpm
The Impala ODBC driver requires a couple of properly configured files (the driver package ships template files that need to be edited and copied to the correct directory). The two key configuration files are odbc.ini and cloudera.impalaodbc.ini.
odbc.ini should look something like this:
    [Impala]
    # Description: DSN Description.
    # This key is not necessary and is only to give a description of the data source.
    Description=Cloudera ODBC Driver for Impala (64-bit) DSN
    # Driver: The location where the ODBC driver is installed to.
    Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
    # Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
    # They can also be specified on the connection string.
    HOST=localhost
    PORT=21050
    Database=default
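Before moving on, it can help to confirm that the DSN section and its keys are really what the driver will read. The following is a minimal sketch, assuming a simple odbc.ini with `key=value` lines and no spaces around the equals sign; the helper names `list_dsns` and `dsn_value` are hypothetical and are not part of unixODBC or the Cloudera driver.

```shell
# List the DSN section names defined in an odbc.ini-style file.
# Usage: list_dsns /etc/odbc.ini
list_dsns() {
    # Match lines of the form [Name] and print only Name.
    sed -n 's/^[[:space:]]*\[\([^]]*\)][[:space:]]*$/\1/p' "$1"
}

# Print the value of one key inside one DSN section.
# Usage: dsn_value /etc/odbc.ini Impala PORT
dsn_value() {
    awk -F= -v section="[$2]" -v key="$3" '
        # A line starting with "[" opens a new section; remember if it is ours.
        /^\[/ { in_section = ($0 == section); next }
        # Inside our section, print everything after the first "=" of the key.
        in_section && $1 == key { print substr($0, length($1) + 2); exit }
    ' "$1"
}
```

For example, `dsn_value /etc/odbc.ini Impala PORT` should print the port configured for the Impala DSN. If the unixODBC command line tools are installed, `odbcinst -q -s` lists the registered data sources and `isql -v Impala` attempts a real connection, which is a more complete check than parsing the file.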
In the cloudera.impalaodbc.ini configuration file we have the following settings:
    # SimbaDM / unixODBC
    ODBCInstLib=libodbcinst.so
In addition, we need to define the environment variables as follows:
    $ export LD_LIBRARY_PATH=/usr/local/lib:/opt/cloudera/impalaodbc/lib/64
    $ export ODBCINI=/etc/odbc.ini
    $ export SIMBAINI=/etc/cloudera.impalaodbc.ini
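A missing or mistyped environment variable is a common reason for the driver failing to load its configuration, so a quick check can save some debugging. The sketch below assumes each variable passed to it is meant to hold the path of a readable file; the function name `check_env_files` is a hypothetical helper, not something provided by the driver.

```shell
# Verify that each named environment variable is set and points to a readable file.
# Usage: check_env_files VAR_NAME [VAR_NAME ...]
# Returns 0 if every variable passes, 1 otherwise, and prints one line per problem.
check_env_files() {
    rc=0
    for var_name in "$@"; do
        # Look up the value of the variable whose name is stored in var_name.
        eval "file_path=\${$var_name:-}"
        if [ -z "$file_path" ]; then
            echo "$var_name is not set"
            rc=1
        elif [ ! -r "$file_path" ]; then
            echo "$var_name points to an unreadable file: $file_path"
            rc=1
        fi
    done
    return $rc
}
```

A typical call would be `check_env_files ODBCINI`, followed by the name of the variable that holds the driver's own ini file. Directories listed in LD_LIBRARY_PATH are not covered by this check, since that variable holds a list of directories rather than a single file.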
The final step is to install the RODBC package for R. You can do it from the R command line:
    $ R
    > install.packages("RODBC")
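To confirm from the shell that the installation succeeded, one option is to ask R whether the package can be loaded. This is a minimal sketch, assuming `Rscript` is on the PATH (it is normally installed together with R); `r_has_package` is a hypothetical helper name.

```shell
# Return 0 if the named R package can be loaded, non-zero otherwise.
# Usage: r_has_package RODBC
r_has_package() {
    # requireNamespace() returns FALSE instead of raising an error when the
    # package is missing, so the result is turned into an exit status here.
    Rscript -e "if (!requireNamespace('$1', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1
}
```

For example, `r_has_package RODBC && echo "RODBC is installed"` prints the message only when the package is available. A failure here usually means that the build step could not find the unixODBC headers, which is why unixODBC-devel was installed earlier.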