
Logstash, ElasticSearch and Kibana Integration for Clickstream Weblog Ingestion

01.25.2014

In this blog I am going to showcase how to build a quick and easy demo application for clickstream weblog ingestion, search and visualization. We will achieve this using Logstash for log ingestion, ElasticSearch for storage and Kibana for a pretty dashboard. For the clickstream weblogs I am using log data from the ECML/PKDD 2005 Discovery Challenge.

You can download the complete weblogs after registering there. These weblogs are delimited by semicolons (;) and contain the following fields, in this order:

  • shop_id
  • unixtime
  • client ip
  • session
  • visited page
  • referrer

Here are some sample log lines:

15;1075658406;212.96.166.162;052ecba084545d8348806f087b6e09bb;/ls/?&id=77&view=2,6,31&pozice=20;http://www.shop5.cz/ls/?id=77
12;1075658406;195.146.109.248;05aa4f4db0162e5723331042eb9ce8a7;/ct/?c=153;http://www.shop3.cz/
12;1075658407;212.65.194.144;86140090a2e102f1644f29e5ddadad9b;/ls/?id=34;http://www.shop3.cz/ct/?c=155
14;1075658407;80.188.85.210;f07f39ec63abf67f965684f3fa5729c4;/findp/?&id=63&view=1,2,3,14,20,15&p_14=nerez;http://www.shop4.cz/ls/?&p_14=nerez&id=63&view=1%2C2%2C3%2C14%2C20%2C15&&aktul=0
17;1075658408;194.108.232.234;be0970125c4eb3ee4fc380be05b3c58f;/ls/?id=155&sort=45;http://www.shop7.cz/ls/?id=155&sort=45
12;1075658409;62.24.70.41;851f20e644eb8bf82bfdbe4379050e2e;/txt/?c=734;http://www.shop3.cz/onakupu/
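
Before wiring up Logstash, a quick sanity check on the delimiter can save some debugging later. A minimal sketch (not part of the original walkthrough), assuming the downloaded file name used later in the configuration, that counts the fields per line:

# count how many ';'-separated fields each line has (we expect 6 on every line)
awk -F';' '{print NF}' _2004_02_01_19_click_stream.log | sort | uniq -c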

To create this demo we need a Logstash configuration file (let's name it clickstream.conf) that specifies the inputs, filters and outputs. The clickstream.conf file looks like this:

input {
  file {
    # path to the clickstream log
    path => "/home/rishav.rohit/Desktop/clickstream/_2004_02_01_19_click_stream.log"
    # define a type for all events handled by this input
    type => "weblog"
    start_position => "beginning"
    # the clickstream log is in character set ISO-8859-1
    codec => plain { charset => "ISO-8859-1" }
  }
}

filter {
  csv {
    # define the columns present in the weblog
    columns => ["shop_id", "unixtime", "client_ip", "session", "page", "referrer"]
    separator => ";"
  }
  grok {
    # split the visited page into the page path and its parameters
    match => ["page", "%{URIPATH:page_visited}(?:%{URIPARAM:page_params})?"]
    remove_field => ["page"]
  }
  date {
    # the unixtime field is in epoch seconds, so convert it to a normal timestamp
    match => ["unixtime", "UNIX"]
  }
  geoip {
    # convert the IP to latitude/longitude using the GeoLiteCity database from MaxMind
    source => "client_ip"
    fields => ["latitude", "longitude"]
    target => "geoip"
    add_field => ["[geoip][coordinates]", "%{[geoip][longitude]}"]
    add_field => ["[geoip][coordinates]", "%{[geoip][latitude]}"]
  }
  mutate {
    # convert geoip.coordinates to float values
    convert => ["[geoip][coordinates]", "float"]
  }
}

output {
  # store the output in a local ElasticSearch cluster
  elasticsearch {
    host => "127.0.0.1"
  }
}
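
While tuning the filters it can help to also print parsed events to the console. This is an optional addition, not part of the original setup, and assumes the rubydebug codec shipped with Logstash 1.2.x:

output {
  # optional: print each parsed event to the console while debugging the filters
  stdout { codec => rubydebug }
  # store the output in the local ElasticSearch cluster as before
  elasticsearch {
    host => "127.0.0.1"
  }
}
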
To start the Logstash agent we run the command below:

java -jar logstash-1.2.2-flatjar.jar agent -f clickstream.conf
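
As a side note (not part of the original walkthrough), the 1.2.x flat jar also bundles a Kibana web interface. To the best of my knowledge it can be started from the same jar, on port 9292 by default, though the exact flags may differ between versions:

# run only the bundled web UI
java -jar logstash-1.2.2-flatjar.jar web

# or run the agent and the web UI from a single process
java -jar logstash-1.2.2-flatjar.jar agent -f clickstream.conf -- web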

Now the log will be indexed into ElasticSearch. A sample record in ElasticSearch looks like this:

{
    "_index": "logstash-2004.02.01",
    "_type": "logs",
    "_id": "I1N0MboUR0O1O3RZ-qXqnw",
    "_version": 1,
    "_score": 1,
    "_source": {
        "message": "14;1075658407;80.188.85.210;f07f39ec63abf67f965684f3fa5729c4;/findp/?&id=63&view=1,2,3,14,20,15&p_14=nerez;http://www.shop4.cz/ls/?&p_14=nerez&id=63&view=1%2C2%2C3%2C14%2C20%2C15&&aktul=0",
        "@timestamp": "2004-02-01T18:00:07.000Z",
        "@version": "1",
        "type": "weblog",
        "host": "HMECL000315.happiestminds.com",
        "path": "/home/rishav.rohit/Desktop/clickstream/_2004_02_01_19_click_stream.log",
        "shop_id": "14",
        "unixtime": "1075658407",
        "client_ip": "80.188.85.210",
        "session": "f07f39ec63abf67f965684f3fa5729c4",
        "referrer": "http://www.shop4.cz/ls/?&p_14=nerez&id=63&view=1%2C2%2C3%2C14%2C20%2C15&&aktul=0",
        "page_visited": "/findp/",
        "page_params": "?&id=63&view=1,2,3,14,20,15&p_14=nerez",
        "geoip": {
            "latitude": 50.08330000000001,
            "longitude": 14.466700000000003,
            "coordinates": [14.466700000000003, 50.08330000000001]
        }
    }
}
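
If you want to verify the indexed documents outside of Kibana, you can also query ElasticSearch directly over HTTP. A quick check, assuming the default HTTP port 9200 and the index name shown above:

curl 'http://127.0.0.1:9200/logstash-2004.02.01/_search?q=client_ip:80.188.85.210&pretty'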

So we have parsed a complex log message into simpler components, converted fields like unixtime to a datetime and the IP to latitude/longitude, and extracted the page visited by the client. Now, using Kibana, we can quickly build a dashboard with these panels:


This histogram shows the page landing count over different time intervals.


This is a map pointing to client locations.


And in this table we can see the different attributes for each clickstream event.
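
For context, every Kibana panel is backed by an ElasticSearch query. As an illustration (not taken from the original post), a terms facet on the 0.90-era ElasticSearch bundled with Logstash 1.2.2 could produce a "top visited pages" breakdown similar to what a terms panel shows; depending on how page_visited is analyzed you may want a not_analyzed mapping for clean buckets:

curl -XPOST 'http://127.0.0.1:9200/logstash-2004.02.01/_search?pretty' -d '{
  "size": 0,
  "facets": {
    "top_pages": {
      "terms": { "field": "page_visited", "size": 10 }
    }
  }
}'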

Published at DZone with permission of Rishav Rohit, author and DZone MVB.
