Big Data/Analytics Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 524 posts at DZone. You can read more from them at their website. View Full User Profile

Clojure: Mahout’s ‘entropy’ Function

11.08.2012
| 3200 views |
  • submit to reddit

 

As I mentioned in a couple of previous posts Jen and I have been playing around with Mahout random forests and for a few hours last week we spent some time looking through the code to see how it worked.

In particular we came across an entropy function which is used to determine how good a particular ‘split’ point in a decision tree is going to be.

I quite like the following definition:

The level of certainty of a particular decision can be measured as a number from 1 (completely uncertain) to 0 (completely certain).

Information Theory (developed by Claude Shannon 1948) defines this value of uncertainty as entropy, a probability-based measure used to calculate the amount of uncertainty.

For example, if an event has a 50/50 probability, the entropy is 1. If the probability is 25/75 then the entropy is a little lower.

The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications.

The function reads like this:

 private static double entropy(int[] counts, int dataSize) {
    if (dataSize == 0) {
      return 0.0;
    }
 
    double entropy = 0.0;
    double invDataSize = 1.0 / dataSize;
 
    for (int count : counts) {
      if (count == 0) {
        continue; // otherwise we get a NaN
      }
      double p = count * invDataSize;
      entropy += -p * Math.log(p) / LOG2;
    }
 
    return entropy;
  }

We decided to see what the function would look like it was written in Clojure and it was clear from looking at how the entropy variable is being mutated that we’ll need to do a reduce over a collection to get our final result.

In my first attempt at writing this function I started with the call to reduce and then worked out from there:

(defn individual-entropy [x data-size]
  (let [p (float (/ x data-size))]
    (/ (* (* -1 p) (Math/log p)) (Math/log 2.0))))
 
(defn calculate-entropy [counts data-size]
  (if (= 0 data-size)
    0.0
    (reduce
     (fn [entropy x] (+ entropy (individual-entropy x data-size)))
     0
     (remove (fn [count] (= 0 count)) counts))))

Jen was pretty much watching on with horror the whole time I wrote this function but I was keen to see how our approaches differed so I insisted she allow me to finish!

We then moved onto Jen’s version where instead of writing the code all in one go like I did, we would try to reduce the problem to the point where we wouldn’t need to pass a custom anonymous function to reduce but could instead pass a +.

This meant we’d need to run a map over the counts collection to get the individual entropy values first and then add them all together.

(defn calculate-entropy [counts data-size]
  (->>  counts
       (remove #(= 0 %))
       (map #(individual-entropy % data-size))
       (reduce +)))

Here we’re using the threading operator to make the code a bit easier rather than nesting functions as I had done.

Jen also showed me a neat way of rewriting the line with the remove function to use a set instead:

(defn calculate-entropy [counts data-size]
  (->>  counts
       (remove #{0})
       (map #(individual-entropy % data-size))
       (reduce +)))

I hadn’t seen this before although Jay Fields has a post showing a bunch of examples of using sets and maps as functions.

In this case if the set is applied to 0 the value will be returned:

user> (#{0} 0)
0

But if the set is applied to a non 0 value we’ll get a nil back:

user> (#{0} 2)
nil

So if we apply that to a collection of values we’d see the 0s removed:

user> (remove #{0} [1 2 3 4 5 6 0])
(1 2 3 4 5 6)




Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)