

Coding Quickie: Classification Tree Learning

10.09.2012

Have you ever seen a large table of data and thought, "Somebody just vomited numbers on my monitor?" Sometimes it can be hard to make sense of tables like this -- and even more so when the tables are large. There's at least one machine learning technique that will help you make sense of the data: classification trees.

Classification trees work by poring over your data column by column and then hierarchically grouping your data by the unique values in each column. As an example, imagine you had a table of data like this:

Number of Wheels | Number of Doors | Miles/Gallon | Type of Auto
4                | 4               | 21-40        | Car
2                | 0               | 41-60        | Motorcycle
4                | 2               | 11-20        | Truck
16               | 2               | 1-10         | Big Rig
4                | 2               | 21-40        | Car
8                | 2               | 1-10         | Big Rig
4                | 4               | 41-60        | Hybrid Car
In code, I write it like this: 
dataset = {
  "headers"=>["Number of Wheels","Number of Doors","Miles/Gallon", "Type of Auto"],
  "data" => [
    [4, 4,'21-40', 'Car'],
    [2, 0,'41-60', 'Motorcycle'],
    [4, 2,'11-20', 'Truck'],
    [16, 2,'1-10', 'Big Rig'],
    [4, 2,'21-40', 'Car'],
    [8, 2,'1-10', 'Big Rig'],
    [4, 4,'41-60', 'Hybrid Car'],
  ]
}
How would you organize the data to make it more understandable? Note that this example is a bit impractical because the table is pretty small, but the mechanics work the same.

One caveat: the algorithm works off of already-enumerated values in a table. There are no continuous values like 8.132, no names, and no IDs. In trying to use this in a real-world scenario, I'm finding myself preprocessing my data to create the buckets so I can run this algorithm.

The classification algorithm I learned tonight starts by looking at each column of data and deciding which one seems like it will give us the most information gain. Once it determines that, it partitions the table and carries on recursively until we've come up with a full classification tree.

You might be wondering how exactly the algorithm determines which split will give us the most information. That's thanks to a calculation known as Shannon entropy. It's outside the scope of this post, but feel free to read more about it here. Here's the code I ported from Python to Ruby:
def calculate_shannon_entropy data, header_index
  number_of_entries = data.length
  # Count how many times each distinct value appears in the column.
  label_counts = Hash.new(0)
  data.each do |row|
    label_counts[row[header_index]] += 1
  end
  # Entropy is the sum of -p * log2(p) over the frequency p of each value.
  shannon_entropy = 0.0
  label_counts.each_value do |count|
    frequency = count.to_f / number_of_entries
    shannon_entropy -= frequency * Math.log(frequency, 2)
  end
  shannon_entropy
end
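As a quick sanity check, here's what that function reports for each column of the sample dataset above (a usage sketch of mine, assuming the dataset hash from earlier; values rounded to two decimals):
entropies = dataset["headers"].each_with_index do |header, column_index|
  entropy = calculate_shannon_entropy dataset["data"], column_index
  puts "#{header}: #{entropy.round(2)}"
end
# Number of Wheels: 1.66
# Number of Doors: 1.38
# Miles/Gallon: 1.95
# Type of Auto: 2.24
Number of Doors comes in lowest, which is why it ends up as the root split in the tree below.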
With the algorithm I created, the following hierarchy results:
{
  "Number of Doors"=>{
    4=>{
      "Number of Wheels"=>{
        4=>{
          "Miles/Gallon"=>{
            "21-40"=>{
              "Type of Auto"=>[["Car"]]
            },
            "41-60"=>{
              "Type of Auto"=>[["Hybrid Car"]]
            }
          }
        }
      }
    },
    0=>{
      "Number of Wheels"=>{
        2=>{
          "Miles/Gallon"=>{
            "41-60"=>{
              "Type of Auto"=>[["Motorcycle"]]
            }
          }
        }
      }
    },
    2=>{
      "Number of Wheels"=>{
        4=>{
          "Miles/Gallon"=>{
            "11-20"=>{
              "Type of Auto"=>[["Truck"]]
            },
            "21-40"=>{
              "Type of Auto"=>[["Car"]]
            }
          }
        },
        16=>{
          "Miles/Gallon"=>{
            "1-10"=>{
              "Type of Auto"=>[["Big Rig"]]
            }
          }
        },
        8=>{
          "Miles/Gallon"=>{
            "1-10"=>{
              "Type of Auto"=>[["Big Rig"]]
            }
          }
        }
      }
    }
  }
}
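Reading the tree is just a matter of following the keys down the hash. For example (assuming the hierarchy above is assigned to a variable named tree):
tree["Number of Doors"][4]["Number of Wheels"][4]["Miles/Gallon"]["21-40"]["Type of Auto"]
# => [["Car"]]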
This is one of the few problems that very neatly fits a recursive solution. Why this one? Well, a couple of reasons:
  1. You'll notice we're building a tree from this hierarchy. Trees and many of the solutions that involve them are self-similar, meaning whatever you do at one level of the tree you'll need to do at every level. In this case, that's partitioning our dataset into many sub-datasets.
  2. The recursion depth is fairly well limited by the context of the problem. You'll only ever run into trouble on datasets with columns that number in the thousands. If that's you, you'll need to unwind the recursion into a stack-based design (see the sketch just below). Have fun! :)
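To make that second point concrete, here's a minimal sketch of a stack-based walk. It doesn't rework categorize_dataset itself; it just traverses a finished tree like the one above and prints each leaf (each_leaf is my own hypothetical helper, not part of the algorithm):
def each_leaf tree
  # Each stack entry pairs a node with the path of keys taken to reach it.
  stack = [[tree, []]]
  until stack.empty?
    node, path = stack.pop
    if node.is_a? Hash
      # Interior node: push each child along with its extended path.
      node.each { |key, child| stack.push [child, path + [key]] }
    else
      # Leaf: an array of rows, e.g. [["Car"]].
      puts "#{path.join(' -> ')} => #{node.flatten.join(', ')}"
    end
  end
end

each_leaf tree
# Prints lines like:
# Number of Doors -> 4 -> Number of Wheels -> 4 -> Miles/Gallon -> 21-40 -> Type of Auto => Car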
The main categorize_dataset code below is pretty tacky for Ruby, but I'm posting it anyway. Suggest how I can clean it up!
def categorize_dataset dataset
  column_count = dataset["headers"].length
  # Base case: only the label column remains, so return its rows as a leaf.
  if column_count == 1
    return {dataset["headers"].first => dataset["data"]}
  end

  # Find the column with the lowest entropy to split on.
  min_entropy_value = Float::INFINITY
  best_index_to_split = 0

  (0..column_count - 1).each do |column_index|
    entropy_value = calculate_shannon_entropy dataset["data"], column_index
    if entropy_value < min_entropy_value
      min_entropy_value = entropy_value
      best_index_to_split = column_index
    end
  end

  split_header = dataset["headers"][best_index_to_split]
  puts "Splitting on column #{split_header} with entropy of #{min_entropy_value}"
  the_split_dataset = split_the_dataset dataset, best_index_to_split

  tree = {
    split_header => the_split_dataset
  }

  # Recurse into each partition until only the label column is left.
  the_split_dataset.each do |value_key, value_dataset|
    tree[split_header][value_key] = categorize_dataset value_dataset
  end

  tree
end
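One function I haven't shown is split_the_dataset. Here's a minimal sketch of what it has to do to produce the output above: partition the rows by their value in the chosen column and drop that column from each partition (my reconstruction, not code from the original post):
def split_the_dataset dataset, column_index
  # Sub-dataset headers: everything except the split column.
  remaining_headers = dataset["headers"].reject.with_index { |_, index| index == column_index }
  partitions = {}
  dataset["data"].each do |row|
    value = row[column_index]
    remaining_row = row.reject.with_index { |_, index| index == column_index }
    # Start a fresh sub-dataset the first time we see this value.
    partitions[value] ||= {"headers" => remaining_headers, "data" => []}
    partitions[value]["data"] << remaining_row
  end
  partitions
end
With that helper in place, categorize_dataset dataset should reproduce the hierarchy shown earlier.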
To give credit where it's due, I sat down with Machine Learning in Action tonight, and while I didn't copy the algorithm word for word, it was extremely helpful. This is one of the few books on an advanced topic that seems able to convey the author's knowledge well. Some of the more experienced among you might notice that the way I evaluate where to split the data differs from the standard approach (at least as presented in the book). I'm okay with that, but I might clean it up and straighten it out if the mood strikes me.
Published at DZone with permission of Justin Bozonier, author and DZone MVB.