Big Data/Analytics Zone is brought to you in partnership with:

Justin Bozonier is the Product Optimization Specialist at GrubHub formerly Sr. Developer/Analyst at Cheezburger. He's engineered a large, scalable analytics system, worked on actuarial modeling software. As Product Optimization Specialist he is currently leading split test design, implementation, and analysis. The opinions expressed here represent my own and not those of my employer. Justin is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Algorithm of the Week: Genetic Algorithms, Part I - Chromosomes

05.28.2013
| 9851 views |
  • submit to reddit

Click for the live example of what I’ll be covering

What are Genetic Algorithms? (Review)

Chromosome: executable code
Fitness Function: quantifies quality of executable code
Mutations: introduces random novel changes into the code. Sometimes these are terrible, sometimes brilliant.
Breeding: helps the algorithm iteratively improve the chromosome towards a more correct solution.
Generation: a given iteration of N chromosomes. Competition for survival occurs within each generation.

The whole point of a genetic algorithm is to evolve a set of chromosomes testing them against a given fitness function in the hopes of finding a chromosome with a good enough solution.

The problem I’m tackling

I want to write an algorithm that can integrate a given math function. Not a definite integral. I’ve already written an article on how to achieve that with a Monte Carlo simulation. No, I mean for my algorithm to come up with a new function that is a good approximate indefinite integral for the equation provided. If you are familiar with indefinite integrals and all of the special cases and techniques you’ll recognize how hard this seems.

How do I plan on tackling it with genetic algorithms? The key lies in the fitness function. I plan to use the Monte Carlo definite integral calculator from my previous post to score the chromosomes which evolve. Using that my hope is that eventually a chromosome will happen upon either a great approximation OR an exactly correct solution.

I’ve never tried this before and I’ve not been able to find articles on others trying this either. So either it’s an interesting at least somewhat novel approach OR… it’s a terrible TERRIBLE idea. :)

Focusing on Chromosomes

This endeavor was a lot more work than I planned. It seems worth dealing with just one piece in this article and the rest in the next article. That means that this article will revolve around lexing and parsing a chromosome to build what’s known as an Abstract Syntax Tree that can be used to evaluate the chromosome for given values of “x”.

A long time ago I learned an interesting technique for codifying mathematical formula into chromosomes. It basically requires you to write out the formula from left to right as a string using individual letters for different mathematical operators. I’ve added a little tweak of my own to allow me to insert arbitrary numbers in the chromosomes. Here are a few simple examples and what their mathematical translation is:

  • “n0” -> 0
  • “+n1n2.1” -> 1 + 2.1
  • ”+n4qn5n11” -> 4 + (5 sqrt(11))

For the last example it might help to know what the letters mean. You may have guessed that any number preceded by an “n” is a constant. + and * represent addition and multiplication respectively. So what’s “q”? “q” represents the square root of a number.

Here’s a full set of definitions:

q = sqrt(x)
+ = addition
- = subtraction
/ = division
* = multiplication
s = sin(x)
c = cos(x)
t = tan(x)
^ = pow(x,y)
n[0-9]+ = [0-9]+
x = input value

What’s the x? That’s how I plan to make the formula that is generated by the algorithm a general use function. For example:

  1. “^x2” -> x^2
  2. “+n2q+^n1xn2” -> 2 + sqrt(x^2 + 1)

So when I evaluate the first chromosome at 2 I should get 4. Evaluating the second at 2 should get 4.236. You can cut to the live example here: Click here to open the example

How to evaluate the chromosomes

There are different ways to parse these kind of chromosomes. My way is to read them in from right to left. Let’s start with a simple example:

*+-n1n2n3n4

Steps:

  1. I see a constant with the value 4. I’ll write that down on my scratch pad: 
    1. Scratch pad: 4
  2. There’s a constant of 3. Writing it down 1. Scratch pad: 4,3
  3. There’s a constant of 2. 
    1. Scratch pad: 4,3,2
  4. Finally, the last constant, number 1. 
    1. Scratch pad: 4,3,2,1
  5. I get to my first operator. It’s subtraction and it takes two terms. A left term and a right term. I take the first two terms I found (the constants 4 and 3) and do the math. 3-4 is -1. 
    1. Scratch pad: 2,1,-1
  6. The next operator I hit is addition. I take the next two terms and add them to get 1+2 = 3 
    1. Scratch pad: -1,3
  7. Finally I’m left with the final operator. I multiply -1 and 3 and get -3 as my answer.

I won’t lie: This isn’t the easiest most straightforward calculation to do. While building out my parser I ran into issues with my own misunderstanding multiple times.

With this in mind let’s begin our discussion of what’s involved in parsing and executing a simple blob of code.

The steps to parsing

  1. Lexing- Lexing is where we determine how to group our characters of a string into “words”. For example, in the above example “n1” really just represents the number 1. After lexing the text “n1” will be turned into a small object that will have some metadata around what it is and what it’s value is.
  2. Parsing- This is where we take the output from our lexical analysis and form it into an executable tree.
  3. Evaluating- Here we have a tree and can recursively call some sort of evaluation function on the root node to get a result.

Lexing

When lexing we go through our input string character by character and just group related bits together if it helps us to do our job. In this case I really just would like to categorize the difference between operators, constants, and input constants. I’d also like to group all the characters of the constants into a single token to make life easier when I get to the parsing stage.

Here’s all of my lexer:

var lex_this = function(chromosome){
    var tokens = [];
    chromosome = chromosome.split('').slice(0);

    var buffer = '';
    for(var index=chromosome.length-1; index >= 0; index--){
      var character = chromosome[index];
      console.log(character);
      switch(character){
        case '*':
          tokens.push({'type':'operator', 'value':'*'});
          break;
        case '/':
          tokens.push({'type':'operator', 'value':'/'});
          break;
        case '+':
          tokens.push({'type':'operator', 'value':'+'});
          break;
        case '-':
          tokens.push({'type':'operator', 'value':'-'});
          break;
        case 'n':
          tokens.push({'type': 'constant', 'value':parseFloat(buffer)});
          buffer = '';
          break;
        case 'x':
          tokens.push({'type': 'input_constant'});
          break;
        case 'q':
          tokens.push({'type': 'operator', 'value':'q'});
          break;
        case 's':
          tokens.push({'type': 'operator', 'value':'s'});
          break;
        case 'c':
          tokens.push({'type': 'operator', 'value':'c'});
          break;
        case 't':
          tokens.push({'type': 'operator', 'value':'t'});
          break;
        case '^':
          tokens.push({'type': 'operator', 'value':'^'});
          break;
        default:
          var constant_pattern=/[0-9\.]/
          if(constant_pattern.test(character)){
            buffer = character + buffer;
            console.log('Buffer is now: ' + buffer);
          } else {
            throw 'Unexpected value of ' + character;
          }
          break;
      }
      
    }
    return tokens;
};

You might notice I’m iterating over it in reverse. That was easier for me personally. If you’d rather try it from the opposite direction that’s completely valid.

The key take away from this is going character by character and inferring whatever is easy for you to at the current level of abstraction. In more complex problems you might even introduce another lexer type phase before getting to your parser just because it’s easier to iterate over your list of tokens.

Parsing

Now that there is a token for each concept (operator, constant, input constant) it should be easier to create a tree from this information.

Everything in my definition list needs to be an evaluatable node in my tree. Even the constants. By converting everything to a node I can establish a consistent interface to evaluate them and use polymorphism to handle the specific details for each one.

Here is the complete parser code:

var parse_these = function(tokens){
  tokens = tokens.slice(0);

  var nodes_to_assemble = [];

  for(var index in tokens){
    var token = tokens[index];
    switch(token.type){
      case 'operator':
        switch(token.value){
          case '+':
            var right_term = nodes_to_assemble.shift();
            var left_term = nodes_to_assemble.shift();
            var operator_node = new AdditionOperatorNode(left_term, right_term);
            nodes_to_assemble.push(operator_node);
            break;
          case '-':
            var right_term = nodes_to_assemble.shift();
            var left_term = nodes_to_assemble.shift();
            var operator_node = new SubtractionOperatorNode(left_term, right_term);
            nodes_to_assemble.push(operator_node);
            break;
          case '*':
            var right_term = nodes_to_assemble.shift();
            var left_term = nodes_to_assemble.shift();
            var operator_node = new MultiplicationOperatorNode(left_term, right_term);
            nodes_to_assemble.push(operator_node);
            break;
          case '/':
            var right_term = nodes_to_assemble.shift();
            var left_term = nodes_to_assemble.shift();
            var operator_node = new DivisionOperatorNode(left_term, right_term);
            nodes_to_assemble.push(operator_node);
            break;
          case '^':
            var right_term = nodes_to_assemble.shift();
            var left_term = nodes_to_assemble.shift();
            var operator_node = new PowerOperatorNode(left_term, right_term);
            nodes_to_assemble.push(operator_node);
            break;
          case 'q':
            var term = nodes_to_assemble.shift();
            var operator_node = new SquareRootOperatorNode(term);
            nodes_to_assemble.push(operator_node);
            break;
          case 's':
            var term = nodes_to_assemble.shift();
            var operator_node = new SineOperatorNode(term);
            nodes_to_assemble.push(operator_node);
            break;
          case 'c':
            var term = nodes_to_assemble.shift();
            var operator_node = new CosineOperatorNode(term);
            nodes_to_assemble.push(operator_node);
            break;
          case 't':
            var term = nodes_to_assemble.shift();
            var operator_node = new TangentOperatorNode(term);
            nodes_to_assemble.push(operator_node);
            break;
        }
        break;
      case 'constant':
        var constant = new ConstantTermNode(token.value);
        nodes_to_assemble.push(constant);
        break;
      case 'input_constant':
        var input_constant = new InputConstantTermNode();
        nodes_to_assemble.push(input_constant);
        break;
      default:
        throw 'Unexpected token type ' + token.type;
        break;
    }
  }
  if(nodes_to_assemble.length > 1){
    throw 'Invalid AST. Ambiguous root node';
  }
  return nodes_to_assemble[0];
};

Again, I’m just working one by one but this time on my tokens instead of characters from a string. Notice I’m using the node type in the loop to group the operators, constants, and input constants together. I could add on a little more info about operators to enable me to reuse code amongst my two term operators. Also note how I’m basically just finding terms, throwing them into a queue and then using them to build new terms as I go? It’s the exact same process as the 7 steps above but instead of using a “scratch pad” I have a list of “nodes_to_assemble”. Javascript comes with queues supported. I can push onto an array and then use the shift function to get the first item in the array out and removed from it. This makes it so that when I come across an operator that wants two terms I just call shift() twice and use the results to create a new node. Once I have the new node created I just stick it back in the queue. From there our new node will be treated just like the constants and pulled into some other node to be evaluated.

There are some special cases to consider here as well. What happens if we come across a token that tells us to create a two term operator node but when we shift there is only one node in the list (or worse yet, none)? I haven’t added special handling for this yet but this would mean our chromosome was inproperly formatted. Since we will be randomly generating the chromosomes we will want to make our tree creation robust against errors but I’m saving that for later.

Another special case is when there are no more tokens left and only one node left in our nodes_to_assemble list. That’s actually what we want. It means we’re completely done and our tree has been created. We return that final node as the root node of our tree and call it a day.

Evaluating

Here’s a quick example of some different nodes I’ve created. Look for what’s similar and what’s different.

var SquareRootOperatorNode = function(node){
  return {
    'evaluate': function(input_constant){
      return Math.sqrt(node.evaluate(input_constant));
    }
  }
};
var SineOperatorNode = function(node){
  return {
    'evaluate': function(input_constant){
      return Math.sin(node.evaluate(input_constant));
    }
  }
};
var CosineOperatorNode = function(node){
  return {
    'evaluate': function(input_constant){
      return Math.cos(node.evaluate(input_constant));
    }
  }
};
var TangentOperatorNode = function(node){
  return {
    'evaluate': function(input_constant){
      return Math.tan(node.evaluate(input_constant));
    }
  }
};
var ConstantTermNode = function(constant_value){
  return {
    'evaluate': function(input_constant){
      return constant_value;
    }
  };
};
var InputConstantTermNode = function(){
  return {
    'evaluate': function(input_constant){
      return input_constant;
    }
  }
};

So. Similarities? Every node has an “evaluate” function on it that takes in a single parameter called “input constant”. Why is that? Well since we won’t know what the input constant is until run time AND we won’t know if a leaf on our tree needs the input constant provided to it, it’s easier to just give the input constant to everbody. Imagine we have a SquareRootOperatorNode. If we made it so that constants were called in a different way, we’d need to check IF the given node was a constant and then the operator would need special handling for it. This way, the operator can just pass down the “input_constant” and be ignorant of whether or not the value is actually used.


Note: All views expressed here are my own and not those of my employer.

Published at DZone with permission of Justin Bozonier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Geoffrey De Smet replied on Tue, 2013/05/28 - 4:30am

We've experimented with Genetic Algorithms to solve planning/scheduling problems in OptaPlanner (an open source planning engine in Java, renamed from Drools Planner).

But - so far - other optimization algorithms (most note-ably Tabu Search and Simulated Annealing) dwarf them in results, performance and scalability as shown in this image (higher is better):

See this GA blog article for detailed information.

Planning competitions rankings seem to indicate the same conclusion. For example in ITC2007 exam scheduling, the best 4 used Local Search (such as Tabu Search) and 5th used Genetic algorithms. That being said, genetic algorithms do have a very wide applicability opportunity, beyond typical planning/scheduling problems.

Justin Bozonier replied on Tue, 2013/05/28 - 8:44am in response to: Geoffrey De Smet

That makes a ton of sense. One of the biggest value of learning to wield Genetic Algorithms is in building a program that can solve problems on its own. Production quality machine learning is just as much art as it is science.

Thanks for sharing!

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.