I'm a Principal Engineer for Red Hat, as part of the Open Source and Standards Group.

Java Bayesian Classifier ci-bayes 1.0 released

06.19.2008
ci-bayes, a project hosted on java.net, has released its first stable version. ci-bayes allows the use of a classifier to determine what classification a given object might fall into, given prior training, and provides multiple classifiers, hooks for persistence, and results for multiple classifications for each object tested.

ci-bayes is based on the chapter on Bayesian classification from Toby Segaran's "Programming Collective Intelligence," and has been ported from the original Python with the explicit permission of the author.

ci-bayes is built with Maven 2, and has an explicit runtime dependency on Javolution; it provides factories for use with Spring 2, but those aren't required at runtime in the simplest case.

A simple example of how the classifier works might look like this:

FisherClassifier fc = new FisherClassifierImpl();
fc.train("The quick brown fox jumps over the lazy dog's tail", "good");
fc.train("Make money fast!", "bad");
String classification = fc.getClassification("money"); // should be "bad"
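
For readers curious about what happens behind a call like this, here is a minimal, self-contained sketch of the Fisher method along the lines of Segaran's chapter. It is not the ci-bayes source: the class name ToyFisherClassifier, its method names, the letters-only tokenizer, and the assumed prior of 0.5 with a weight of 1 are all illustrative choices.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy Fisher-method text classifier; illustrative only, not the ci-bayes source. */
final class ToyFisherClassifier {
    // feature -> (category -> number of training items in that category containing the feature)
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    // category -> number of training items in that category
    private final Map<String, Integer> categoryCounts = new HashMap<>();

    /** Lower-cases the text and returns its distinct runs of the letters a-z. */
    static Set<String> features(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    void train(String text, String category) {
        for (String f : features(text)) {
            featureCounts.computeIfAbsent(f, k -> new HashMap<>()).merge(category, 1, Integer::sum);
        }
        categoryCounts.merge(category, 1, Integer::sum);
    }

    private int count(String feature, String category) {
        Map<String, Integer> perCategory = featureCounts.get(feature);
        if (perCategory == null) return 0;
        Integer n = perCategory.get(category);
        return n == null ? 0 : n;
    }

    /**
     * Probability that an item containing the feature belongs to the category,
     * pulled toward an assumed 0.5 when the feature has rarely been seen.
     */
    private double weightedProbability(String feature, String category) {
        double inCategory = (double) count(feature, category) / categoryCounts.get(category);
        double acrossAll = 0.0;
        int total = 0;
        for (String c : categoryCounts.keySet()) {
            acrossAll += (double) count(feature, c) / categoryCounts.get(c);
            total += count(feature, c);
        }
        double basic = acrossAll == 0.0 ? 0.0 : inCategory / acrossAll;
        return (0.5 + total * basic) / (1.0 + total);
    }

    /** Inverse chi-square function used to turn the combined log score into a 0..1 value. */
    static double inverseChiSquare(double chi, int degreesOfFreedom) {
        double m = chi / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < degreesOfFreedom / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    /** Fisher score of the text for one category; 0.0 for a category never trained. */
    double fisherScore(String text, String category) {
        if (!categoryCounts.containsKey(category)) return 0.0;
        Set<String> fs = features(text);
        double logSum = 0.0;
        for (String f : fs) logSum += Math.log(weightedProbability(f, category));
        return inverseChiSquare(-2.0 * logSum, 2 * fs.size());
    }

    /** Returns the category with the highest Fisher score, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = 0.0;
        for (String c : categoryCounts.keySet()) {
            double s = fisherScore(text, c);
            if (s > bestScore) { best = c; bestScore = s; }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyFisherClassifier fc = new ToyFisherClassifier();
        fc.train("The quick brown fox jumps over the lazy dog's tail", "good");
        fc.train("Make money fast!", "bad");
        System.out.println(fc.classify("money"));
    }
}
```

With only those two training sentences, this toy version scores "money" at 0.75 for "bad" and 0.25 for "good", so it also answers "bad". The real library's numbers may differ, since it uses its own word lister and settings.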

 

Currently, ci-bayes uses the SpamAssassin testing corpora for performance and accuracy testing. The methodology is fairly simple: it first trains itself, following the SpamAssassin conventions, on seven of the ten corpora, then runs the remaining three corpora through the classifier to see whether each result matches what SpamAssassin generated.
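
The shape of that test can be sketched in a few lines. This is only an illustration of the train-on-some, test-on-the-rest idea, not the actual ci-bayes test harness: the HoldoutSketch class, the Sample type, and the way the classifier is passed in as a pair of functions are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hold-out evaluation sketch: train on the first corpora, test on the rest. */
final class HoldoutSketch {
    /** A message plus the label the reference tool assigned to it. */
    static final class Sample {
        final String text;
        final String label;
        Sample(String text, String label) { this.text = text; this.label = label; }
    }

    /**
     * Trains on the first trainingCorpora corpora, classifies every sample in the
     * remaining corpora, and returns the fraction whose result equals the reference label.
     */
    static double matchRate(List<List<Sample>> corpora, int trainingCorpora,
                            BiConsumer<String, String> train,
                            Function<String, String> classify) {
        for (List<Sample> corpus : corpora.subList(0, trainingCorpora)) {
            for (Sample s : corpus) train.accept(s.text, s.label);
        }
        int tested = 0;
        int matched = 0;
        for (List<Sample> corpus : corpora.subList(trainingCorpora, corpora.size())) {
            for (Sample s : corpus) {
                tested++;
                if (s.label.equals(classify.apply(s.text))) matched++;
            }
        }
        return tested == 0 ? 0.0 : (double) matched / tested;
    }

    public static void main(String[] args) {
        // Stand-in "classifier" that memorizes exact texts; a real run would plug in
        // the classifier under test here instead.
        Map<String, String> memory = new HashMap<>();
        List<List<Sample>> corpora = Arrays.asList(
                Arrays.asList(new Sample("make money fast", "spam")),
                Arrays.asList(new Sample("make money fast", "spam"),
                              new Sample("lunch on friday?", "ham")));
        double rate = matchRate(corpora, 1, memory::put, t -> memory.getOrDefault(t, "spam"));
        System.out.println("match rate: " + rate);
    }
}
```

For the setup described above, the call would pass ten corpora with a training count of seven.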

It's able to run the classification tests in just over eleven seconds on a single CPU core, with a 98% match with SpamAssassin; given that SpamAssassin and ci-bayes have different classification mechanisms and different functions, this is probably acceptable for most uses. (SpamAssassin uses a neural network to analyze spam; it's not a strict Bayesian classifier, so 98% accuracy is, in my opinion, a marvelous result.)

The binary jar for ci-bayes-1.0-SNAPSHOT is available on java.net.

Published at DZone with permission of its author, Joseph Ottinger.


Comments

Raphael Valyi replied on Sun, 2008/06/22 - 8:08am

Hi,

How does ci-bayes compare to the Weka Bayesian classifier here: http://www.cs.waikato.ac.nz/~ml/weka/ ?

Why a new project?

Thanks,

 

Raphaël Valyi.

Joseph Ottinger replied on Sun, 2008/06/22 - 10:30am

Well, it doesn't present itself as a GUI for the experimenter or explorer. It also has a much simpler data model (from what I can see) and aims to be very direct about what it does. I remember looking at Weka at some point and not grasping what it was trying to tell me.

For me, Bayesian analysis should be simple: I train the system with what I want, and I get a result back. No special files, no special syntax, no jargon (unless you count "corpus" and "corpora" as jargon, which is fair enough!), and it's straightforward.

That doesn't mean Weka isn't worth investigating; I just felt that it didn't satisfy my needs, so I wrote my own, again with permission from Toby Segaran.

mr Plow replied on Tue, 2008/07/08 - 7:15am

When I try to serialize an instance of com.enigmastation.classifier.impl.FisherClassifierImpl I get this:

java.io.NotSerializableException: com.enigmastation.classifier.impl.StemmingWordLister

What am I doing wrong?  Thanks.

mr Plow replied on Tue, 2008/07/08 - 4:44pm

java.lang.NullPointerException
        at com.enigmastation.classifier.persistence.Serializer.load(Serializer.java:36)

Joseph Ottinger replied on Wed, 2008/07/09 - 7:35am

The direct serialization stuff is broken at the moment, acknowledged. Create an issue for it and I'll see what I can do - but honestly, serializing the classifier itself is not a very good idea. It's much better to use the serialization event notification system.
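
As general Java background on the NotSerializableException above: the exception is thrown when a Serializable object holds a non-transient field whose class does not itself implement Serializable. The sketch below reproduces that with made-up classes (WordLister, BrokenClassifier, and WorkingClassifier are hypothetical stand-ins, not ci-bayes types), and it is not a proposed fix for the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Shows why a Serializable class can still fail to serialize. */
final class SerializationSketch {
    /** Hypothetical helper that does not implement Serializable. */
    static final class WordLister { }

    /** Serializable owner with a non-serializable field: writeObject fails. */
    static final class BrokenClassifier implements Serializable {
        private static final long serialVersionUID = 1L;
        final WordLister lister = new WordLister();
    }

    /** Same shape, but the helper is transient, so serialization skips it. */
    static final class WorkingClassifier implements Serializable {
        private static final long serialVersionUID = 1L;
        final transient WordLister lister = new WordLister();
    }

    /** Returns true if the object can be written with standard Java serialization. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("broken:  " + canSerialize(new BrokenClassifier()));
        System.out.println("working: " + canSerialize(new WorkingClassifier()));
    }
}
```

Note that a transient field comes back as null after deserialization, so whatever it held has to be rebuilt by hand.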

mr Plow replied on Wed, 2008/07/30 - 6:48am

getClassification() seems to require two string values.  What is the second value for?

Joseph Ottinger replied on Wed, 2008/07/30 - 8:33am in response to: mr Plow

The second parameter would be the "default classification" if there's no strong match.
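
As a rough illustration of the idea (not the actual ci-bayes logic): a classifier computes a score per category, and the default is what comes back when no score clears some minimum. The class name, the pick method, and the explicit minimumScore parameter below are assumptions made for the sketch; what counts as a "strong match" inside ci-bayes isn't spelled out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of falling back to a default classification when no score is strong enough. */
final class DefaultClassificationSketch {
    /**
     * Returns the highest-scoring category whose score is strictly greater than
     * minimumScore, or fallback when no category clears that bar.
     */
    static String pick(Map<String, Double> scores, double minimumScore, String fallback) {
        String best = fallback;
        double bestScore = minimumScore;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("good", 0.30);
        scores.put("bad", 0.45);
        System.out.println(pick(scores, 0.40, "unknown")); // "bad" clears the 0.40 bar
        System.out.println(pick(scores, 0.50, "unknown")); // nothing clears 0.50
    }
}
```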

Aliaksandr Maka... replied on Fri, 2008/11/21 - 3:19am

Can't I remove a category?
