Geertjan is a DZone Zone Leader and has posted 467 posts at DZone. You can read more from them at their website. View Full User Profile

Interview: Wilfred Springer on Bit Syntax for Java

09.10.2009
| 6545 views |
  • submit to reddit

Wilfred Springer, Chief Scientist at Xebia in the Netherlands, is creating a framework for dealing with binary encoded data. That framework is called Preon and, below, Wilfred explains its purpose, its architecture, and how to get started using it yourself.

Can you first tell us a little about yourself, your background, and your interests in programming?

My name is Wilfred Springer. I like code. I just like the way it allows you to break down complex problems in increasingly simple pieces, and then reassemble that into new 'stuff', purely by the power of the mind.

I have been around in the Java world for ages. More than 10 years ago, I joined Sun's Java Center Practice, mostly because my employer back then was not able to get me on a Java project. After six years at Sun, I joined TomTom, where we basically grew a marvelous Java team from the ground up. At TomTom, most of our work revolved around creating services facilitating the navigation device, and services around consuming massive amounts of data from the device. This is basically where Preon was born.

More than anything else in the Java world, I like its community. It is huge! And there is so much stuff going on you can't see straight. I like that, and I also try to contribute my own fair share. I am in charge of a couple of open source projects, and I speak quite frequently at several Java conferences, mostly about open source projects I am involved in.

What's Preon, in one sentence?

Preon is a declarative data binding framework for binary encoded data. Preon is to binary encoded what JAXB is to XML, and Hibernate to relational databases.

What's the problem that Preon tries to solve? Who is it supposed to help?

Contrary to popular widespread belief, not all data on your disk and traversing the network is encoded in XML or another text-based type of encoding. In fact, I did a quick survey, and found that most of the files on my hard disk that are encoded in a non-binary file format are the files that I have on my disk because I develop in Java. Think source Java files, HTML-based Javadoc, configuration for Java-based middleware, etc. I did the same survey on my wife's hard disk, and found out that almost all of her files are binary encoded files. Yet, somehow we seem to have a strong preference for text-based encoding formats in the Java world. You wonder why.

I can think of at least one reason why: Java is notoriously bad at dealing with binary encoded data. You disagree? I refer to bug 4504839 and rest my case. That bug report is about Java not supporting unsigned int. Dealing with unsigned int data in Java is just quite painful. You need to do a lot of masking and shifting in order to be able to deal with it in a sensible way. And that sucks, since almost all binary encoded data formats I know of will constantly read unsigned data all over the place. Just dig up the description of an arbitrary file format on Wikipedia, and you will be able to confirm that.

However, support for unsigned int is not the only thing missing from Java. Another feature sorely missing is support for bitstream encoded data. There are plenty of file formats that rely on bit streams rather than byte streams. In fact, in these cases, the bit stream is often not well-aligned with the byte stream. So, a sequence of bits representing a certain attribute may very well cross the boundaries of a single byte. And Java does not have anything to help you with there. Consequently, dealing with this puts a tremendous burden on the developer.

Having said all of that, the complexity of dealing with binary encoded data is not just a Java problem. It's a general problem. At TomTom, we had two separate code bases for dealing with the binary map file format: one code base for decoding the data, and one for encoding the data. Keeping these two code bases and the documentation in sync, required a tremendous effort. It's not hard to see how that is undesirable.

All of this is solved by Preon. Preon allows you to capture the structure of the encoded binary format once, and the mapping to an in-memory data structure once, and get the decoder, the encoder (coming soon), and a description of the file format free of charge.

How does it compare to its competitors?

There really isn't all that much competition in this space, unless you talk about other languages. First of all, there are some academic initiatives to capture the bitstream format in a DSL. This includes the Flavor family of languages (Flavor, BFlavor, gFlavor), and BSDL (Bitstream Syntax Description Language). Most of that work comes from the media format world, MPEG in particular.

Flavor is probably the best example of this approach. It defines a text (not XML) based grammar for capturing the structure of a bit stream, and has a couple of tools for turning this into Java and C APIs. Basically, not unlike Lex and YACC or ANTLR.

Although Preon shares the same objective as Flavor, the approach is radically different. Flavor starts with the syntax description. Preon starts with the in-memory data structure. Flavor will generate code, and you just have to deal with that. Preon starts with the source code, and builds the decoder at runtime. Flavor is closed. You cannot extend Flavor without touching its source code. Preon is extensible. It has a number of plug points to add support for your own compression techniques.

Preon also shares some commonalities with Erlang's bit syntax. (At OOPSLA 2009 it will be presented as Bit Syntax for Java.) The difference is: bit syntax support is built into the language in Erlang; Preon offers it as a library built on top of the language. Erlang's bit syntax support is not extensible. Preon's support is extensible.

Can you describe the architecture of Preon?

There are several layers inside Preon. At the bottom layer, it provides a ByteBuffer type of abstraction for accessing bit stream data, called the BitBuffer. That's the abstraction that allows you to quickly access data as a bit stream, relying on the framework to maintain a pointer to the current location and deal with concurrent access.

Figure 1. Preon Architecture

Preon Architecture

The declarative data binding framework is layered on top of that. You declare the mapping from the in-memory data structure to its bit stream representation using annotations. At runtime, you pass a reference to the root class of that data structure to Preon, and it will return you a Codec for the entire data structure. Once you have the Codec instance, using it is simple: you simply use one of the convenience methods on the Codecs class, and it will return you an instance of the data structure based on the input source passed in.

Example 1. Preon decoding a Color

class Color {
@Bound byte red;
@Bound byte green;
@Bound byte blue;
}

Codec<Color> codec = Codecs.create(Color.class);
Color color = Codecs.decode(codec, new byte[] { 0x12, 0xff, 0x34 });

There are a couple of important things to know about Preon. First of all, when you ask Preon for a Codec, it will actually return a chain of Codecs. The Codec you get, will delegate to other Codecs with more precise responsibilities. (See Figure 2, "Codecs delegating to Codecs".) So, if you happen to have a Codec of Color, and the color model is based on the RGB model, then that Codec of Color will most likely internally delegate to three Codecs for bytes: one for the value of the red component, one for the green component, and one for the blue component.

This is one of the reasons why Preon is extensible. If you need to decode data in a way that is not supported yet, you can just plug in a component that will create instances of your own Codec where you need them.

Figure 2. Codecs delegating to Codecs

Codecs delegating to Codecs

Another thing important to know about Preon is that it lets you specify dependencies between the different parts of your encoded representation. For instance, it is quite common in binary encoding formats to first read the number of items in a list, and then the actual items. In Preon you can specify that relationship in your annotations, using an expression language made specifically for Preon.

Third, Preon will produce a Codec that will load data lazily as much as possible. If the format allows you to skip over a list of items, and you are not reading items from that list yet, then Preon will return a List, but then skip over the data in that List. Only if you need an item from that List, it will go back and fetch it from the underlying bit stream. Now, the TomTom map file serves as an excellent example why this is necessary: on disk it already uses 1.5GB of space. Imagine decompressing that compressed file into memory in a single pass. It would surely blow the top of the Java heap on a 32-bit VM.

So, these are just some of the features Preon has to offer. There's just one more thing I want to mention here: Preon also generates hyperlinked documentation from your 'syntax description'. The ambition is to have it generate a description that is comparable with the human-written descriptions you find on Wikipedia, and Preon is already coming pretty close to that.

Figure 3. Documentation generated by Preon


How can I make a plugin for Preon? What kind of plugins would be useful to make?

Preon has a couple of plug points. As I said earlier, it is quite easy to create support for your own specific type of compression. Say you want support for Huffman coded data. In that case, you would:

  1. First of all, define the annotations that would trigger the creation of the Huffman Codec when it encounters a Huffman coded attribute. That's fairly straightforward. You just need to define the attributes that are required to configure the Huffman Codec, such as a reference to the binary tree defining the way the data is encoded. Let's call this annotation @HuffmanCoded.

  2. Next, you need to implement a CodecFactory. Remember, we want to make sure that whenever Preon encounters an @HuffmanCoded attribute on a class, it will create an instance of the Huffman Codec. By implementing a CodecFactory, you basically create a component that will get a chance to recognize attributes of this type, and create an instance of that Codec.

  3. Third, you need to create the actual Huffman Codec. The Codec interface is not extremely complicated, so most of the work will be in the implementation of the decode() operation, but it all depends on the complexity of the encoding technique that you are implementing.

  4. As part of implementing the Codec, you also need to implement a CodecDescriptor. That is, if you want to make sure Preon spits out some useful documentation. If you don't care about that, then you can just return a default CodecDescriptor.

  5. Last but not least, you need to pass a reference to the CodecFactory if you create an instance of your Codec for a certain data structure. Preon will take that CodecFactory and add it to the list of CodecFactories it already supports out of the box. It will now recognize the need for creating an instance of your Huffman Codec once it encounters the @HuffmanCoded annotation.

And that's it. Is that a lot of work? I don't think so, but what is more important is that you not only added support for Huffman coding for a particular file format, but for all file formats that rely on Huffman coded Strings. How cool is that?

How/when did you decide to make Preon?

We started work on Preon at TomTom when we had a need to be able to decode the compressed map file format in Java, in a threadsafe manner, and noticed the problems of having to maintain the encoding and decoding algorithms and the documentation by hand.

Now, this project was eventually abandoned. However, I felt there was something incredibly attractive to the way of working, and started to work on it on my spare time. That means I basically rebuilt the framework from the ground up, while trying to use it on a couple of different encoding formats.

Making sure that Preon supported Java class files (so, using Preon to create a Codec for Java class files) appeared to be  particularly challenging. The way data needs to be found in the constant pool is awkward, to say the least. However, as a framework for supporting binary encoding in general, you cannot make an exception for awkward constructs.

You've presented Preon at several conferences. What's the feedback been like?

In general, I get a lot of positive feedback. It turns out we were not the only once finding limited support for binary encoded data in Java, and it definitely fills a need.

One of the often-heard comments on Preon is obviously that it currently does not support encoding yet. That is indeed a limitation, but that does not mean it cannot be done. The only problem is time, but I am determined to get it into the next major release.

Another question that pops up regularly is if you shouldn't steer clear of defining your own binary encoding standards, and rely on ASN.1 BER encoding instead. Now, I think that would be a mistake. BER encoding, or binary XML encoding standards, are great if your model is Infoset, but in many cases it isn't, and there are much faster ways to access the data if you acknowledge that.

Let's take a network graph as an example. If you would take that network graph, and encode the outgoing edges to other nodes by the offsets of the location of these nodes relative to the beginning of the file, then you can easily jump ahead and fetch that data without having to decode the entire file. That's a huge benefit over BER encoding.

The same holds for all other known (entropy-based) compression techniques. You can get a lot of mileage from these techniques, and it would be a waste to ignore that. In fact, this is why most of the data on your disk is still binary encoded.

Please say a few words about getting started with Preon. What do I need to download and what's the first documentation I should read?

If you want to read more on Preon, there are a couple of documents available on the web site, and on Scribd. I suggest that you first read the introduction, and then check out the examples.

You can check out the examples by checking out the sources, but you can also check them out online in Fisheye. Check out the preon-samples module. I would suggest that you start with the Bitmap example in preon-sample-bmp. Next, you might want to study the preon-sample-bytecode example, that basically contains the sources of a Java bytecode decoder without almost a single line of imperative code.

I imagine that you'd be all set to go after that. The Preon libraries can be downloaded from the Maven repository found here, but you will also need to add references to the Limbo and Pecia repositories, since Preon depends on these libraries.

Example 2. Maven POM Configuration

<dependencies>
<dependency>
<groupId>nl.flotsam.preon</groupId>
<artifactId>preon-binding</artifactId>
<version>1.0</version>
</dependency>
</dependencies>
...
<repositories>
<repository>
<id>limbo-repository</id>
<url>http://limbo.sourceforge.net/repository</url>
</repository>
<repository>
<id>pecia-repository</id>
<url>http://pecia.sourceforge.net/repository</url>
</repository>
<repository>
<id>preon-repository</id>
<url>http://preon.flotsam.nl/repository</url>
</repository>
</repositories>

If you feel the need to add your own Codec, check twice, because what you want may already be in there. However, if you still feel there is need to add your own Codec, first study the BooleanCodec example. That one is extremely easy to grasp, and will give you all you need.

What's the future of Preon? Timeline? Features you'd like to have? Contributions received from others?

Support for encoding is the first thing on my wish list for the next release. Next comes having the ability to annotate hexdumps to explain what's inside based on the meta data gathered by Preon. That will be just awesome. No more manual inspection of hexdumps, but instead click on a piece of hexdump and have Preon explain what it means.

Other than that, I would love to have more examples on the use of Preon. I started looking into Flash files the other day, and I think it makes an interesting case for Preon.

I haven't received any contributions from other people yet, apart from some of my former colleagues. However, I would like to invite anyone to give it a try on their particular problem. I would be happy to give you a hand, and explain Preon if the documentation falls short. It's time that we acknowledge that we write programs to be executed by computers, not by humans. Computers are just way better at dealing with binary data than with human-readable data. And I think Preon is an excellent tool for helping you make the computer do what it's best at.

AttachmentSize
wilfred.jpeg3.61 KB
Published at DZone with permission of its author, Geertjan Wielenga.

Comments

Fabrizio Giudici replied on Thu, 2009/09/10 - 2:02pm

Another feature sorely missing is support for bitstream encoded data. There are plenty of file formats that rely on bit streams rather than byte streams. In fact, in these cases, the bit stream is often not well-aligned with the byte stream. So, a sequence of bits representing a certain attribute may very well cross the boundaries of a single byte. And Java does not have anything to help you with there.

Indeed, there is ImageInputStream that has got some bitwise operations - still, when I measured some performance a couple of years ago I found that, in some cases, rewriting the feature from scratch gave faster results.

Given that, being in the production of image codecs, Preon does interest me. A thing that I've not understood is whether the code necessarily requires a runtime support - I presume it does - and how large it is. Second question, how performance is?

PS I presume we can assume from this interview that the TomTom runtime is written in Java? :-)

Mario Neirinck replied on Fri, 2009/09/11 - 1:20am

We also created our own librairies to be handle all sorts of bit manipulation. We created a SOA platform to communicate with several different shop floor devices like RFId readers and PLC's. When writing our Java device drivers we needed these features too.

Wilfred Springer replied on Fri, 2009/09/11 - 3:30am in response to: Fabrizio Giudici

The dependencies of Preon are listed here. I was actually surprised to find out about some dependencies, so it's good you asked. ;-)

Now, I think some of these dependencies could be turned into optional dependencies. (Like the dependency on CGLIB. That one is only required if you want a particular type of lazy loading. If you are not using that, it shouldn't be in your way.) However, there are some dependencies that will always stay there, such as the dependencies on Pecia and Limbo, and the transitive dependency on Antlr.

Having said that, I could imagine that it might be possible to generate code from the Codec information gathered by Preon at runtime. That would reduce the size of the runtime considerably. In fact, we considered generating C code from it at TomTom, but then we never had time to work on that.

... and unfortunately, the TomTom runtime is not written in Java. ;-)

Wilfred Springer replied on Fri, 2010/10/01 - 2:29pm

Annotated hexdumps have arrived: http://blog.xebia.com/2010/10/01/hexdump/

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.