
Is Object Serialization Evil?

07.07.2011

In my daily work, I use both an RDBMS and MarkLogic, an XML database. MarkLogic can be considered akin to the newer NoSQL databases, but it adds the structure of XML and standard query languages in XQuery and XPath. The NoSQL databases typically store documents, key-value pairs, or some hybrid of the two. Given that any datastore will be searched at some point, you will always care how the data is actually stored and whether there is some way to query it easily. Once you start thinking about the problem, you quickly generalize to the question of how to persist any type of data. However, my focus is not a comparison of the various datastores, but a comparison of how data is stored. More specifically, I want to show that object serialization, mainly Java's built-in mechanism, is evil as a data persistence format.

Given what you normally read on this blog, this may seem like an oddly timed post, but I have run into serialization issues lately in some production code and Mark Needham recently wrote an interesting post about this as well. Coincidentally, Mark is also working with MarkLogic, and there is an interesting item in his post:

The advantage of doing things this way [using lightweight wrappers] is that it means we have less code to write than we would with the serialisation/deserialisation approach although it does mean that we’re strongly coupled to the data format that our storage mechanism uses. However, since this is one bit of the architecture which is not going to change it seems to makes sense to accept the leakage of that layer.

The interesting part of this is that he has accepted using the data format of the storage mechanism, XML in MarkLogic in this case. Why is this interesting? First, it is a move away from the ORM technologies that try to hide the complexities of converting data into objects in the RDBMS world. Also, this is a glimpse into the types of issues that could arise from non-RDBMS storage choices as well as how to persist objects in general.

So, an RDBMS is typically used to map object attributes to a table and columns. The mapping is mostly straightforward with some defined relationship for child objects and collections. This is a well-known area, called Object-Relational Mapping (ORM), and several open source and commercial options exist. In this scenario, object attributes are stored in a similar datatype, meaning a String is stored as a varchar and an int is stored as an integer. But, what happens when you move away from an RDBMS for data persistence?

If you look at Java and its session objects, pure object serialization is used. Assuming that an application session is fairly short-lived, meaning at most a few hours, object serialization is simple, well supported, and built into the Java concept of a session. However, when the data persists over a longer period of time, possibly days or weeks, and you have to worry about new releases of the application, serialization quickly becomes evil. As any good Java developer knows, if you plan to serialize an object, even in a session, you need a real serialization ID (serialVersionUID), not just 1L, and you need to implement the Serializable interface. However, most developers do not know the real rules behind the Java deserialization process. If your object has changed beyond simply adding fields, it is possible that Java cannot deserialize the object correctly even if the serialization ID has not changed. Suddenly, you cannot retrieve your data any longer, which is inherently bad.
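As a concrete illustration, here is a minimal sketch of the built-in round trip; the class and field names are hypothetical, but the serialVersionUID and Serializable requirements are exactly the ones described above:

```java
import java.io.*;

// A minimal sketch of Java's built-in serialization round trip.
// The UserSession class and its fields are hypothetical examples.
public class SerializationDemo {
    static class UserSession implements Serializable {
        // An explicit serialVersionUID. If the class later changes
        // incompatibly (e.g. a field changes type), deserialization
        // fails even though this ID has not changed.
        private static final long serialVersionUID = 8371625497482347L;
        String userName;
        int loginCount;

        UserSession(String userName, int loginCount) {
            this.userName = userName;
            this.loginCount = loginCount;
        }
    }

    public static void main(String[] args) throws Exception {
        UserSession original = new UserSession("alice", 3);

        // Serialize to a byte array (what a session store would persist).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // Deserialize; this only works while the class definition on the
        // classpath is serialization-compatible with the stored bytes.
        UserSession restored;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            restored = (UserSession) in.readObject();
        }
        System.out.println(restored.userName + ":" + restored.loginCount);
    }
}
```

The bytes only deserialize while the class on the classpath remains serialization-compatible with the stream; an incompatible change, such as changing loginCount to a long, makes readObject fail with an InvalidClassException even though the new code compiles fine.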

Now, many developers reading this may say that they would never write code that has this problem. That may be true, but what about a library that you use, or some other developer no longer employed by your company? Can you guarantee that this problem will never happen? The only way to guarantee it is to use a different serialization method.

What options do we have? Obviously, there are the NoSQL datastores, but the actual object format is the relevant question, not which solution to choose. Besides the obvious serialized object, some NoSQL datastores use JSON to store objects, MarkLogic uses XML, and others store just key-value pairs. Key-value pairs typically map a text key to a value that is a serialized object, in either a binary or textual format. So, that leaves us with XML, JSON, and other textual formats.

One of the benefits of a structured format like XML or JSON is that it can be made searchable and provide some level of context. I have talked about data formats before, so I won't go into a comparison again. However, do these types of formats avoid the issues that native Java object serialization has? This is really dependent upon which library you are using for serialization. Some libraries will deserialize an object without any issues regardless of whether the object's field list has changed. Other libraries could have problems depending upon whether a serialized field exists in the target object, or there might not be solid support for collections (though that is doubtful at this point).

Given that even structured formats can have serialization issues, is the only safe path the kind of hand-coded mapping used by ORM tools? Some JSON and XML serialization tools use the same mapping methods as ORM tools in order to avoid these problems. However, once you define these mappings, you are explicitly stating how an object gets translated. This explicit definition will require maintenance, but that is definitely cleaner than trying to track down a serialization defect in some random stack trace.
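An explicit, hand-coded mapping of this kind can be as simple as translating the object to and from a flat map of strings. A sketch, with hypothetical class and field names:

```java
import java.util.*;

// A minimal sketch of an explicit, hand-maintained mapping: the object
// controls exactly how it is written out and read back, so a renamed or
// removed field becomes a visible code change rather than a hidden
// deserialization failure. Class and field names are hypothetical.
public class ExplicitMappingDemo {
    static class Article {
        String title;
        int views;

        // Explicitly translate the object into a flat text form.
        Map<String, String> toMap() {
            Map<String, String> m = new LinkedHashMap<>();
            m.put("title", title);
            m.put("views", Integer.toString(views));
            return m;
        }

        // Explicitly rebuild the object, with defaults for missing keys,
        // so older persisted data still loads after additive changes.
        static Article fromMap(Map<String, String> m) {
            Article a = new Article();
            a.title = m.getOrDefault("title", "");
            a.views = Integer.parseInt(m.getOrDefault("views", "0"));
            return a;
        }
    }

    public static void main(String[] args) {
        Article a = new Article();
        a.title = "Is Object Serialization Evil?";
        a.views = 9600;

        Article restored = Article.fromMap(a.toMap());
        System.out.println(restored.title + "|" + restored.views);
    }
}
```

The map could just as easily be rendered as XML or JSON; the point is that the translation is spelled out in code you maintain, not inferred by a serialization library.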

So is implicit object serialization really worth the potential headaches? Or should we just consider it evil and never speak of it again?

 

From http://regulargeek.com/2011/07/06/is-object-serialization-evil/

Published at DZone with permission of Robert Diana, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Dmitry Zubanov replied on Thu, 2011/07/07 - 4:56am

Serialization is not evil. It is the best way to build a very fast database for a Java desktop application. If you want to avoid problems with serialization, just follow seven simple rules:
- Use simple classes.
- The class must implement the Serializable interface.
- A class meant for serialization must have an empty constructor.
- Make all fields private.
- Use simple getters and setters for the private fields.
- Use transient fields only if really needed.
- If you add a new field (and, of course, its setter and getter), don't forget to check the return value for null to avoid a NullPointerException.

Load all serializable objects into memory and you will have a very fast and responsive database application.

I have tried these rules in a real application and have not had any problems with serialization. I do not need to convert strings into Java objects (Integer, BigDecimal, etc.).
Now I get only the benefits of serialization.

Artur Biesiadowski replied on Thu, 2011/07/07 - 5:02am

As far as I know, the Java serialization format has enough information to make 'blind' searching possible. There is no technical limitation to creating a database which would accept serialized Java blobs and perform XPath-like queries on Java fields, without requiring the actual class file. It would be more powerful than working on XML without a schema, but probably slightly less powerful than XML plus a schema.

As far as I know, nobody has done it yet. I wonder if there is some technical hurdle which makes it impossible, or whether nobody has ever required such a thing.

Denis Robert replied on Thu, 2011/07/07 - 5:51am

Of course it's evil. It hides the data in a format which is only accessible to JVM-bound code, and only accessible if one has at least the binary for the class used to store the object, either in the exact same version or in a materially compatible one. Those are a LOT of limitations. Serialization should only ever be used for data that can only ever be of interest to that specific piece of code (caches, temporary storage, etc.). If there's any possibility of the data being of interest to anything else, it should be stored in a transparent format accessible to a broader range of tools. Too many people consider the data their application is processing as their own. But in most cases, that data belongs to the person or entity executing the code. That entity may very well wish to bypass your application or even remove it altogether. If you store data which is of interest to that entity in an opaque format like Java serialization, you are in effect holding their data hostage.

Robert Diana replied on Thu, 2011/07/07 - 7:29am in response to: Dmitry Zubanov

It may be simple to implement, but I wonder if the restrictions that you need to follow are worth the effort. There is still the problem of searching the data as well. However, you may have a point with a simple desktop application where you are the sole developer. That is just a niche case though.

Robert Diana replied on Thu, 2011/07/07 - 7:33am in response to: Artur Biesiadowski

You are correct, the Java serialization format is well-documented. There is no technical limitation to supporting Java serialized classes in a database, but there is a lot of work required to get it working correctly. Given the typically simple requirements needed to fit your data into a traditional RDBMS or even a NoSQL database, it does not seem worth the effort to build that kind of database. There is also the problem of limited applicability, meaning only Java applications could really store data in the database.

Dmitry Zubanov replied on Thu, 2011/07/07 - 8:53am in response to: Robert Diana

I am not a sole developer; I work on a team. We just use the 7 simple rules and that's enough.
Searching data in serializable objects is fast and simple. I use standard Java instruments: Collections and Reflection. It's very easy, powerful, and fast.
As for getting data from the serialized DB into any other database: I wrote a converter to XML, same as XML-RPC (JRPC). Because we use simple objects for storage, converting to XML is easy, and we can also get a dump in a human-readable format at any time.
Thank you, Robert, for raising an interesting question for discussion.

Lund Wolfe replied on Sat, 2011/07/09 - 4:01pm

I agree that Java serialization for speed is totally appropriate. Hopefully, this data is just a temporary/derived copy of the source, human-readable data stored in a database or XML. Another practical use is saving application project/document data in binary when you don't trust the user to manipulate the data outside of the application. Of course, you could just save the document to a server-side XML file or database, or just marshal the XML and then encrypt it in the client file.

That being said, I have the unfortunate privilege of working on code which uses totally customized serialization that is very hard to understand and very unreliable. Java serialization should follow a simple, standard pattern and be left to better developers.

You have the issue of not being able to delete or change the type of class members once you have serialized objects, until end of life, but you have the same issues with XML/XSD. You'll have to keep data changes additive and optional, or provide transparent project versioning (an XSD per version) within your code, or a single-version backward-compatibility upgrade path for each version, or a manual user version-upgrade process.
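The "additive and optional" rule can even be applied to native Java serialization itself. A sketch (class and field names hypothetical) using ObjectInputStream.GetField, so that streams written by an older class version, before a field was added, still deserialize with a default:

```java
import java.io.*;

// A sketch of keeping changes additive and optional in native Java
// serialization: a custom readObject reads fields by name with defaults,
// so bytes written by an older class version (without the new field)
// still deserialize instead of failing. Names are hypothetical.
public class AdditiveChangeDemo {
    static class Document implements Serializable {
        private static final long serialVersionUID = 1L;
        String body;
        // Field added in a later release; kept optional so old streams load.
        String author;

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            ObjectInputStream.GetField f = in.readFields();
            body = (String) f.get("body", "");
            // The default is used when the stream predates this field.
            author = (String) f.get("author", "unknown");
        }
    }

    public static void main(String[] args) throws Exception {
        Document d = new Document();
        d.body = "hello";
        d.author = "Robert";

        // Round trip through a byte array, as a persistence layer would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(d);
        }
        Document restored;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            restored = (Document) in.readObject();
        }
        System.out.println(restored.body + ":" + restored.author);
    }
}
```

Note that this only covers additive changes; deleting a field or changing its type remains an incompatible change, which is exactly the limitation discussed above.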

Human readable data is a big advantage to the developers and the business, even when the application is the sole owner of the data, whether data is stored in regular data type columns or XML in the database. I would think an RDBMS is the best shared data format, but XML/XSD might be the right choice in some business scenarios.

Cloves Almeida replied on Sat, 2011/07/09 - 8:10pm

If you really care about data (and most companies do), the data must be accessible independent of the application. That's why RDBMSs are so prevalent - they give the IT guys full access to their data.

Schema-less, ACID-less NoSQL stores have their strengths and RDBMSs have theirs - as of now, one does not obsolete the other.

Denis Baranov replied on Sun, 2011/07/10 - 2:34pm

:)

Serializable is for the lazy... Real devs use Externalizable, true versioning, and URLClassLoader. Watch out for those enums, too. (Don't use default serialization for enums. Just don't.)

JBoss folks made a decent effort at enhancing serialization for RPC (not them alone, but their implementation is more credible).

For persistence, though? Never in a lifetime!

 
