JSON-R: A JSON Extension That Deals With Object References (Circular And Others)

07.26.2013

The JSON Reference Issue:

JSON, the well-known data-interchange format based on JavaScript, doesn't deal well with multiple references to the same object instance and even fails abruptly with circular references. To illustrate these two issues, take the following example:
var usa = { name: "USA", language: "English", currency: "Dollar" };
var cities = [
	{ name: "Washington DC", country: usa },
	{ name: "New York", country: usa },
	{ name: "San Francisco", country: usa },
];
alert(JSON.stringify(cities, null, '\t'));

What you get if you run the above code is:

[
	{
		"name": "Washington DC",
		"country": {
			"name": "USA",
			"language": "English",
			"currency": "Dollar"
		}
	},
	{
		"name": "New York",
		"country": {
			"name": "USA",
			"language": "English",
			"currency": "Dollar"
		}
	},
	{
		"name": "San Francisco",
		"country": {
			"name": "USA",
			"language": "English",
			"currency": "Dollar"
		}
	}
]

While this serialization result is correct, it is clearly redundant and inefficient compared with a serialization that would stringify the usa instance only once, for Washington DC, and put a short reference to that instance for New York and San Francisco.

Adding a circular reference to the model makes JSON.stringify fail immediately:

var usa = { name: "USA", language: "English", currency: "Dollar" };
var cities = [
	{ name: "Washington DC", country: usa },
	{ name: "New York", country: usa },
	{ name: "San Francisco", country: usa },
];

usa.capital = cities[0]; // ie. Washington DC (circular reference)

alert(JSON.stringify(cities, null, '\t'));

With Chrome, you'll get this error in the JavaScript console:

Uncaught TypeError: Converting circular structure to JSON
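A common hand-rolled workaround (not part of JSON-R, just a sketch of the usual approach) is a replacer function that silently drops any object it has already serialized. It avoids the TypeError, but it loses information: the shared usa instance and the circular capital link simply disappear from the output:

```javascript
// Demonstration only: drop repeated references with a replacer function.
var usa = { name: "USA", language: "English", currency: "Dollar" };
var cities = [
	{ name: "Washington DC", country: usa },
	{ name: "New York", country: usa },
	{ name: "San Francisco", country: usa }
];
usa.capital = cities[0]; // circular reference

var seen = [];
var json = JSON.stringify(cities, function (key, value) {
	if (typeof value === "object" && value !== null) {
		if (seen.indexOf(value) !== -1) {
			return undefined; // already serialized once: silently dropped
		}
		seen.push(value);
	}
	return value;
});
// The cycle is gone, but so are the shared references: the 'capital'
// property and the 'country' of New York and San Francisco are
// missing from the output.
```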

Several people have tried to overcome this limitation. I'm only aware of a few of these attempts, such as "cycle.js" (from Douglas Crockford, the creator of the JSON format) and "dojox.json.ref" (from Kris Zyp).

However, these solutions require adding specific properties or objects for references, and they often deep-copy JavaScript objects, which leads to performance issues. Instead of inserting a short key or index for the reference, they sometimes use a graph path (e.g. "#path.to.array.2.and.so.on") that can be rather verbose when dealing with deep graphs.

A Demo Of JSON-R, Quick!

All Json-R sources are in the public domain and can be downloaded from: https://github.com/graniteds/jsonr.

To quickly see it in action, first download and install Node.js (if you haven't already). Then issue the following commands:

$ git clone https://github.com/graniteds/jsonr.git
$ cd jsonr
$ node jsonr-server.js

Open your browser and point it at: http://localhost:8080/index.html.

Then, have a look at the code, which contains basic user documentation and comments.

A Bit Of Explanation:

Json-R is an attempt to solve the reference problem with the following design constraints in mind:

  1. It must be 100% compliant with the JSON format.
  2. It must work with any kind of JavaScript objects.
  3. It must not add specific properties or objects, it must not require deep copying.
  4. Serialized data must be as small as possible.
  5. It must use native JSON methods whenever possible.
  6. It must be human readable, or at least provide a way to be so.

The algorithm used in Json-R is quite classical: it is based on a dictionary of objects that is created during serialization and recreated in exactly the same way at deserialization time.

Basically, starting with the cities / country model above, we want to create a JSON string which will look like this:

[
	{
		"name": "Washington DC",
		"country": {
			"name": "USA",
			"language": "English",
			"currency": "Dollar",
			"capital": <reference to the dictionary object #1>
		}
	},
	{
		"name": "New York",
		"country": <reference to the dictionary object #2>
	},
	{
		"name": "San Francisco",
		"country": <reference to the dictionary object #2>
	}
]

Our dictionary of all distinct objects, as encountered during the serialization, contains five entries:

dictionary[0] = cities;
dictionary[1] = cities[0];
dictionary[2] = cities[0].country;
dictionary[3] = cities[1];
dictionary[4] = cities[2];
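This dictionary can be reproduced with a simple depth-first walk over the graph. Here is a minimal sketch (assuming plain objects and arrays; this is not the actual JSON-R code, and a real implementation would avoid the linear indexOf lookup):

```javascript
// Sketch: index every distinct object in depth-first traversal order.
function buildDictionary(root) {
	var dictionary = [];
	(function walk(value) {
		if (typeof value !== "object" || value === null) return;
		if (dictionary.indexOf(value) !== -1) return; // already indexed
		dictionary.push(value);
		for (var key in value) {
			if (Object.prototype.hasOwnProperty.call(value, key)) {
				walk(value[key]);
			}
		}
	})(root);
	return dictionary;
}

var usa = { name: "USA", language: "English", currency: "Dollar" };
var cities = [
	{ name: "Washington DC", country: usa },
	{ name: "New York", country: usa },
	{ name: "San Francisco", country: usa }
];
usa.capital = cities[0];

var dictionary = buildDictionary(cities);
// Same five entries, in the same order as listed above.
```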

Explaining how and why this dictionary is recreated at deserialization time in exactly the same way it was created during serialization is beyond the scope of this article. Let's just say that the idea roughly comes from a very simplified variant of the Lempel-Ziv-Welch algorithm, which is used, for example, in both Java and AMF3 serialization: a dynamic dictionary of items is created and recreated identically during compression / decompression or serialization / deserialization.

The next problem is to find a way to encode these references in an unambiguous yet JSON-compliant format. We use Unicode characters from the first "Private Use Area" (i.e. from \uE000 to \uF8FF), which gives us 0xF8FF - 0xE000 = 6399 indices. These characters are perfectly legal in JSON strings.
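Mapping a dictionary index to such a character (and back) is straightforward; the following sketch illustrates the idea with hypothetical helper names (this is not the actual JSON-R API):

```javascript
// Hypothetical helpers: one dictionary index <-> one Private Use Area character.
var PUA_START = 0xE000;
var PUA_END = 0xF8FF;

function encodeReference(index) {
	if (index < 0 || index > PUA_END - PUA_START) {
		throw new RangeError("index out of single-character range");
	}
	return String.fromCharCode(PUA_START + index);
}

function decodeReference(str) {
	var code = str.charCodeAt(0);
	// Returns -1 for ordinary strings that are not reference indices.
	return (str.length === 1 && code >= PUA_START && code <= PUA_END)
		? code - PUA_START : -1;
}
```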

It is worth noting that these characters are left unassigned by the Unicode specification and are reserved for private, non-portable usage. For example, the character '\uF8FF' is used by Apple for its logo (a character representing an apple). It is rendered as an apple with an Apple font but could show up as a Windows logo with a different font. Usage of these characters is discouraged, especially on the Web, and you will usually not find them anywhere in your textual data.

By using these characters as dictionary indices, our JSON-R data for the example above is as follows:

[
	{
		"name": "Washington DC",
		"country": {
			"name": "USA",
			"language": "English",
			"currency": "Dollar",
			"capital": "\uE001"
		}
	},
	{
		"name": "New York",
		"country": "\uE002"
	},
	{
		"name": "San Francisco",
		"country": "\uE002"
	}
]

However, these characters are not escaped in the produced JSON-R string, so if you look at it, they will show up as squares or whatever placeholder glyph your font uses for unassigned characters ('\uE001' and '\uE002' typically render as the same unreadable box). Note also that you can still use characters in the \uE000-\uF8FF range in your data: if a string begins with such a character, Json-R escapes it so that it won't be mistaken for a reference index during deserialization (see the implementation for details).
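One possible escaping strategy is to double the leading reserved character on writing and strip it on reading. The sketch below only illustrates the idea; the actual JSON-R scheme may differ, so check its implementation:

```javascript
// Illustration of one possible escaping strategy (the actual JSON-R
// scheme may differ -- see its implementation for the details).
var PUA_START = 0xE000;
var PUA_END = 0xF8FF;

function isPua(ch) {
	var code = ch.charCodeAt(0);
	return code >= PUA_START && code <= PUA_END;
}

// Writing: double a leading reserved character so the string
// cannot be mistaken for a reference index.
function escapeString(str) {
	return (str.length > 0 && isPua(str.charAt(0))) ? str.charAt(0) + str : str;
}

// Reading: two leading reserved characters mean "literal data string".
function unescapeString(str) {
	return (str.length > 1 && isPua(str.charAt(0)) && isPua(str.charAt(1)))
		? str.substring(1) : str;
}
```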

So What:

The first four design constraints above are satisfied:

  1. Serialized data are valid JSON data.
  2. Any JavaScript object can be serialized that way.
  3. The implementation has a specific 'stringify' method that produces this data without changing or copying submitted objects.
  4. Each reference consumes 5 bytes when encoded in UTF-8 (1 quote + 3 bytes for the Unicode character + 1 quote).

Constraint 5 is only partially satisfied: we cannot use the native JSON.stringify method without breaking constraint 3. However, we can use the native JSON.parse method and then resolve references.
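To illustrate this two-pass approach, here is a simplified sketch that parses with the native JSON.parse and then resolves references while rebuilding the dictionary in the same traversal order (it ignores the escaping of data strings that begin with reserved characters, and it is not the actual JSON-R parser):

```javascript
// Simplified two-pass deserialization: native JSON.parse, then a
// reference-resolution walk that rebuilds the dictionary.
function resolveReferences(text) {
	var dictionary = [];
	function resolve(value) {
		if (typeof value === "string" && value.length === 1) {
			var code = value.charCodeAt(0);
			if (code >= 0xE000 && code <= 0xF8FF) {
				return dictionary[code - 0xE000]; // a reference index
		}
		}
		if (typeof value === "object" && value !== null) {
			dictionary.push(value);
			for (var key in value) {
				if (Object.prototype.hasOwnProperty.call(value, key)) {
					value[key] = resolve(value[key]);
				}
			}
		}
		return value;
	}
	return resolve(JSON.parse(text));
}

// The JSON-R string from the example above:
var jsonr =
	'[{"name":"Washington DC","country":{"name":"USA","language":"English",' +
	'"currency":"Dollar","capital":"\uE001"}},' +
	'{"name":"New York","country":"\uE002"},' +
	'{"name":"San Francisco","country":"\uE002"}]';

var cities = resolveReferences(jsonr);
// cities[1].country and cities[2].country are the same object as
// cities[0].country, and its 'capital' points back to cities[0].
```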

Constraint 6 is still an issue: unreadable (or even undisplayable) Unicode characters are far from human readable. That's why JSON-R comes with a 'revealReferences' method that converts the above text into:

0@[
	1@{
		"name": "Washington DC",
		"country": 2@{
			"name": "USA",
			"language": "English",
			"currency": "Dollar",
			"capital": "@1"
		}
	},
	3@{
		"name": "New York",
		"country": "@2"
	},
	4@{
		"name": "San Francisco",
		"country": "@2"
	}
]

"<index>@[" or "<index>@{" sequences mark dictionary entries with their respective indices, and "@<index>" strings are references to these entries. This makes things much more readable, although it can still be quite confusing with a large graph of objects.
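The reference side of this transformation can be sketched with a simple regular expression; handling the "<index>@" entry markers would require real parsing, so this hypothetical helper only handles the "@<index>" part (revealReferences itself does more):

```javascript
// Hypothetical helper: make "@<index>" references visible.
// Dictionary entry markers ("<index>@{") are not handled here,
// unlike in the full 'revealReferences'.
function revealRefs(jsonr) {
	return jsonr.replace(/"([\uE000-\uF8FF])"/g, function (match, ch) {
		return '"@' + (ch.charCodeAt(0) - 0xE000) + '"';
	});
}
```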

Current Limitations:

  1. The only server-side implementation is for Node.js: if Json-R has any success, it should get Java / PHP / .NET / <your language> implementations.
  2. Graphs of objects must not contain more than 6399 distinct objects (including arrays). This could easily be fixed by encoding references with 2 or more characters beyond this limit (e.g. "\uF8FF\uE000" for 6400, etc.).
  3. Only objects are taken into account. An improved version (though not for readability) could use a dictionary of strings as well.
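As a sketch of the second point, a hypothetical base-6400 encoding (a slightly different scheme from the two-character example suggested above, and not part of JSON-R) would lift the limit like this:

```javascript
// Hypothetical base-6400 index encoding: small indices stay one
// character long, larger ones grow by one character per power of 6400.
var BASE = 0xF8FF - 0xE000 + 1; // 6400 available code points

function encodeIndex(index) {
	var encoded = "";
	do {
		encoded = String.fromCharCode(0xE000 + (index % BASE)) + encoded;
		index = Math.floor(index / BASE);
	} while (index > 0);
	return encoded;
}

function decodeIndex(encoded) {
	var index = 0;
	for (var i = 0; i < encoded.length; i++) {
		index = index * BASE + (encoded.charCodeAt(i) - 0xE000);
	}
	return index;
}
```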

Published at DZone with permission of its author, Franck Wolff.

