
Crimes Against XML

07.04.2008

Integration is an important part of any B2B application, and XML is a common (IMHO good) way of passing data between systems. Generally one side has to do the integration dirty work. Normally this is the smaller company, which for one reason or another is where I'm always working. DB schemas can be a pain to change, but XML schemas, particularly ones that are exposed externally, are almost impossible to alter. In my time I have seen schemas from hell that made me want to rip my hair out and ruined what could have been a very nice day. Here is a list of what I consider XML crimes.

Not Providing an XML Schema
Whenever you're working with XML you should always have a schema. It's completely inappropriate and amateurish to distribute XML externally and not provide a schema (preferably XSD). I'm surprised how often I see people working with XML without a schema. How else are you meant to document and validate your XML? Not to mention code completion and auto-generation.
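
As a rough illustration of what a schema buys the receiving side, the sketch below validates two documents against a tiny XSD using the JDK's javax.xml.validation API. The schema, the class name SchemaValidationSketch and the sample documents are assumptions made up for this example, not part of any real integration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Hypothetical demo class; the schema and documents are invented for illustration.
class SchemaValidationSketch {

    // A tiny, assumed XSD: an <artist> with a required <name> followed by a required <country>.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='artist'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='country' type='xs:string'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true when the document validates against XSD, false otherwise.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXException signals a schema violation, IOException a read problem.
            return false;
        }
    }

    public static void main(String[] args) {
        // A complete document, then one that is missing the <country> element.
        System.out.println(isValid("<artist><name>Arcade Fire</name><country>Canada</country></artist>"));
        System.out.println(isValid("<artist><name>Arcade Fire</name></artist>"));
    }
}
```

Validating like this when a document arrives means a missing element is rejected up front instead of surfacing later, somewhere deep in the parsing code.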

Having Elements That Only Contain Attributes
Attributes are intended to describe the properties of the element; basically they're for metadata [1]. Sometimes people get confused and put all the data in attributes.
e.g.

<artist name="Arcade Fire" country="Canada" type="Band"/>

A much better way to present this data would be:

<artist type="band">
<name>Arcade Fire</name>
<country>Canada</country>
</artist>

Why is this better? Firstly, it's the way it was intended; secondly, it's easier to parse; and finally, it's more extensible. A key component of interoperability is consistency, and having some values in attributes and others in elements breaks this rule.

Incorrect Character Encoding
Encoding seems to be one of those things most developers don't understand. Unless you have a very good reason otherwise, all XML should be encoded as UTF-8. You should always add an XML declaration at the top of your documents.
e.g.

<?xml version="1.0" encoding="UTF-8"?>

For better or worse, ASCII characters tend to have the same code points in commonly used encoding schemes. This means encoding problems generally don't present until a system goes live; commonly '`' is the culprit on Windows platforms. Just because you don't have any foreign/weird characters, or you just don't care about internationalisation, does not mean you have amnesty for this crime.

Another important point is that adding encoding="UTF-8" to the declaration doesn't mean the document is actually encoded as UTF-8. By default, Java has traditionally written text files in the platform's default encoding (windows-1252 on many Windows installations), and Java is not the only accomplice in this crime.
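
The gap between the declared encoding and the real one can be reproduced with a small sketch like the one below. It writes the same document twice with an explicitly chosen charset and then checks whether the JDK's parser accepts the bytes. The class name EncodingSketch is hypothetical, and ISO-8859-1 is only an assumed stand-in for whatever legacy default a platform might use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class showing a declaration that may or may not match the bytes.
class EncodingSketch {

    // The declaration claims UTF-8; \u00e9 is the accented "e" in "Regine".
    static final String XML =
        "<?xml version='1.0' encoding='UTF-8'?><member>R\u00e9gine Chassagne</member>";

    // Writes the text with an explicit charset instead of relying on the platform default.
    static byte[] write(String xml, Charset charset) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (Writer writer = new OutputStreamWriter(out, charset)) {
                writer.write(xml);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the bytes parse as well-formed XML under the declared encoding.
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bytes that match the declaration, then bytes written in a legacy charset.
        System.out.println(parses(write(XML, StandardCharsets.UTF_8)));
        System.out.println(parses(write(XML, StandardCharsets.ISO_8859_1)));
    }
}
```

The design point is simply to name the charset wherever bytes are produced, so the declaration and the data cannot drift apart when the code moves to another machine.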

Elements Containing Delimited Fields
Best described with an example:

<artist>
<name>Arcade Fire</name>
<members>Win Butler, Régine Chassagne, Richard Reed Parry, William Butler, Tim Kingsbury, Sarah Neufeld, Jeremy Gara</members>
</artist>

One of the good things about XML is that it can be used to map most data structures, such as lists within lists. Delimiting fields within elements shows that the designer didn't think enough about the future needs of the schema or doesn't understand XML. The example below is much easier to generate and parse.

<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>

No additional logic is required to convert this structure into a useful object model. Finally, what is going to happen if one of your fields contains the character that is being used as the delimiter?
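
The delimiter problem is easy to demonstrate. The sketch below reads the same two members from a nested document and from a comma-delimited one, using the JDK's DOM parser. The "Surname, First name" formatting of the values and the class name MemberListSketch are assumptions chosen purely to provoke the collision.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing nested and delimited member lists.
class MemberListSketch {

    // Two members whose values happen to contain the delimiter character.
    static final String NESTED =
        "<artist><members><member>Butler, Win</member>"
      + "<member>Chassagne, R\u00e9gine</member></members></artist>";
    static final String DELIMITED =
        "<artist><members>Butler, Win, Chassagne, R\u00e9gine</members></artist>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Nested form: each <member> element is already one list entry.
    static List<String> fromElements(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("member");
        List<String> members = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            members.add(nodes.item(i).getTextContent());
        }
        return members;
    }

    // Delimited form: the consumer has to know about, and trust, the delimiter.
    static List<String> fromDelimited(String xml) {
        String raw = parse(xml).getElementsByTagName("members").item(0).getTextContent();
        List<String> members = new ArrayList<>();
        for (String part : raw.split(",")) {
            members.add(part.trim());
        }
        return members;
    }

    public static void main(String[] args) {
        System.out.println(fromElements(NESTED).size() + " members from nested elements");
        System.out.println(fromDelimited(DELIMITED).size() + " members from the delimited field");
    }
}
```

The nested form keeps the two values intact, while the delimited form splits them into fragments because nothing in the data says which commas are separators.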

Not Wrapping Repeated Elements
e.g.

<artist>
<name>Arcade Fire</name>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
<album>Funeral</album>
<album>Neon Bible</album>
</artist>

Once again a nightmare to parse and it's not exactly human readable. XML in this structure will also result in an ambiguous schema, which in turn plays havoc with auto-code generators (e.g. JAXB). This is much better:

<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
<albums>
<album>Funeral</album>
<album>Neon Bible</album>
</albums>
</artist>

Repeating Element Names in Different Contexts
Consistency is something all developers should strive for every day, and in something as self-contained as an XML schema it really shouldn't be that hard. One place XML schemas often fall down is in the names that are chosen for attributes and elements. In the example below the element named "artist" appears in two places, each of which has its own meaning and structure.
e.g.

<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<artist>Arcade Fire</artist>
</track>
</album>

A better structure would be

<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<track-artist>Arcade Fire</track-artist>
</track>
</album>

A logical and consistent layout will always make the parser simpler and the XML easier for humans to read. The opposite of this crime is just as bad: not using the same element or attribute name for fields that are clearly the same, e.g. the "name" and "track-name" elements in the example below.

<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<track-name>Intervention</track-name>
</track>
</album>

A much more logical structure is to use the same "name" element in both locations.

<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<name>Intervention</name>
</track>
</album>

Abbreviating Element/Attribute Names
Yes, XML is very verbose and files can become very big, but abbreviating element or attribute names to save space is not a good idea.

<a>
<n>Arcade Fire</n>
</a>

Firstly, if you're concerned about size, why are you using XML? Secondly, XML contains a lot of repeated text, so it compresses very well. This XML is definitely not human readable, and any code that works with it will be more difficult to maintain.
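
The compression claim can be checked with a few lines of code. The sketch below builds a deliberately verbose, repetitive document and gzips it with java.util.zip. The document shape, the entry count and the class name CompressionSketch are assumptions for illustration, and the exact ratio will depend on the data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class measuring how well verbose XML compresses.
class CompressionSketch {

    // Builds a document with `count` verbose, repetitive <artist> entries.
    static String buildXml(int count) {
        StringBuilder xml = new StringBuilder("<artists>");
        for (int i = 0; i < count; i++) {
            xml.append("<artist><name>Artist ").append(i)
               .append("</name><country>Canada</country></artist>");
        }
        return xml.append("</artists>").toString();
    }

    // Returns the gzip-compressed size of the text in bytes.
    static int gzippedSize(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                gzip.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = buildXml(1000);
        int rawSize = xml.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("raw bytes:     " + rawSize);
        System.out.println("gzipped bytes: " + gzippedSize(xml));
    }
}
```

Because the long tag names repeat on every entry, they cost very little once compressed, which is why shortening them buys so little on the wire.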

Key Value Lookup
The worst XML schema I have ever seen. I think an example is enough of a description:

<root>
<entry>
<key>Artist</key>
<value>Arcade Fire</value>
</entry>
<entry>
<key>Country</key>
<value>Canada</value>
</entry>
</root>

This really flies in the face of everything that XML represents; you may as well just have a plain text file.

Conclusion

The two rules to stick to when designing an XML schema are:

  1. Ensure the XML is machine readable
  2. Ensure the XML is human readable

Please, if you're ever designing an XML schema, remember that both machines and humans need to be able to read the XML easily. Think about it a little and don't go with whatever's easiest right now, otherwise others will have to live with your crimes.

Published at DZone with permission of its author, Sam Cavenagh.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Walter Laan replied on Fri, 2008/07/04 - 4:02am

Repeating Element Names in Different Contexts

 

So how do you tell the difference when to use name instead of track-name, but track-artist instead of artist? I guess because artist can have sub tags and you want to prevent a loop? Would a better track artist tag be an artist-reference that points back to the main artist tag?

Emmanuel Bourg replied on Fri, 2008/07/04 - 7:20am

[quote]Having Elements That Only Contain Attributes

Why is this better? Firstly, it's the way it was intended, secondly it's easier to parse, and finally it's more extendable. A key component in interoperability is consistency, having some values in attributes and others in elements breaks this rule.[/quote]

Parsing attributes is actually easier than parsing elements, try writing a SAX handler for your two examples and compare.
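
One way to run the comparison suggested here is sketched below with the JDK's SAX parser. The two handlers, the shared FieldHandler base class and the class name SaxComparisonSketch are assumptions written for this illustration; they collect the same fields from the attribute form and the element form of the article's artist example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing SAX handlers for the two artist layouts.
class SaxComparisonSketch {

    static final String ATTRIBUTE_FORM =
        "<artist name='Arcade Fire' country='Canada' type='Band'/>";
    static final String ELEMENT_FORM =
        "<artist type='band'><name>Arcade Fire</name><country>Canada</country></artist>";

    // Shared base so both handlers expose the collected fields the same way.
    abstract static class FieldHandler extends DefaultHandler {
        final Map<String, String> fields = new LinkedHashMap<>();
    }

    // Attribute form: every value arrives in a single startElement callback.
    static class AttributeHandler extends FieldHandler {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            for (int i = 0; i < attrs.getLength(); i++) {
                fields.put(attrs.getQName(i), attrs.getValue(i));
            }
        }
    }

    // Element form: the handler has to buffer character data between callbacks.
    static class ElementHandler extends FieldHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0);
            for (int i = 0; i < attrs.getLength(); i++) {
                fields.put(attrs.getQName(i), attrs.getValue(i));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element, so append rather than assign.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!"artist".equals(qName)) {
                fields.put(qName, text.toString().trim());
            }
            text.setLength(0);
        }
    }

    static Map<String, String> read(String xml, FieldHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(ATTRIBUTE_FORM, new AttributeHandler()));
        System.out.println(read(ELEMENT_FORM, new ElementHandler()));
    }
}
```

The attribute handler needs one callback and no state, while the element handler needs three callbacks and a text buffer, which is the difference being pointed at.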

 

 

aaaaa aaaaaaaa replied on Fri, 2008/07/04 - 2:53pm

Very useful, many developers should read this post!

Nos Doughty replied on Sat, 2008/07/05 - 12:51am

Firstly, I am also often in the situation described by the author - the smaller company lumbered with the dirty work.

However, I feel that most of the guidelines would be best described as a 'style guide' for the author's preference. I suspect many developers would agree with him and therefore this article is very useful. However, I do not believe his viewpoints represent a 'best practice', as such a thing is not possible with XML for the reason I will detail below.

Consider this: I find myself in agreement with Emmanuel Bourg on the ease of parsing attributes - which is why I *personally* prefer to use attributes - the difference can be substantial. On the other hand, the statement of Sam's that the original intent was to use nested elements is certainly correct. Why is this the case?

Simple. What most people who work with XML seem to forget is that it was *never* intended as a data 'container' format to share data between applications. It was designed as a simplification of SGML to replace HTML (all the style info was to be in XSL). Basically, to add 'metadata' to relatively free-form text streams. This is actually a big problem for the writers of XML processing libraries. For example, the structure:

 <ParentElement>
    <ChildElement/>
</ParentElement>

is actually represented internally as:

<ParentElement><TextNode>[CR][TAB]<ChildElement/>[CR]</TextNode></ParentElement>

You can see why this is when you remember the original use case of XML:

<Title>My Stuff<linebreak/>By Me</Title>

 

Now obviously, in the normal case of using Java libraries to marshal/de-marshal object structures to/from XML, the developer does not want to see those spurious <TextNode>s, so the libraries will skip them. But it should be remembered that XML is actually a terrible format for representing object graphs, and any use of it for that purpose will involve trade-offs which will vary with the context of the work at hand.
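
The whitespace nodes are easy to observe with the JDK's own DOM parser. In the sketch below (the class name WhitespaceNodesSketch is hypothetical), the document is parsed without a DTD or schema, so the parser has no way to know the whitespace is insignificant. With this parser the whitespace shows up as text nodes that are siblings of the child element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class listing the direct children of a document's root element.
class WhitespaceNodesSketch {

    // The indented structure from the comment: newline, tab, child, newline.
    static final String XML = "<ParentElement>\n\t<ChildElement/>\n</ParentElement>";

    // Returns the node names of the root element's direct children, in document order.
    static List<String> childNames(String xml) {
        try {
            NodeList children = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                // Text nodes report the fixed name "#text".
                names.add(children.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childNames(XML));
    }
}
```

Code that walks child nodes by index therefore has to filter on node type, which is exactly the kind of housekeeping the binding libraries hide.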

Other quick points:

1) Please don't encourage XSchema. It should be used only when absolutely necessary. The sooner we can put a stake through its heart the better. I suggest Schematron or RELAX-NG. I would personally prefer to receive an XML file with no XSchema, which allows me to create a RELAX-NG version and give it back to the XML authors with instructions on how to use it.

2) Many of your points such as 'wrapping repeated elements' and the 'repeating names in different contexts' etc are irrelevant when using XPath to examine the document structure. Dom4j has fantastic XPath support. In fact, XPath is in my opinion the only reason to use XML over other formats more suited to representing object graphs such as JSON.
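
As a small sketch of this point, the example below uses the JDK's built-in javax.xml.xpath API rather than Dom4j, which is an assumption made only to keep the example dependency-free. The class name XPathSketch and the two shortened sample documents are likewise invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class querying wrapped and unwrapped layouts with one expression.
class XPathSketch {

    static final String WRAPPED =
        "<artist><name>Arcade Fire</name><members>"
      + "<member>Win Butler</member><member>Sarah Neufeld</member>"
      + "</members></artist>";
    static final String FLAT =
        "<artist><name>Arcade Fire</name>"
      + "<member>Win Butler</member><member>Sarah Neufeld</member></artist>";

    // Evaluates an XPath expression and returns how many nodes it selects.
    static int count(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same expression works whether or not a <members> wrapper is present.
        System.out.println(count(WRAPPED, "//member"));
        System.out.println(count(FLAT, "//member"));
    }
}
```

A path such as /artist/members/member would tie the query to one layout, whereas //member ignores the wrapper entirely, at the cost of matching member elements anywhere in the document.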

Still, a nice article with some good advice for creating nicely 'human readable' XML files that many developers could benefit from. As such I would recommend these guidelines more for the case of using XML for configuration files than for B2B transmission, which (a) should not involve human interaction except for debugging and (b) is a case where speed of processing and transmission is more important than readability.

Regards

 

 

Ritesh Chitlangi replied on Sun, 2008/07/06 - 1:08am

I like XML Schema. It's nice. Also, about distributing a schema with XML, well, I am in favour of it, but I'm not sure if people sending the XML need to validate it, as any programme that accepts XML input from outside must validate it anyway. Validation by the client programme is a bit pointless in that case.

phil swenson replied on Mon, 2008/07/07 - 9:34am in response to: Nos Doughty

Nos Doughty +1

 

I'd much rather get an example XML snippet than a schema.... XML schemas are a complicated, brittle, overly-architected POS. Same could be said for SOAP :)

Sam Cavenagh replied on Sun, 2008/07/06 - 7:01pm

Nos:

Thanks for your input.  When to use attributes is a very subjective topic and perhaps it is a bit harsh of me to call their use a crime.  That said, I do *personally* think they should be used sparingly.  An element without a value just seems wrong to me, and I do think elements facilitate a more extendable structure.

e.g. We need to add an additional property to one of our values.

<artist type="band"> 
    <name>Arcade Fire</name> 
    <country continent="North America">Canada</country> 
</artist>

I haven't had any exposure to Schematron or RELAX-NG and I will have a look into them when I have some time; that said, XSD and DTD seem to be more widely used and therefore have better tool support. One reason I added this point is that in the past I have done development work where everything’s fine, and then 5 mins after it goes live it breaks because a previously unseen XML structure comes through.

 Holyroller:

As far as XML validation goes, I meant that more from the receiver’s side than the sender’s side.

Phil:

In an earlier draft I did have "not providing example XML" but I dropped it for some reason.  In reality you always need an example and a schema.

 

Jason Osgood replied on Mon, 2008/07/07 - 5:27am

 Regarding any use of XML:

"A strange game. The only winning move is not to play."

  -- Joshua, War Games, 1983

Bruno Vernay replied on Mon, 2008/07/07 - 5:47am

+1 for Relax NG or even DTD.

XSD are very often misused because they are so cumbersome and people rely exclusively on tools to create them and create too many or too few constraints.

 A few XML examples are generally a very good idea too as well as some xpath expressions.

 Providing room for extensibility and putting constraints where really needed can be good too.

But the title is too bold. It is a problem when you start to say that others misuse your technology. I have seen a lot of this around REST services generally.

That said, if they could create a simple profile of XML that was suited to data, it would be nice. XML looks simple at first, but it can be really tricky. Just search for XML canonical form.

Tero Vaananen replied on Tue, 2008/07/08 - 8:31am

 

This is a good list and I have learned many of these lessons the hard way. One thing that could be added is the use of references (either the XML native kind or your own). 

There is often pressure to use references extensively with repeating elements, by providing an ID to some entity that is described somewhere else in the document.  When doing this, the nature of XML is often forgotten and everything becomes a reference. I have seen this when people try to describe objects or database constructs in XML.

The problem with XML native references is that they require unique IDs to refer to. These will inevitably be different from the 'natural' IDs of the XML entities we are describing, and are only unique in the XML document they are described in. If you want to transform and build new XML documents from other documents, it becomes exceedingly hard to keep things together. It's just not worth using them in this manner.

When you use your own ID references, the document is actually very hard to validate, and you require business logic to understand it. Also, tools like XSL work poorly with such constructs.

Naturally, you still have to use references, but I usually avoid them as long as I can. Even when I do use references, some information about the entity I am referencing is built in with the reference so I can read the documents. Nobody can understand documents that have ID soup referencing things left and right - referencing what, where?

 

Guillaume Aubert replied on Fri, 2008/07/11 - 10:17am

Thanks Sam for this article.

Your recommendations seem natural but I don't think that they always apply for real life cases.

Regarding XSD, it is a very good idea to have a pseudo grammar for describing your XML file structure but I am not sure it should be used by programs to validate your xml.

For example, we have a running system sending and receiving XML messages describing scientific measurements, and we had to remove the XSD validator, which was too strict because the messages received were not always compliant and because our system evolves. It adds lots of complexity to support several XSD validators with different XSD versions (it can be a nightmare depending on the tools you use).

So instead we validate our messages with a home made validator that is more flexible as we can easily change it and adapt to real life cases.

It can also contain semantic validations that you will not avoid with or without XSD anyway. You could say that it is a problem of tools but maybe it is because XSD is cumbersome.

Life is all about compromises. After all, you want to use XML for solving your problem and not focus on XML only.

On the other hand, our XSD grammar is still maintained and provides a nice way to generate tool skeleton for accessing our data and also understanding the data structure. We are also migrating to Relax-ng which is more natural and hasn't been designed by Vulcans.

We also love XPath which is used by our bespoke validator and all our tools that are extracting information from the XML message. Xpath is probably the jewel of the xml tools.

Next point is should you use delimiters in your xml tag (space, comma) or separate all your information with a tag ?

I think it really depends.

In our messages, we encode a list of numbers and of course we do not intend to represent it like that:

<spectrum>
  <val>0</val>
  <val>1</val>
  <val>2</val> ......
</spectrum> 
 
instead it is like  that:
<spectrum>
0 1 2 3 4 ...
5 6 0 0 0 ...
1 0 1 0 1 ...
</spectrum> 
The first solution makes the XML message huge in comparison with the real information transported and unreadable for humans,
who are used to the second notation.
Again it is all about compromises and surprisingly if you look at some of the ISO standards for different kind
of XML data they also have adopted the same kind of solutions ;-)
Keep up the good work
 Guillaume 

Blaise Doughan replied on Fri, 2011/03/25 - 4:56am

Once again a nightmare to parse and it's not exactly human readable. XML in this structure will also result in an ambiguous schema, which in turn plays havoc with auto-code generators (e.g. JAXB).

 JAXB implementations (Metro, MOXy, JaxMe, etc) do not require a grouping element.  In fact the default behavior is not to have them:

@XmlElement(name="member")
private List<String> members;

Although a grouping element may be added:

@XmlElementWrapper(name="members")
@XmlElement(name="member")
private List<String> members;

-Blaise
