Crimes Against XML
Integration is an important part of any B2B application, and XML is a common (IMHO good) way of passing data between systems. Generally one side has to do the integration dirty work. Normally this is the smaller company, which for one reason or another is where I'm always working. DB schemas can be a pain to change, but XML schemas, particularly ones that are exposed externally, are almost impossible to alter. In my time I have seen schemas from hell that have made me want to rip my hair out and ruined what could have been a very nice day. Here is a list of what I consider XML crimes.
Not Providing an XML Schema
Whenever you're working with XML you should always have a schema. It's completely inappropriate and amateurish to distribute XML externally and not provide a schema (preferably XSD). I'm surprised how often I see people working with XML without a schema. How else are you meant to document and validate your XML? Not to mention code completion and auto-generation.
Having Elements That Only Contain Attributes
Attributes are intended to describe the properties of the element; basically they're for metadata [1]. Sometimes people get confused and put all the data in attributes.
e.g.
<artist name="Arcade Fire" country="Canada" type="Band"/>
A much better way to present this data would be:
<artist type="band">
<name>Arcade Fire</name>
<country>Canada</country>
</artist>
Why is this better? Firstly, it's the way it was intended; secondly, it's easier to parse; and finally, it's more extendable. A key component of interoperability is consistency, and having some values in attributes and others in elements breaks this rule.
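To make the comparison concrete, here is a minimal sketch (in Python, using the standard library's xml.etree.ElementTree; the variable names are my own for illustration) that reads the artist name from both layouts. With a tree API the two reads look much alike; the difference is that a child element can later gain attributes or children of its own, while an attribute value stays a flat string.

```python
import xml.etree.ElementTree as ET

attribute_form = '<artist name="Arcade Fire" country="Canada" type="Band"/>'
element_form = """
<artist type="band">
  <name>Arcade Fire</name>
  <country>Canada</country>
</artist>
"""

# Attribute form: every value is a flat string stored on the element itself.
artist = ET.fromstring(attribute_form)
name_from_attribute = artist.get("name")

# Element form: values are child elements, so each one can be extended
# later without changing how its siblings are read.
artist = ET.fromstring(element_form)
name_from_element = artist.findtext("name")
country = artist.find("country")

print(name_from_attribute, name_from_element, country.text)
```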
Incorrect Character Encoding
Encoding seems to be one of those things most developers don't understand. Unless you have a very good reason otherwise, all XML should be encoded as UTF-8. You should always add an XML declaration at the top of your documents.
e.g.
<?xml version="1.0" encoding="UTF-8" ?>
For better or worse, ASCII characters tend to have the same code points in commonly used encoding schemes. This means encoding problems generally don't present until a system goes live; commonly '`' is the culprit on Windows platforms. Just because you don't have any foreign/weird characters, or you just don't care about internationalisation, does not mean you have amnesty for this crime.
Another point that is important to note is that adding encoding="UTF-8" to the declaration doesn't mean the file is actually encoded as UTF-8. By default Java has traditionally produced text files in the encoding scheme of your OS (typically windows-1252 on Western-locale Windows), and Java is not the only accomplice in this crime.
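The mismatch between the declaration and the actual bytes is easy to reproduce. The sketch below is Python rather than Java, assumes Python 3.8 or later (for the xml_declaration argument of ElementTree.tostring), and uses a helper called parses that is my own invention. It writes one document with an explicit encoding and fakes another whose declaration claims UTF-8 while the bytes come from a Windows code page.

```python
import xml.etree.ElementTree as ET

root = ET.Element("artist")
ET.SubElement(root, "member").text = "Régine Chassagne"

# Serialise with an explicit encoding so the declaration and the bytes
# agree, whatever the platform default happens to be.
good = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Simulate the crime: the declaration says UTF-8, but the text was
# written out using the windows-1252 code page.
bad = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<artist><member>Régine Chassagne</member></artist>"
).encode("cp1252")


def parses(data: bytes) -> bool:
    """Return True if the bytes are accepted by the XML parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses(good))
print(parses(bad))
```

Note that a pure-ASCII document would pass in both cases, which is exactly why this kind of bug tends to stay hidden until real data arrives.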
Elements Containing Delimited Fields
Best described with an example:
<artist>
<name>Arcade Fire</name>
<members>Win Butler, Régine Chassagne, Richard Reed Parry, William Butler, Tim Kingsbury, Sarah Neufeld, Jeremy Gara</members>
</artist>
One of the good things about XML is that it can be used to map most data structures, such as lists in lists. Delimiting fields within elements shows the designer didn't think enough about future needs of the schema or doesn't understand XML. The example below is much easier to generate and parse.
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
No additional logic is required to convert this structure into a useful object model. Finally what's going to happen if one of your fields contains the character that is being used for delimiting?
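As a rough illustration of the delimiter problem, the Python sketch below parses both layouts with xml.etree.ElementTree. The data is hypothetical: I have written one member as "Butler, Win" purely to put the delimiter inside a field.

```python
import xml.etree.ElementTree as ET

nested = """
<artist>
  <name>Arcade Fire</name>
  <members>
    <member>Butler, Win</member>
    <member>Régine Chassagne</member>
  </members>
</artist>
"""

delimited = """
<artist>
  <name>Arcade Fire</name>
  <members>Butler, Win, Régine Chassagne</members>
</artist>
"""

# Nested elements: the parser already knows where each member starts and ends.
members_nested = [m.text for m in ET.fromstring(nested).findall("./members/member")]

# Delimited text: the consumer has to split by hand, and a comma inside
# a name is indistinguishable from a separator.
members_delimited = ET.fromstring(delimited).findtext("members").split(", ")

print(members_nested)
print(members_delimited)
```

The nested version yields two members; the delimited version yields three, because the comma inside the first name is treated as a separator.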
Not Wrapping Repeated Elements
e.g.
<artist>
<name>Arcade Fire</name>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
<album>Funeral</album>
<album>Neon Bible</album>
</artist>
Once again a nightmare to parse and it's not exactly human readable. XML in this structure will also result in an ambiguous schema, which in turn plays havoc with auto-code generators (e.g. JAXB). This is much better:
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
<albums>
<album>Funeral</album>
<album>Neon Bible</album>
</albums>
</artist>
Repeating Element Names in Different Contexts
Consistency is something all developers should strive for every day, and in something as self-contained as an XML schema it really shouldn't be that hard. One place XML schemas often fall down is in the names that are chosen for attributes and elements. In the example below the element named "artist" appears in two places, each of which has its own meaning and structure.
e.g.
<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<artist>Arcade Fire</artist>
</track>
</album>
A better structure would be:
<album>
<artist>
<name>Arcade Fire</name>
<members>
<member>Win Butler</member>
<member>Régine Chassagne</member>
<member>Richard Reed Parry</member>
<member>William Butler</member>
<member>Tim Kingsbury</member>
<member>Sarah Neufeld</member>
<member>Jeremy Gara</member>
</members>
</artist>
<track>
<name>Intervention</name>
<track-artist>Arcade Fire</track-artist>
</track>
</album>
A logical and consistent layout will always make the parser simpler and the XML easier for humans to read. The opposite of this crime is just as bad: this variant involves not using the same element/attribute name for fields that are clearly the same, e.g. the "name" and "track-name" elements in the example below.
<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<track-name>Intervention</track-name>
</track>
</album>
A much more logical structure is to use the same "name" element in both locations.
<album>
<artist>
<name>Arcade Fire</name>
</artist>
<track>
<name>Intervention</name>
</track>
</album>
Abbreviating Element/Attribute Names
Yes, XML is very verbose and files can become very big, but abbreviating element or attribute names to save space is not a good idea.
<a>
<n>Arcade Fire</n>
</a>
Firstly, if you're concerned about size, why are you using XML? Secondly, XML contains a lot of repeated text, so it compresses very well. This XML is definitely not human readable, and any code that works with it will be more difficult to maintain.
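The compression claim is easy to check for yourself. The Python sketch below builds a synthetic document of 1000 artist entries (made-up data, so the ratio will be better than for real-world files) and compares the raw and gzip-compressed sizes using the standard gzip module.

```python
import gzip

# Build a verbose, repetitive document with fully spelled-out element names.
entries = "".join(
    f"<artist><name>Artist {i}</name><country>Canada</country></artist>"
    for i in range(1000)
)
document = "<artists>" + entries + "</artists>"

raw = document.encode("utf-8")
compressed = gzip.compress(raw)

# The repeated tag names cost very little once the file is compressed.
print(len(raw), len(compressed))
```

If size on the wire matters, compressing the transport is a far smaller change than renaming every element to a single letter.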
Key Value Lookup
The worst XML schema I have ever seen. I think an example is enough of a description:
<root>
<entry>
<key>Artist</key>
<value>Arcade Fire</value>
</entry>
<entry>
<key>Country</key>
<value>Canada</value>
</entry>
</root>
This really flies in the face of everything that XML represents; you may as well just have a plain text file.
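To show what this costs the consumer, here is a small Python sketch using xml.etree.ElementTree. The "named" document is my own assumption of what a sensible equivalent might look like. With key/value entries, every lookup has to search by the text of a sibling element, or the whole document has to be rebuilt into a dictionary first.

```python
import xml.etree.ElementTree as ET

key_value = """
<root>
  <entry><key>Artist</key><value>Arcade Fire</value></entry>
  <entry><key>Country</key><value>Canada</value></entry>
</root>
"""

# Hypothetical equivalent with meaningful element names.
named = """
<artist>
  <name>Arcade Fire</name>
  <country>Canada</country>
</artist>
"""

kv_root = ET.fromstring(key_value)

# Key/value layout: find the entry whose <key> child has the right text,
# then read its <value> sibling.
artist_from_kv = kv_root.findtext("./entry[key='Artist']/value")

# Or rebuild the structure by hand before doing anything useful with it.
as_dict = {e.findtext("key"): e.findtext("value") for e in kv_root.findall("entry")}

# Named layout: the element name is the lookup.
artist_from_named = ET.fromstring(named).findtext("name")

print(artist_from_kv, as_dict, artist_from_named)
```

A schema for the key/value layout also cannot say that "Country" is required or what type its value is, since every entry looks the same to a validator.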
The two rules to stick to when designing an XML schema are:
- Ensure the XML is machine readable
- Ensure the XML is human readable
Please, if you're ever designing an XML schema, remember that both machines and humans need to be able to easily read the XML. Just think about it a little bit and don't go with whatever's easiest right now, otherwise others will have to live with your crimes.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)










Comments
Walter Laan replied on Fri, 2008/07/04 - 4:02am
So how do you tell the difference when to use name instead of track-name, but track-artist instead of artist? I guess because artist can have sub tags and you want to prevent a loop? Would a better track artist tag be an artist-reference that points back to the main artist tag?
Emmanuel Bourg replied on Fri, 2008/07/04 - 7:20am
[quote]Having Elements That Only Contain Attributes
Why is this better? Firstly, it's the way it was intended, secondly it's easier to parse, and finally it's more extendable. A key component in interoperability is consistency, having some values in attributes and others in elements breaks this rule.[/quote]
Parsing attributes is actually easier than parsing elements, try writing a SAX handler for your two examples and compare.
aaaaa aaaaaaaa replied on Fri, 2008/07/04 - 2:53pm
Nos Doughty replied on Sat, 2008/07/05 - 12:51am
Firstly, I am also often in the situation described by the author - the smaller company lumbered with the dirty work.
However, I feel that most of the guidelines would be best described as a 'style guide' for the author's preference. I suspect many developers would agree with him, and therefore this article is very useful. However, I do not believe his viewpoints represent a 'best practice', as such a thing is not possible with XML for the reason I will detail below.
Consider this: I find myself in agreement with Emmanuel Bourg on the ease of parsing attributes - which is why I *personally* prefer to use attributes - the difference can be substantial. On the other hand, the statement of Sam's that the original intent was to use nested elements is certainly correct. Why is this the case?
Simple. What most people who work with XML seem to forget is that it was *never* intended as a data 'container' format to share data between applications. It was designed as a simplification of SGML to replace HTML (all the style info was to be in XSL). Basically, to add 'metadata' to relatively free-form text streams. This is actually a big problem for the writers of XML processing libraries. For example, the structure:
<ParentElement>
    <ChildElement/>
</ParentElement>
is actually represented internally as:
<ParentElement><TextNode>[CR][TAB]<ChildElement/>[CR]</TextNode></ParentElement>
You can see why this is when you remember the original use case of XML:
<Title>My Stuff<linebreak/>By Me</Title>
Now obviously, in the normal case of using Java libraries to marshal/de-marshal object structures to/from XML, the developer does not want to see those spurious <TextNode>'s and so the libraries will skip them, but in fact it should be remembered that XML is actually a terrible format for representing object graphs, and any use of it for that purpose will involve trade-offs which will vary given the context of the work at hand.
Other quick points:
1) Please don't encourage XSchema. It should be used only when absolutely necessary. The sooner we can put a stake through its heart the better. I suggest Schematron or RELAX-NG. I would personally prefer to receive an XML file with no XSchema, which allows me to create a RELAX-NG version and give it back to the XML authors with instructions on how to use it.
2) Many of your points such as 'wrapping repeated elements' and the 'repeating names in different contexts' etc are irrelevant when using XPath to examine the document structure. Dom4j has fantastic XPath support. In fact, XPath is in my opinion the only reason to use XML over other formats more suited to representing object graphs such as JSON.
Still, a nice article with some good advice for creating nicely 'human readable' XML files that many developers could benefit from. As such I would recommend these guidelines more for the case of using XML for configuration files than for B2B transmission, which (a) should not involve human interaction except for debugging and (b) is a case where speed of processing and transmission is more important than readability.
Regards
holyroller replied on Sun, 2008/07/06 - 1:08am
phil swenson replied on Mon, 2008/07/07 - 9:34am
in response to: nd7023
Nos Doughty +1
I'd much rather get an example XML snippet than a schema.... XML schemas are a complicated, brittle, overly-architected POS. Same could be said for SOAP :)
Sam Cavenagh replied on Sun, 2008/07/06 - 7:01pm
Thanks for your input. When to use attributes is a very subjective topic and perhaps it is a bit harsh of me to call their use a crime. That said, I do *personally* think they should be used sparingly. An element without a value just seems wrong to me, and I do think elements facilitate a more extendable structure.
e.g. We need to add an additional property to one of our values.
<artist type="band">
<name>Arcade Fire</name>
<country continent="North America">Canada</country>
</artist>
I haven't had any exposure to Schematron or RELAX-NG and I will have a look into them when I have some time; that said, XSD and DTD seem to be more widely used and therefore have better tool support. One reason I added this point is that in the past I have done development work where everything’s fine, and then 5 mins after it goes live it breaks because a previously unseen XML structure comes through.
Holyroller:
As far as XML validation goes, I meant that more from the receiver’s side rather than the sender’s side.
Phil:
In an earlier draft I did have "not providing example XML" but I dropped it for some reason. In reality you always need an example and a schema.
Jason Osgood replied on Mon, 2008/07/07 - 5:27am
Regarding any use of XML:
"A strange game. The only winning move is not to play."
-- Joshua, War Games, 1983
Bruno Vernay replied on Mon, 2008/07/07 - 5:47am
+1 for Relax NG or even DTD.
XSD are very often misused because they are so cumbersome and people rely exclusively on tools to create them and create too many or too few constraints.
A few XML examples are generally a very good idea too as well as some xpath expressions.
Providing room for extensibility and putting constraints where really needed can be good too.
But the title is too bold. It is a problem when you start to say that others misuse your techno. I have seen a lot of this about REST services generally.
That said, if they could create a simple profile for XML that would be suited for data, it would be nice. XML looks simple at first, but can be really tricky. Just search about XML canonical form.
Tero Vaananen replied on Tue, 2008/07/08 - 8:31am
This is a good list and I have learned many of these lessons the hard way. One thing that could be added is the use of references (either the XML native kind or your own).
There is often pressure to use references extensively with repeating elements, by providing an ID to some entity that is described somewhere else in the document. When doing this, the nature of XML is often forgotten and everything becomes a reference. I have seen this when people try to describe objects or database constructs in XML.
The problem with XML native references is that they require unique IDs to refer to. These will inevitably be different from the 'natural' IDs of the XML entities we are describing, and are only unique in the XML document they are described in. If you want to transform and build new XML documents from other documents, it becomes exceedingly hard to keep things together. It's just not worth using them in this manner.
When you use your own ID references, it is actually very hard to validate, and you require business logic to understand the document. Also, tools like XSL work poorly with such constructs.
Naturally, you still have to use references, but I usually avoid them as long as I can. Even when I do use references, there is some information about the entity I am referencing built in with the references so I can read the documents. Nobody can understand documents that have ID soup referencing things left and right - referencing what, where?
Guillaume Aubert replied on Fri, 2008/07/11 - 10:17am
Thanks Sam for this article.
Your recommendations seem natural but I don't think that they always apply for real life cases.
Regarding XSD, it is a very good idea to have a pseudo grammar for describing your XML file structure but I am not sure it should be used by programs to validate your xml.
For example, we have a running system sending and receiving XML messages describing scientific measures, and we had to remove the XSD validator, which was too strict, as the messages received were not always compliant and our system evolves. It adds lots of complexity to support several XSD validators with different XSD versions (it can be a nightmare depending on the tools you use).
So instead we validate our messages with a home made validator that is more flexible as we can easily change it and adapt to real life cases.
It can also contain semantic validations that you will not avoid with or without XSD anyway. You could say that it is a problem of tools but maybe it is because XSD is cumbersome.
Life is all about compromises. After all, you want to use XML for solving your problem and not focus on XML only.
On the other hand, our XSD grammar is still maintained and provides a nice way to generate tool skeleton for accessing our data and also understanding the data structure. We are also migrating to Relax-ng which is more natural and hasn't been designed by Vulcans.
We also love XPath which is used by our bespoke validator and all our tools that are extracting information from the XML message. Xpath is probably the jewel of the xml tools.
Next point is should you use delimiters in your xml tag (space, comma) or separate all your information with a tag ?
I think it really depends.
In our messages, we encode a list of numbers and of course we do not intend to represent it like that: