xml:space and xml:lang

The W3C Recommendation for XML 1.0 contains two attributes in the xml namespace that didn't appear in time to be covered in XML: A Primer. The first, xml:space, helps applications determine whether they should pay attention to white space, as I pointed out on page 49 but didn't actually get to include in Chapter 5. The second, xml:lang, is designed to make it easier to present documents with content in multiple languages, supplementing Unicode's ability to represent text in many different scripts.


Section 2.10 of the XML Recommendation defines a new attribute that allows elements to declare to an application whether their white space is 'significant'. This will probably receive extensive use in combination with XSL or perhaps the CSS white-space property to display documents correctly. All XML processors must already pass every non-markup character to the application, and validating processors must also tell the application which of those characters are white space appearing in element content. This attribute acts as a flag, telling the application whether it should pay attention to the white space characters.

Note: It remains up to the application whether it actually does anything with the white space characters. While I expect that browsers and some other XML display applications will take heed of xml:space, many other applications will find it irrelevant.

Like any other attribute, xml:space must still be declared if it is used in a valid document. The declaration takes the following form:

<!ATTLIST element xml:space (default|preserve) 'defaultchoice'>

The xml:space behavior is inherited from parent elements; if an element with an xml:space value contains other elements, they too will handle white space as specified by the parent element. This can be overridden by a new xml:space attribute in the child elements.
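The inheritance rule can be sketched in a few lines of Python. This is only an illustration of how an application might resolve the effective setting, not something the Recommendation prescribes: the element names (BOOK, TITLE, NOTE, PARA) and the function name are made up for the example, and it assumes that an element with no xml:space anywhere above it is treated as 'default'.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved xml namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(element, inherited="default"):
    """Yield (tag, mode) for an element and all of its descendants.

    An element's own xml:space value wins; otherwise it takes the mode
    in force for its parent. Starting from 'default' at the top is an
    assumption of this sketch.
    """
    mode = element.get(XML_SPACE, inherited)
    yield element.tag, mode
    for child in element:
        yield from effective_space(child, mode)

# Hypothetical document: BOOK preserves white space, NOTE overrides it.
DOC = """<BOOK xml:space="preserve">
  <TITLE>Commentaries</TITLE>
  <NOTE xml:space="default"><PARA>text</PARA></NOTE>
</BOOK>"""

for tag, mode in effective_space(ET.fromstring(DOC)):
    print(tag, mode)
```

TITLE picks up 'preserve' from BOOK, while PARA picks up 'default' from the override on NOTE.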

Because a default can be set in the DTD, it's simple to create documents that pay attention to white space by default; just set the default value of the xml:space attribute for the root element to 'preserve'. For more consistent results (remember, parts of your XML document may be returned through XML-Link), assign this as the default value to all of your element types.
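To see a DTD default in action, here is a small sketch using Python's xml.parsers.expat module, a non-validating parser that still reads the internal DTD subset and supplies declared attribute defaults. The POEM element and the helper function are hypothetical names chosen for the example.

```python
import xml.parsers.expat

# Hypothetical document: the internal DTD subset gives POEM a default
# xml:space value of 'preserve', so the start tag need not repeat it.
DOC = """<?xml version="1.0"?>
<!DOCTYPE POEM [
<!ELEMENT POEM (#PCDATA)>
<!ATTLIST POEM xml:space (default|preserve) 'preserve'>
]>
<POEM>  roses
    are red  </POEM>"""

def start_tag_attributes(document):
    """Return (element name, attributes) pairs as reported by the parser."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Without namespace processing, the attribute name arrives as 'xml:space'.
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(document, True)
    return seen

print(start_tag_attributes(DOC))
```

The parser reports xml:space="preserve" for POEM even though the start tag in the document carries no attributes at all.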


The xml:lang attribute gives XML authors a consistent way to identify the language contained within a particular element. Combined with XML's support for Unicode, this should make it easier to present internationalized versions of information. Developers can create documents with built-in translations, or make it easier for applications to know when to provide a translation. For example, suppose someone were trying to present quotes in Latin, with English descriptions:


<DESCRIPTION xml:lang="en">
Caesar begins by describing the geography of Gaul.
</DESCRIPTION>

<QUOTE xml:lang="la">
Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
</QUOTE>

<EXPLANATION xml:lang="en">
It isn't the most thrilling opening to a great work on war, but it does explain some key issues to Romans who probably have never been anywhere near Gaul.
</EXPLANATION>


In this case, I used "en" to indicate English and "la" to indicate Latin. An application equipped with a Latin translator might be able to convert the Latin section, when prompted (or not), into something resembling:

All of Gaul is divided into three parts, one of which is inhabited by the Belgae, another by the Aquitani, and the third by those who are called Celts in their own language, and Gauls in ours.

A translation program would probably produce something more literal, but you get the idea. The xml:lang attribute works much like the xml:space attribute, applying to the content of child elements as well as the element in which it is actually used. If xml:lang is used in a valid document, it must be declared. (This declaration is useful for declaring default languages as well.) The following declaration syntax will create an attribute without a default value:

<!ATTLIST element xml:lang NMTOKEN #IMPLIED>

To give the QUOTE element a default language of English, the following declaration would be appropriate:

<!ATTLIST QUOTE xml:lang NMTOKEN 'en'>
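Because xml:lang applies to child elements as well as the element that carries it, an application looking for, say, every Latin passage has to track the inherited value as it walks the tree. The following Python sketch shows one way to do that; the COMMENTARY wrapper, the EMPH element, and the function name are assumptions made for the example, not part of any DTD discussed here.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved xml namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def elements_in_language(element, wanted, inherited=None):
    """Yield every element whose effective xml:lang matches `wanted`.

    An element's own xml:lang wins; otherwise it inherits its parent's.
    The comparison ignores case, as language codes are not case-sensitive.
    """
    lang = element.get(XML_LANG, inherited)
    if lang is not None and lang.lower() == wanted.lower():
        yield element
    for child in element:
        yield from elements_in_language(child, wanted, lang)

# Hypothetical document: English by default, with one Latin quote.
DOC = """<COMMENTARY xml:lang="en">
<DESCRIPTION>Caesar begins by describing the geography of Gaul.</DESCRIPTION>
<QUOTE xml:lang="la">Gallia est omnis divisa in partes <EMPH>tres</EMPH></QUOTE>
</COMMENTARY>"""

root = ET.fromstring(DOC)
print([e.tag for e in elements_in_language(root, "la")])
print([e.tag for e in elements_in_language(root, "EN")])
```

EMPH has no xml:lang of its own, but it is reported as Latin because it sits inside the Latin QUOTE.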

This could still be overridden for use with Latin, French, German, or quotes in any other language with a code defined in ISO 639 or registered with the Internet Assigned Numbers Authority (IANA). ISO 639 codes can be used directly, and may be followed by a country code to define the language more precisely: "en-GB", for instance, as opposed to "en-US". IANA codes must be prefixed with 'i-' or 'I-'; other codes may also be used, but must be prefixed with 'x-' or 'X-'. Unlike most of the rest of XML, none of these codes are case-sensitive. (This is explained in greater detail in IETF RFC 1766.)
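The prefix rules above are simple enough to sketch in Python. This is a rough classifier built only from the rules as described in this section, not a full RFC 1766 validator; the function names and the category labels ("iso639", "iana", "user", "unknown") are my own, and it does not check that a two-letter code is actually listed in ISO 639.

```python
def classify_language_code(code):
    """Classify an xml:lang value by its first part.

    Returns (kind, normalized), where normalized is the lower-cased code,
    suitable for case-insensitive comparison.
    """
    normalized = code.lower()
    primary, _, rest = normalized.partition("-")
    if primary == "i" and rest:
        kind = "iana"      # registered with IANA, prefixed with i- or I-
    elif primary == "x" and rest:
        kind = "user"      # private codes, prefixed with x- or X-
    elif len(primary) == 2 and primary.isascii() and primary.isalpha():
        kind = "iso639"    # two-letter code, optionally followed by a country
    else:
        kind = "unknown"
    return kind, normalized

def same_language_code(a, b):
    """Compare two codes without regard to case."""
    return a.lower() == b.lower()

for code in ("en-GB", "la", "I-navajo", "X-klingon", "english"):
    print(code, classify_language_code(code))
```

An application comparing codes would use something like same_language_code("en-US", "EN-us"), which treats the two spellings as the same language.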

The xml:lang attribute provides more information than the bare Unicode data, and may save applications a lot of time determining which language is used for a particular element. It isn't a cure-all, though; effective use of this attribute will require consistent application support and probably some fairly complex style usage. The browser developers will hopefully seize on this opportunity to sort out some of the language confusions currently pervading the web. Combined with Unicode, this attribute makes it possible to deliver on many of XML's key promises for easier internationalization.