Copyright 1998 Simon St.Laurent

Letting go: the futures of XML and SGML

Please note: the following analysis is clearly speculative. While I am cheerfully optimistic about XML's prospects, it is always possible for the fortune teller's ball to turn cloudy. Many thanks to Len Bullard and Paul Prescod for disagreeing vehemently with many of my past discussions (and probably the present one as well), forcing me to rethink them and attempt to state them more clearly. Bob DuCharme and W. Eliot Kimber also provided insights that have strengthened the essay.

Extensible Markup Language (XML) is a young standard, made official only a few months ago. Very few applications for it yet exist; to my knowledge, no 'shrink-wrapped' applications are yet available. Still, XML is making some big waves in the media, and interest has certainly grown over the past year. XML promises to make exchanging data far easier, even between wildly different applications and application platforms. XML offers a simple standard format, easily read by both humans and machines, which can be turned to a large number of uses. The overhead required to implement XML applications is reasonably low, though programmers will have to create new applications, or at least import filters, to support it. XML promises ubiquitous information, in forms that can be readily reused, retransmitted and standardized.

XML's parent, SGML, promises the same thing. XML, created by the World Wide Web Consortium (W3C), is a subset of Standard Generalized Markup Language (SGML), created by the ISO, the International Organization for Standardization. Developers who have worked previously with SGML can move into XML with relatively little difficulty. For the most part, SGML users need to learn which parts the W3C left out, and examine the impact that removal has on document structures. Those who were skilled at creating SGML Document Type Definitions (DTDs) should be able to do the same work in XML, once they take note of the tools that have been removed from the toolbox.

The similarities between the two languages, in syntax, origin, and possible application, have led several members of the community to deny any separation between the two markup languages. From their perspective, XML is SGML - less SGML perhaps, but still SGML. Consortia, publications, newsgroups, and mailing lists currently mix discussion of XML and SGML, blurring the distinctions further. While treating XML as SGML may benefit these people and organizations, this blurring threatens the simplification that was a key reason for creating XML in the first place. That simplification has led the press and developers to give generalized markup a second chance at ubiquity.

Why Differentiate?

The SGML community has years of experience with the application of markup language, the development of workable structures, the implementation of systems for managing complex sets of information, and the theoretical know-how to make these things work. The XML recommendation was in fact created through the efforts and cooperation of a significant number of SGML experts. SGML offers a huge array of prefabricated structures that may certainly be applied to XML projects without major modification. If XML is simply a subset of SGML, why should it be treated any differently?

The answer is simple: XML is simple. The 'simplicity vector,' as Paul Prescod described it, is enough to give XML the boost it needs to become ubiquitous. The learning curve for XML is far shorter; the complete syntax can be explained and explored in a day to a technical audience. The basic syntax can be explained in twenty minutes.

Making the most of XML's simplicity provides two solid reasons for separating XML from SGML: branding and pedagogy. While some in the SGML community may wonder why XML, a mere subset, was given a separate name, that separate name gives markup a second chance. The technical media and publishers are willing to write about something new, something better, something that promises the ease of HTML with the power of SGML without being stuck with the drawbacks of either. SGML is not a hot topic in the technical press; XML is (presently) a darling. XML's simplicity differentiates it from its predecessor enough that it rates another chance to develop a market. At the same time, for pedagogical reasons, it makes sense to keep the optional features, the shortened tags, and the other extras of SGML out of documents describing XML. For many users, 'simple' XML will be all they ever want and ever need, and adding the complexities of SGML to the learning curve contributes little.

Although the XML specification itself is daunting, especially for those without a computer science background, it isn't that difficult to explain. This simplicity pays off for developers (as well as for authors) - XML parsers can be built quickly, with much less code. XML's distinction between 'well-formedness' and 'validity' lets developers use smaller parsers in situations where they don't need to validate document content, and more complex engines where document structures must be examined.
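As a rough illustration of that split, here is a minimal sketch in Python (my choice for illustration, using its standard xml.etree.ElementTree module). The first function checks only well-formedness. The second is a hand-rolled stand-in for validation, not a DTD validator: the MEMO_MODEL table and the memo, to, and body element names are hypothetical, and a real validating parser would read the allowed structure from a DTD instead.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML. No DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical content model: which child elements each parent may contain.
MEMO_MODEL = {"memo": {"to", "body"}}


def uses_only_known_elements(text, allowed_children):
    """Crude stand-in for validation: check every child against the table."""
    root = ET.fromstring(text)
    for parent in root.iter():
        allowed = allowed_children.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                return False
    return True


good = "<memo><to>Len</to><body>Hello</body></memo>"
mismatched = "<memo><to>Len</body></memo>"        # not well-formed
unexpected = "<memo><to>Len</to><cc>Paul</cc></memo>"  # well-formed, wrong structure
```

The third document shows why the distinction matters: a lightweight parser accepts it without complaint, and only an application that cares about structure needs to pay for the second check.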

The 'small and easy' approach taken by the creators of XML, and the community developments that have surrounded it, have multiple payoffs. Application developers can build editors and parsers according to structures that are strictly required and therefore can be counted upon. Developers don't even need to create their own parsers. A considerable number of parsers are available for use, and may even be used through a lightweight Java Application Programming Interface (API), the Simple API for XML (SAX), making it easy to switch parsers in and out of an application to support particular situations, platforms, or applications. By removing many of SGML's less-used and more complex features, XML gives developers and authors an easier time creating documents and systems for managing them.
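To give a feel for the event-driven style SAX encourages, here is a small sketch using xml.sax, the SAX interface in Python's standard library, rather than the Java API discussed above. The ElementCounter class and count_elements function are names I made up for illustration. The parser_modules argument shows the kind of parser swapping described above: the application code stays the same while the list of preferred parser modules changes.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts how many times each element name starts in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(text, parser_modules=()):
    # make_parser tries the named modules first, then falls back to the
    # default parser, so the handler code never needs to change.
    parser = xml.sax.make_parser(list(parser_modules))
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(text.encode("utf-8")))
    return handler.counts


document = "<list><item/><item/></list>"
default_counts = count_elements(document)
expat_counts = count_elements(document, ["xml.sax.expatreader"])
```

Both calls should report one list element and two item elements; only the choice of underlying parser differs.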

This approachability stands in contrast to SGML's reputation for complexity, cost, and difficulty. While I am not aware of any opinion polls taken on the popularity of SGML, my personal experiences suggest a clearly negative opinion on the part of many web developers. I gave a seminar on Dynamic HTML for the ACM in Washington, DC, a city known for its extensive use of SGML. No one in the audience claimed to know or use SGML or XML. During the discussion of Cascading Style Sheets, I mentioned XML, and had to explain its background and uses. Responses from the audience included 'Does this mean we need to learn that crap?' and 'Oh no, not SGML again.' At a webmasters' meeting in Raleigh, North Carolina, I announced that I was working on a book on XML. Several people dropped by after the meeting to present their horror stories about SGML, that 'bloated beast.' While these may not reflect on SGML's actual popularity, and they certainly don't reflect on its usefulness, they suggest that SGML has a significant image problem, at least among web developers.

This image problem and the much simpler approach taken by XML are, in my mind at least, two excellent reasons for giving XML more status than 'subset of SGML.' XML doesn't need to be tarred with SGML's reputation for complexity, especially when it is XML's very simplicity that gives XML the opportunity to be ubiquitous.

Note: Some of SGML's reputation comes from SGML's status as a meta-language. Developers don't use SGML directly to create documents. Instead, they use SGML to create document structures that are then applied by authors, programmers, and readers. Information modeling is a difficult part of the task, though it can be made easier through the use of standardized document type definitions. If SGML's main failing was just that meta-languages are hard, then XML has little more chance of succeeding than SGML. XML shares a similar difficulty, though it has far fewer loose parts roaming its syntax. XML's stronger demand for conformance to particular data structures reduces the options for designers, making it more difficult to create 'obfuscated DTDs' and other strange creatures. This is, however, an area where the SGML community, at least at first, will have more experience than newcomers. Hopefully they will apply their skills to XML in ways that preserve users' perceptions of XML as a relatively simple tool.

Toward Ubiquity

XML has the opportunity to take advantage of several additional factors that have been denied SGML so far. The first is the World Wide Web, which has so far remained firmly based on HyperText Markup Language (HTML), an application (not a subset) of SGML. HTML has introduced the concepts of tags, elements, and attributes to millions of people across the globe. Although HTML began simple, it has grown more and more complicated. The cost of keeping up with (or ahead of) that complexity has reduced the number of widely available browsers to two (Netscape Communicator and Microsoft Internet Explorer), with two others (Opera and Lynx) also in significant contention. Developing browsers has been made more complicated by the forgiving nature of existing browsers, which have allowed users to omit end tags, include incredibly malformed code, and nest tags in strange combinations. Netscape and Microsoft have been fighting a strange battle to match each other's bug handling in a vain effort to give developers a comfortable and friendly home. Meanwhile, that growing complexity has created an industry of manual-makers turning out tomes on HTML syntax and usage.

XML enters this situation from a promising position. Because the standard was originated by the W3C, web developers and journalists have kept a closer eye on XML. XML is already in use as the foundation for several standards moving into use on the Web, including Microsoft's Channel Definition Format (CDF), and the upcoming Presentation Graphics Markup Language (PGML) and Synchronized Multimedia Integration Language (SMIL). 'Display' XML is under development in the Netscape Mozilla 5.0 browser, which will take advantage of James Clark's publicly-available Expat parser. In Mozilla, XML can be combined with Cascading Style Sheets (CSS) to build pages with the structures of XML and the appearance of HTML. XML could also join the fray of useful MIME types for email, making it possible for coworkers to exchange electronic data in an open format that is readily imported into applications.

XML also has a significant advantage for international use: Unicode support is mandated. Many developers in the United States and Europe will never need to use it, but developers in other countries will find their task made easier by a standard that already includes the hooks for displaying information (and structuring information) in multiple languages.

SGML could have done all of these things; my point is simply that it hasn't, and isn't likely to - except through XML. Because XML has dramatically lowered the bar for entry, more applications are likely to develop which rely on XML. XML is a convenient interchange format, as well as a relatively friendly document format. Developers can take XML parsers off the shelf (or off the net), plug them into their applications, and go, after climbing a very gentle and very short learning curve.
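A rough sketch of what 'plug in a parser and go' can look like for data interchange, again in Python with its standard xml.etree.ElementTree module. The contact element, the field names, and the two helper functions are all hypothetical; the point is only that one application can write a record as XML and another can read it back with an off-the-shelf parser.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Serialize a flat dict of strings as a <contact> element."""
    root = ET.Element("contact")
    for field, value in record.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(text):
    """Read a <contact> element back into a flat dict of strings."""
    return {child.tag: child.text for child in ET.fromstring(text)}


original = {"name": "Ana Žužek", "city": "Raleigh"}
exchanged = xml_to_record(record_to_xml(original))
```

The non-ASCII name is deliberate: it survives the round trip without any special handling on either side.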

The Impact of Ubiquity

If my primary assumption so far is correct - that XML's simplicity and friendly learning curve will lead to many new implementations - then it may reasonably be assumed that XML will spread beyond the current community of SGML users, reaching a larger base of developers and users. If XML makes successful inroads into the Web page development market (as I hope it will), that base could become considerably, or even enormously, larger.

Bob Metcalfe, the creator of Ethernet, has coined a law (Metcalfe's Law) which describes the impact of growth on the usefulness of networks:

The usefulness of a network grows with the square of its number of users.

The same can be said for many technologies, especially a technology like XML where users will need to exchange data. At present this effect helps sustain the sales of Microsoft's Office monolith, because file formats like Word and Excel are commonly used to exchange documents between individuals and companies. Although neither Word nor Excel is in common use as a format for Web pages, they are de facto standards on many corporate networks.

As the usefulness value increases, more users will take advantage of the technology, especially if costs are low relative to the usefulness. This is where XML has the opportunity to pass SGML more dramatically. Because the entry costs (learning and tools) are lower than those of SGML, more users with specific needs for the 'usefulness' of XML can enter the market early. This movement accelerates the overall usefulness of XML, creating a 'bandwagon' or 'network' effect that draws users into XML.
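A quick back-of-the-envelope calculation shows why early growth matters so much. Metcalfe's Law is usually explained by counting the possible pairwise connections among n users, n(n-1)/2, which grows roughly as the square of n. The sketch below (the function name is mine, and 'possible exchanges' is only a proxy for usefulness) counts the pairs of users who could exchange documents in a shared format.

```python
def pairwise_links(users):
    """Number of distinct pairs of users who could exchange documents."""
    return users * (users - 1) // 2


small_community = pairwise_links(10)
larger_community = pairwise_links(100)
```

Ten users give 45 possible exchanges; a hundred give 4,950. Ten times the users yields about a hundred times the potential exchanges, which is the acceleration the bandwagon depends on.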

As the bandwagon advances, the price of tools should decline. More producers will be lured into the market, selling to a larger base of customers. Competition and the ability to spread development costs across a larger number of customers should bring XML editor and application program prices down from the exotic levels they hold today. This, in turn, lowers the barriers once more, encouraging more users to join the XML community.

Note: That bandwagon will be broken up to some extent by the different DTDs that will be applied by different user communities. Nonetheless, the use of XML as the basis for those standards allows for the use of common tools and knowledge bases.

This is the bandwagon that SGML hasn't been able to get started. SGML is used for an incredible number of documents and document types, but mostly by organizations (the Department of Defense, the Internal Revenue Service, IBM, the airlines, some publishers, chip manufacturers and others) with specific needs that have required a heavy-duty solution. If the benefits were great enough, the relatively high cost of an SGML solution posed little difficulty. Smaller organizations have stuck to their myriad proprietary formats for the most part, finding that the cost advantages of off-the-shelf software outweigh the benefits of a (possibly more effective, but more difficult) SGML solution.

Potential Stumbling Blocks

XML has already passed the first stumbling block - the proposed recommendation became a W3C recommendation on February 10, 1998. Microsoft, Netscape, Sun, and all the other developers now have a playing field, probably even a level playing field. Still, a number of possible stumbling blocks remain.

Learning Curve Confusions

One of the key problems facing XML developers right now is a lack of information. Most of the information currently available on XML is from SGML-oriented sources. While they certainly have an advantage in experience, sorting out which parts of a book are XML and which are SGML-only can be difficult for novices. At the same time, several of the information sources available expect users to have considerable background information already at hand, about either SGML or other computer science topics.

This is changing rapidly; many books are becoming available that address XML beginners' needs. At the same time, forums for XML discussions are being set up, ranging from mailing lists (XML-L and the more advanced XML-DEV) to a newsgroup (comp.text.xml, currently under discussion).

Another significant problem, however, is the repurposing of SGML tools to meet XML needs. While this is a perfectly rational and fairly simple thing to do, SGML vendors would do well to arrange their tool sets so that XML developers don't need to worry about stumbling into SGML technologies that are incompatible with XML.

Integration with Existing Technologies

The primary web implementation of XML to date, the XML parsers built into Microsoft Internet Explorer 4.0, is useful but inadequate. Developers have to create application code to use XML in Web pages at this point. 'Display' XML doesn't yet exist in that context. Mozilla 5.0 promises such support; Internet Explorer 5.0 may also contain it.

So far, XML isn't readily available for other applications. Integration remains in the hands of developers. Lotus and Microsoft have promised XML support for their core business applications; the nature of that support is not yet clear. Other applications, like databases, that could make extensive use of XML, still haven't been fitted with XML tools. Given the relative youth of the standard, this isn't surprising. Still, much of the future success of XML rests with the early implementors - building a bandwagon is difficult without success stories. The SGML community, with its years of experience doing information modeling, has a clear opportunity here to leverage itself into XML.

Overpriced Tools

SGML has been well-known for its gilded price tags. While the cost of HTML design systems has fallen dramatically, SGML solutions continue to cost hundreds, thousands, or even tens of thousands of dollars. SGML's position as an open standard has at least opened up the field to competing vendors, but until those vendors are serving enough customers, the prices of tools are likely to remain high.

The SGML vendors are definitely looking at XML as an opportunity to expand their field. I hope as part of that, they lower their prices to expand the field, seeking volume rather than margin. Many of the XML tools currently available (including those from members of the SGML community) have no price tag at all; hopefully this kind of pressure will make the commercial vendors offer tools at lower prices.

Standards Squabbling

The W3C has taken the lead on XML, while the ISO has looked on, mostly favorably. While it is unlikely that XML syntax will require major revisions that will make it incompatible with SGML, pushes for changes to XML are likely to come from user bases outside the traditional SGML community. Users with no prior experience with SGML's structures may make requests that seem incompatible with the norms of SGML, or which conflict with the opinions of those who have (so far) made the decisions. I hope that this kind of conflict is far-distant; nevertheless it must be considered.

While XML's basic syntax is complete, many of its supporting technologies are far from finished. XML Linking (XLink) recognizes the influence of the SGML HyTime standard, but doesn't define itself as a subset. XPointers are inspired to some extent by the Text Encoding Initiative (TEI) Extended Pointers standard, but include significant modifications, detailed in Appendix B of the draft. Extensible Style Language (XSL) is based on the Document Style Semantics and Specification Language (DSSSL), but the original proposal also acknowledges "usability issues which have led XSL to diverge in various ways." Rules for styles, data, and programming models are yet to be determined. Conflicts over their development, whether between competing vendors or intellectual schools, could further slow XML and muddy the prospects of its wide acceptance.

Conclusions: XML != SGML

The critical importance of a gentle learning curve suggests that SGML, with its reputation for a steep learning curve, should keep its distance from XML, in public if not in standards meetings. The divergences between the supporting XML standards (XLink, XPointers, XSL) and their SGML equivalents (HyTime, TEI, DSSSL) suggest that XML is in fact beginning to develop separate practices and will need a separate set of application tools. The need to broaden the audience of XML beyond the current SGML community (if XML is to attain ubiquity) suggests a need for texts, forums, and tools which are aimed at XML-only development, and aren't simply renamed SGML texts, forums, and tools.

XML is clearly positioned as the standard for the masses, while SGML will remain a uniquely powerful tool for developers who need its features. SGML use will likely grow as a result of XML's spread. Just as learning HTML makes learning XML easier, learning XML makes learning SGML easier. Still, by separating the two (in practice and semantics, if not in standards body politics), XML can reach a wider audience, cause less confusion, and reach ubiquity more quickly.

SGML is the honored parent of XML. XML is still in early childhood, and will need assistance from the SGML community for some time to come. Despite that parent-child tie, however, I strongly hope that the SGML community will let XML grow up on its own, as it acquires friends and finds new sources of excitement. Letting go is often difficult, but keeping this prodigy too close may stifle it. Acknowledging its separate identity and helping it grow will, in the long run, be better for both parent and child.


Comments? Suggestions?

Please contact Simon St.Laurent

