A Proposal for the Representation of XML DTDs as XML documents

Author: Simon St.Laurent (simonstl@classic.msn.com)
Date: 19 May 1998

Please Note: This Proposal is being obsoleted by the XSchema initiative on the XML-DEV mailing list. Updated information is available at http://www.simonstl.com/xschema/

 

1.0 Abstract

Extensible Markup Language (XML) currently uses notation inherited from the Standard Generalized Markup Language (SGML) for its Document Type Definitions (DTDs). While this achieves compatibility with SGML, it makes it impossible to extend the capabilities of a DTD beyond those provided in SGML, and requires developers to understand two different syntaxes, one for documents and one for DTDs. This proposal offers an XML document representation for DTDs, which could either map directly to SGML DTDs or provide additional capabilities, making XML itself, in effect, extensible.

2.0 Status

This document has no official status. The author has no affiliation with the W3C, the organization developing and maintaining the XML standard, nor any affiliation with any W3C member organizations.

This document is also incomplete, intended as a springboard for further discussion rather than a complete proposal.

3.0 Relationship to Other Standards and Proposals

This proposal uses the document syntax presented in the W3C's XML 1.0 Recommendation of February 10, 1998 (available at http://www.w3.org/TR/1998/REC-xml-19980210) as its foundation, and refers extensively to the DTD syntax presented in that document as well. The XML-Linking (XLink) and XPointer proposals under development at the W3C are also essential elements of this proposal.

This proposal also bears some resemblance to the XML-Data W3C Note of January 5, 1998 (available at http://www.w3.org/TR/1998/NOTE-XML-data-0105/). Much of the syntax presented in that document could replace the sample syntax presented below. Despite syntactical similarities, however, this proposal is much less ambitious, proposing a mapping of SGML DTD syntax to XML document syntax rather than a new system of schemas and class hierarchies. While the more complex goals of XML-Data could eventually be realized through the extensibility of the system presented in this proposal, this proposal starts with much smaller steps that are hopefully easier to implement.

4.0 Overview and Rationale

XML 1.0 uses two very different syntaxes, one for documents and one for DTDs, as well as a variety of representations for linked content. Although the choice was made very early to preserve SGML compatibility, an alternative notation might enhance XML's simplicity and its extensibility. The alternative notation described in this proposal could still be mapped to existing SGML notation, while making it possible to extend XML in new directions. This notation is also intended to make XML more self-consistent, reducing the number of notations needed to represent similar constructs.

The advantages of this representation include:

1. Additional Extensibility. This notation would make it much easier to include information about data typing and other content issues in the same document as the DTD. A DTD in this format could more completely represent data structures in ways that correspond to the experiences of database developers and programmers.

2. Common Notation. Developers would only need to know one basic syntax for both XML documents and XML DTDs. Parsers could use the same set of tools for parsing documents and DTDs. DTDs could be managed, edited, processed, and stored with the same set of tools used to manage documents.

3. Common Linking Mechanism. Developers could use the same mechanisms for linking to external content that they use with their documents. Instead of external entity references, developers could use the XLink vocabulary.

4. Improved DTD Referencing. Using an XML document representation of a DTD allows developers to reference portions of a DTD using the tools developed for referencing XML documents - XPointers. This would allow for the creation of extremely flexible entity references, and make it easier to create documents describing DTDs. (Implementation of the XPointer mechanism would require significant additional development on the part of the parser developers.)

5. Improved DTD Documentation. The elements that define elements, attributes, and entities could contain several kinds of descriptive information for use by developers and authoring tools.

The disadvantages of this representation include:

1. Incompatibility with SGML. While this representation may be mapped to a DTD using SGML syntax, SGML applications are currently unable to perform such mapping. (The same may be said for XML 1.0 parsers.)

2. Verbose representation. This representation is more verbose than the current syntax.

3. Extensibility beyond SGML. This representation's extensibility would lead to DTDs which could only partially be mapped to SGML syntax. The SGML DTD would, in effect, contain only a subset of the information in the XML DTD.

 

5.0 Introductory Examples

Note: The examples below present one possible choice of syntax for DTD documents. There may be varying interpretations of the proper use and naming of elements and attributes for this purpose. All names are subject to change pending the discovery of more appropriate terms.

The following XML 1.0 declarations create an element, FIGURE, which has an optional DESCRIPTION attribute, must contain an IMAGE element, and may contain a CAPTION element.

<!ELEMENT FIGURE (IMAGE, CAPTION?)>
<!ATTLIST FIGURE
    DESCRIPTION CDATA #IMPLIED>

The same declarations, made in the new syntax, would read:

<ELEMENT TAG="FIGURE">
<CONTENTMODEL>IMAGE,CAPTION?</CONTENTMODEL>
<ATTRIBUTE NAME="DESCRIPTION">
<ATTCONTENT>CDATA</ATTCONTENT>
<ATTREQUIRED>#IMPLIED</ATTREQUIRED>
</ATTRIBUTE>
</ELEMENT>

This more verbose syntax carries the same meaning, but in a format that can be referenced (via XPointers) more easily as well as extended with additional elements.
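One way to check that the two forms carry the same meaning is to regenerate the XML 1.0 declarations from the element form. The following is a minimal sketch, not part of the proposal: it uses Python's standard xml.etree.ElementTree, the function name to_xml10 is hypothetical, and the defaults applied when a child element is missing (ANY, CDATA, #IMPLIED) are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# The sample declaration from this section, in the proposed syntax.
NEW_SYNTAX = """
<ELEMENT TAG="FIGURE">
<CONTENTMODEL>IMAGE,CAPTION?</CONTENTMODEL>
<ATTRIBUTE NAME="DESCRIPTION">
<ATTCONTENT>CDATA</ATTCONTENT>
<ATTREQUIRED>#IMPLIED</ATTREQUIRED>
</ATTRIBUTE>
</ELEMENT>
"""

def to_xml10(element_decl):
    """Render one ELEMENT element as a list of XML 1.0 declaration strings."""
    tag = element_decl.get("TAG")
    # Assumption: a missing CONTENTMODEL is treated as ANY.
    model = element_decl.findtext("CONTENTMODEL", default="ANY").strip()
    # The keywords EMPTY and ANY are written without parentheses.
    if model not in ("EMPTY", "ANY") and not model.startswith("("):
        model = "(" + model + ")"
    lines = ["<!ELEMENT %s %s>" % (tag, model)]
    for attr in element_decl.findall("ATTRIBUTE"):
        lines.append("<!ATTLIST %s %s %s %s>" % (
            tag,
            attr.get("NAME"),
            # Assumption: missing type and default fall back to CDATA / #IMPLIED.
            attr.findtext("ATTCONTENT", default="CDATA").strip(),
            attr.findtext("ATTREQUIRED", default="#IMPLIED").strip(),
        ))
    return lines

declarations = to_xml10(ET.fromstring(NEW_SYNTAX.strip()))
print("\n".join(declarations))
```

For the sample above this prints an ELEMENT declaration and a single-line ATTLIST declaration equivalent to the XML 1.0 form shown earlier. Any additional child elements that a DTD author added for documentation or data typing would simply be ignored by this mapping.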

For now, the content of CONTENTMODEL is still written using XML 1.0 syntax. This element could also contain further element and attribute elements if that seemed appropriate. At present, however, the XML 1.0 syntax for content models seems more flexible, especially for representing complex content models.

Entities pose more complex issues because of the sharp distinction between general entities (which are used in document content) and parameter entities (which are used in document type definitions). Because of the collapsing of syntax performed by this representation, there may no longer be a good reason to maintain this distinction apart from backward compatibility.

Using XML notation for entity declarations makes it much easier to create sets of large entities in a document. Element content within an XML-notated entity could also be referenced with XPointers - the ENTITY element is, after all, simply the parent element of its content.
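As a rough illustration of reaching into an entity's element content, the sketch below selects an entity by name and then one of its children. It uses the limited path syntax of Python's xml.etree.ElementTree as a stand-in; this is not XPointer. The DTD wrapper element, the Boilerplate entity, and its CAPTION content are all hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD document; the DTD root element and the entity below
# are assumptions for illustration and do not come from the proposal.
DTD_DOC = """
<DTD>
<ENTITY NAME="Boilerplate"><CAPTION>Figure pending review</CAPTION></ENTITY>
</DTD>
"""

root = ET.fromstring(DTD_DOC.strip())

# Select the entity by its NAME attribute, then address its element content.
entity = root.find("ENTITY[@NAME='Boilerplate']")
caption = entity.find("CAPTION")
print(caption.text)
```

Because the ENTITY element is an ordinary parent element, no special entity-aware machinery is needed to address its content; the same lookup would work on any XML document.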

The general entity:

<!ENTITY GenSample "This is a sample entity.">

could become:

<ENTITY NAME="GenSample">This is a sample entity.</ENTITY>

Similarly, the parameter entity:

<!ENTITY % ParamSample "IMAGE,CAPTION?">

could become:

<ENTITY NAME="ParamSample" TYPE="%"><CONTENTMODEL>IMAGE,CAPTION?</CONTENTMODEL></ENTITY>

At the same time, entity notation for external entities could shift to an XLink model from the current PUBLIC and SYSTEM models. (Note: this would require XLink to support a mechanism for including linked content in a document; 'embed' is one possibility, but its behavior is not strongly defined.)

The external parameter entity:

<!ENTITY % extParam SYSTEM "http://simonstl.com/xml/extparam.pen">

could become:

<ENTITY NAME="extParam" TYPE="%" HREF="http://simonstl.com/xml/extparam.pen"/>

Usage of entities could remain the same - &entityname; for general entities and %entityname; for parameter entities.
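The three entity forms above could be mapped back to XML 1.0 declarations along the following lines. This is a sketch under stated assumptions rather than a defined algorithm: the DTD wrapper element and the function name entity_to_xml10 are hypothetical, an entity carrying an HREF attribute is emitted with the SYSTEM keyword that XML 1.0 uses for external entities, and element content is flattened to its text, which would lose markup in entities that contain more than a content model.

```python
import xml.etree.ElementTree as ET

# The sample entities from this section, wrapped in a hypothetical DTD root.
ENTITIES = """
<DTD>
<ENTITY NAME="GenSample">This is a sample entity.</ENTITY>
<ENTITY NAME="ParamSample" TYPE="%"><CONTENTMODEL>IMAGE,CAPTION?</CONTENTMODEL></ENTITY>
<ENTITY NAME="extParam" TYPE="%" HREF="http://simonstl.com/xml/extparam.pen"/>
</DTD>
"""

def entity_to_xml10(entity):
    """Render one ENTITY element as an XML 1.0 entity declaration."""
    name = entity.get("NAME")
    # TYPE="%" marks a parameter entity; anything else is a general entity.
    percent = "% " if entity.get("TYPE") == "%" else ""
    href = entity.get("HREF")
    if href is not None:
        # External entity: XML 1.0 requires a SYSTEM (or PUBLIC) identifier.
        return '<!ENTITY %s%s SYSTEM "%s">' % (percent, name, href)
    # Internal entity: flatten text content (quotes are not escaped here).
    value = "".join(entity.itertext()).strip()
    return '<!ENTITY %s%s "%s">' % (percent, name, value)

root = ET.fromstring(ENTITIES.strip())
declarations = [entity_to_xml10(e) for e in root.findall("ENTITY")]
print("\n".join(declarations))
```

A fuller mapping would need to serialize nested markup and escape quotation marks in entity values; both are left out here to keep the sketch short.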

6.0 Proposed Syntax

All XML documents created under this syntax must be well-formed. Comments are allowed, but the use of processing instructions, CDATA sections, and other content outside of elements, attributes, and entities is discouraged.

6.1 XML declaration

Documents using this DTD representation would need very little modification. It might be useful to add a field to the XML declaration, perhaps named dtdrep, to indicate which kind of DTD the document uses. dtdrep='xmldtd' could indicate the XML 1.0 representation, and dtdrep='xmldoc' could indicate the representation proposed in this document. A document might even use both representations, in which case dtdrep='xmlall' could be used.
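A processor could read such a field before handing the document to a parser. The sketch below is illustrative only: the XML 1.0 declaration grammar allows just version, encoding, and standalone, so a conforming XML 1.0 parser would be expected to reject dtdrep, and the declaration is therefore inspected here with a regular expression. The function name dtd_representation and the choice of 'xmldtd' as the default when the field is absent are assumptions.

```python
import re

# Matches name='value' or name="value" pairs inside the XML declaration.
PSEUDO_ATTR = re.compile(r"""(\w+)\s*=\s*(['"])(.*?)\2""")

def dtd_representation(xml_declaration, default="xmldtd"):
    """Return the value of the hypothetical dtdrep field, or the default."""
    match = re.match(r"<\?xml\s+(.*?)\?>", xml_declaration.strip(), re.S)
    if match is None:
        return default
    fields = {name: value
              for name, _quote, value in PSEUDO_ATTR.findall(match.group(1))}
    # Assumption: documents without dtdrep use the XML 1.0 representation.
    return fields.get("dtdrep", default)

print(dtd_representation("<?xml version='1.0' dtdrep='xmldoc'?>"))
print(dtd_representation('<?xml version="1.0"?>'))
```

Defaulting to the XML 1.0 representation keeps existing documents, which carry no dtdrep field, working unchanged.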

6.2 DOCTYPE declaration

DOCTYPE declarations become an area of contention for both external and internal DTDs. The <!DOCTYPE notation could continue as a holdover from XML 1.0, or be replaced by a <DOCTYPE> element that either contained the declarations directly or pointed at the external DTD using notation consistent with XLink (the href attribute).

Documents with internal DTD subsets would need to decide how to represent their DTD. If a DOCTYPE element were used to represent the internal subset, that element would itself need to be declared in the DTD, most likely with a content model of ANY to avoid a complex declaration.
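The two uses of a DOCTYPE element described in this section could be told apart roughly as follows. This is a sketch only: the function name locate_dtd, the example URL, and the rule that an href attribute takes precedence over inline declarations are assumptions, since the proposal does not yet define this behavior.

```python
import xml.etree.ElementTree as ET

def locate_dtd(doctype):
    """Classify a DOCTYPE element as pointing outward or holding declarations."""
    href = doctype.get("href")
    if href is not None:
        # Assumption: an href attribute means the DTD lives in another document.
        return ("external", href)
    # Otherwise the child elements are taken to be the internal declarations.
    return ("internal", [child.tag for child in doctype])

# Hypothetical external reference; the URL is a placeholder.
external = ET.fromstring('<DOCTYPE href="http://example.com/figure-dtd.xml"/>')

# Hypothetical internal subset holding one declaration in the proposed syntax.
internal = ET.fromstring(
    '<DOCTYPE><ELEMENT TAG="FIGURE">'
    '<CONTENTMODEL>IMAGE,CAPTION?</CONTENTMODEL></ELEMENT></DOCTYPE>'
)

print(locate_dtd(external))
print(locate_dtd(internal))
```

How a document that combines an external reference with an internal subset should be handled is left open here, as it is in the text above.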

6.3 Entity Declarations

[Undeveloped - for placement only]

6.4 Element Declarations

[Undeveloped - for placement only]

6.5 Attribute Declarations

[Undeveloped - for placement only]

7.0 Remaining Issues

A DTD for DTDs?
<![CDATA[ in DTDs?
Whitespace?
Namespaces - should these ALL be prefixed with xml: ? Use a particular namespace?
Many others

Comments to: Simon St.Laurent (simonstl@classic.msn.com)