Archives by Date


20 October 2000 - What to convert to XHTML first

While developers are starting to poke at XHTML, most existing Web sites already have enormous quantities of information stored in HTML which may or may not be XHTML-friendly. Worse, pages generated by CGI, ASP, or other scripting technologies may not be easy to change. Where should developers start?

In general, there are two rules that can provide some guidance. Content that may need to be delivered in multiple formats can take enough advantage of XHTML to justify the cost. Some content is simply easy to convert to XHTML: simple documents where Tidy can do all of the work, or new documents where using XHTML from the start costs less than converting older HTML.

The biggest advantage XHTML 1.0 gives developers is the ability to repurpose content using XML tools - you can, for instance, create transformations from XHTML to Wireless Markup Language (WML) using XSLT stylesheets or DOM scripts. If you have documents for which delivery in multiple formats is critical, you may want to consider storing them in XHTML, or even generating XHTML from custom XML vocabularies.
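As a rough illustration of repurposing with a DOM-style script, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The mapping (each h1 opens a new card, each p is copied as plain text) and the function name xhtml_to_wml are assumptions made up for this example; a real WML deck would need its own DOCTYPE and far more careful handling of content.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def xhtml_to_wml(xhtml_text):
    """Convert headings and paragraphs of an XHTML page to a WML-like deck.

    Hypothetical mapping: every h1 opens a new <card>, every p is copied
    into the current card as plain text. Everything else is ignored.
    """
    root = ET.fromstring(xhtml_text)
    wml = ET.Element("wml")
    card = None
    for elem in root.iter():  # document order
        if elem.tag == XHTML_NS + "h1":
            title = "".join(elem.itertext()).strip()
            card = ET.SubElement(wml, "card", {"title": title})
        elif elem.tag == XHTML_NS + "p" and card is not None:
            p = ET.SubElement(card, "p")
            p.text = "".join(elem.itertext()).strip()
    return ET.tostring(wml, encoding="unicode")

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>News</title></head>
  <body><h1>Headline</h1><p>First <em>story</em>.</p></body>
</html>"""

deck = xhtml_to_wml(page)
print(deck)
```

The point is only that a well-formed source lets a few lines of generic XML code find the pieces worth delivering elsewhere. An XSLT stylesheet could express the same mapping declaratively.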

Converting documents because they're 'easy' is pretty subjective, since everyone seems to have a different definition of 'easy.' Simple HTML, where layout is mostly headlines and paragraphs, can generally be converted with very few blips. Complex layouts using tables, GIFs, and the occasional creative whitespace hack can be much more difficult, especially if you feel constrained to preserve the exact look of the original document in every browser environment it served. (Recent browsers make XHTML work much easier.)

New projects can benefit immediately from XHTML's cleaner structures. The discipline of cleanly nested structures will have an effect on the level of discipline in code used to generate documents, and may help Web developers build more maintainable projects. (Yes, I know that's likely a dream.) Developers using client-side scripting to build dynamic HTML sites will also find XHTML easy to work with. Many of these developers have already adopted some of XHTML's strictures to mark structures more precisely.

Even projects which won't benefit immediately may want to transition to XHTML for new work - if nothing else, it means that future transitions should be much simpler.


13 October 2000 - Finding your way among XHTML specs, Part II

While figuring out XHTML 1.0 can be fairly difficult for many developers, XHTML 1.1 and its likely successors demand a lot more learning, at least for the core of developers who want to take advantage of its capabilities for extending the vocabularies used in XHTML documents.

XHTML 1.1 still uses the familiar HTML vocabulary, though its main thrust moves far more deeply into XML and the extensibility promised by the Extensible Markup Language. While XHTML 1.0 described itself as "Extensible Hypertext Markup Language", it did very little to live up to the promise of extensibility, leaving that to non-standard implementations (described in Section 3.1.2 of the XHTML 1.0 specification) and future drafts.

The W3C itself has at least three specifications that are prime candidates as extensions to XHTML: MathML, which defines markup for representing mathematical equations; SMIL, the Synchronized Multimedia Integration Language; and SVG, Scalable Vector Graphics, which allows developers to describe images as vectors rather than as bitmaps. All three of these languages are built on an XML base which can be easily mixed with XHTML.

Making this integration work requires the tools provided in Namespaces in XML, the DTDs of XML 1.0, and eventually XML Schemas. These standards are not well-known for their ease of use. Namespaces remain burdened with controversies, DTDs are criticized as unnecessarily complex for their capabilities, and XML Schemas have both fans and opponents.

These heavy requirements may split Web developers into two classes of markup creators: those whose work sticks to established vocabularies, using documented combinations of XHTML and other vocabularies, and those who create their own vocabularies. The second group will need to know the ins and outs of vocabulary development as well as the integration mechanisms described in Modularization of XHTML, while the first may stick with the definitions provided by XHTML 1.1 and XHTML Basic.

Eventually, XHTML may also come to include XLink and XPointer as critical specifications for creating hypertext links. For now, XHTML developers may continue to treat them as exciting curiosities rather than critical tools, but XHTML development will likely come to include far more than today's HTML.


9 October 2000 - Finding your way among XHTML specs, Part I

The W3C has created a number of different Recommendations which rely on, cross-reference, and influence each other. Developers trying to work with XHTML may find that they need to learn Cascading Style Sheets (CSS) as well, while some developers may need to extend their understanding of the Document Object Model (DOM), and others may be exploring Namespaces in XML.

In this two-part tip, we'll start by looking at XHTML 1.0 and its supporting specs, and then move on in the next tip to XHTML 1.1 and the new features and specs developers may (or may not) need to learn.

It's possible to use XHTML 1.0 exactly the same way as HTML 4.0. Most HTML developers never learned to read a Document Type Definition (DTD), and built documents using their understanding of document structures gleaned from tools, experience, and reference material rather than by reading the formal outlines presented in the HTML 4.0 DTDs.

Similarly, there's no requirement that developers understand DTDs for them to use XHTML 1.0. While strictly conforming validating XHTML processors will check the document structures against those DTDs, developers can base their document structures on reference material that is friendlier to humans and included in many XHTML books.

Developers do need to understand the rules for making their XHTML documents into well-formed XML. Nesting tags properly and using the empty tag syntax appropriately will take care of most of these needs, and the XHTML 1.0 specification is a fairly self-contained description of them.
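One quick way to see these rules in action is to hand markup to any XML parser and see whether it complains. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is just an illustration, and a check like this only tests well-formedness, not validity against the XHTML DTDs.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML habits: an unclosed <br> and overlapping <b>/<i> tags.
html_style = "<p>Line one<br>Line two <b><i>bold italic</b></i></p>"

# The same content with the empty tag syntax and proper nesting.
xhtml_style = "<p>Line one<br />Line two <b><i>bold italic</i></b></p>"

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```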

It's probably a very good idea for developers planning to make the transition to XHTML to become acquainted with Cascading Style Sheets, if they haven't already. CSS gives developers very flexible control over how their information is presented while requiring very little modification of the (X)HTML document itself. Developers planning projects that will adhere to the Strict DTD will probably need to use CSS if their formatting plans go beyond the extremely basic.

Developers already using the Document Object Model (DOM) will find themselves at home in XHTML, which uses the same DOM features as HTML. DOM programmers can also apply their skills to generating and manipulating XHTML on the server, using XML parsers to read and modify documents before sending them to users. Many of the DOM implementations (notably Internet Explorer 5) currently include extensions which may not work in strictly-conforming environments, but the basic rules and structures are the same.
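For a feel of what server-side DOM work can look like, here is a small sketch using xml.dom.minidom from Python's standard library. The task (tagging links that start with http:// with a class attribute) and the function name mark_external_links are invented for the example; the calls used (getElementsByTagName, getAttribute, setAttribute) are the same DOM Level One methods browser scripts use.

```python
from xml.dom.minidom import parseString

def mark_external_links(xhtml_text):
    """Add class="external" to every <a> whose href starts with http://."""
    doc = parseString(xhtml_text)
    for a in doc.getElementsByTagName("a"):
        if a.getAttribute("href").startswith("http://"):
            a.setAttribute("class", "external")
    # Serialize just the root element back to markup.
    return doc.documentElement.toxml()

snippet = '<p><a href="http://www.w3.org/">W3C</a> and <a href="/tips">tips</a></p>'
result = mark_external_links(snippet)
print(result)
```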

The DOM also comes in multiple levels (1, 2, and 3 so far) and is being broken into modules. DOM Level One covers an XML Core and HTML Extensions, while DOM Level Two covers far more and has been broken into modules. DOM Level Three is still getting started, but it finally addresses the key issues of loading and saving documents from a DOM tree. Developers can pick and choose which of these pieces they need, though only DOM Level One has widespread implementation at present.

XHTML 1.0 raises (in Section 3.1.2) the possibility of mixing different XML vocabularies (like MathML, SVG, or SMIL) with XHTML, and notes the use of XML Namespaces to identify different vocabularies. These multi-vocabulary documents are not strictly conforming, however, and it will likely take XHTML Modularization's arrival to make them generally useful. Developers who want to get a head start on vocabulary-mixing may want to take a look at the Namespaces in XML specification, but most developers can stick with XHTML 1.0's single default namespace declaration.

Developers who want to move forward into mixed vocabularies (or create their own vocabularies) have a lot more to deal with, as we'll explore in the next tip.


3 October 2000 - Dealing with Markup Characters, Part II: CDATA sections

While replacing markup characters with XML's built-in entities may work in many cases, it still leaves a few difficult situations unresolved. The scripting engines built into most Web browsers won't accept scripts that substitute entities for <, >, and &. Fortunately, CDATA sections offer an XML-safe alternative that works in some HTML situations.

CDATA sections are an XML feature that tells the parser to ignore all occurrences of markup between the initial <![CDATA[ and the closing ]]>. These can be included in scripts while hidden by script comments:

<script type="text/javascript">
//<![CDATA[
...
if (i<12) {
}
...
//]]>
</script>

If your script includes a ]]>, write it as ] ]> or ]] > to avoid ending the CDATA section prematurely. In theory, you can also use CDATA sections to escape markup anywhere in a document, in markup examples, for instance:

<code><![CDATA[
<example>This is an example. You should be able to see the start and end tags here.</example>
]]></code>

Unfortunately, this isn't likely to work on regular HTML browsers - they won't know what to do with the opening <![CDATA[, may display the closing ]]>, and will probably interpret the markup characters included in the section. Someday...
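XML tools, on the other hand, handle this without fuss. A minimal sketch with Python's standard xml.etree.ElementTree (any conforming XML parser should behave the same way) shows the markup inside the CDATA section arriving as plain text rather than as a child element:

```python
import xml.etree.ElementTree as ET

source = (
    "<code><![CDATA["
    "<example>This is an example.</example>"
    "]]></code>"
)

code = ET.fromstring(source)

# The parser reports the CDATA contents as ordinary text...
print(code.text)
# ...and finds no child elements inside <code>.
print(len(code))
```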


26 September 2000 - Dealing with Markup Characters, Part I: Entities

XHTML is less forgiving of stray markup characters than HTML, and provides additional mechanisms to help developers 'hide' these characters from parsers. Some of those mechanisms will be familiar to HTML developers, others existed in HTML but were obscure, and one is new to XHTML.

Just like HTML, XHTML provides built-in entities for representing markup characters inside of markup without disrupting parsing:

Entity Reference    Character Represented
&amp;               Ampersand (&)
&lt;                Less Than (<)
&gt;                Greater Than (>)
&apos;              Apostrophe (')
&quot;              Quote (")

HTML parsers were fairly relaxed about letting ampersands and greater than symbols appear in some places within a document, especially within URLs, but XML parsers are much less forgiving and XHTML requires conformance to the XML standard.

To stay out of trouble the simplest possible way, use the entity references every time you need to use these characters as something other than markup. (You only need to use the references for quotes and apostrophes inside of attribute values, and then only to avoid conflicting with the quotes containing the attribute value.)

For example, we'll include the XML document below in an XHTML document:

<example>This document contains &lt;, represented by an &amp;lt; entity.</example>

The XML document contains element markup (<, >), and some trickier bits of escaping designed to leave &lt; in the document (&amp;lt;). The element markup is easily handled with &lt; and &gt;. If we want to keep the XML document looking as is for an example in the XHTML version, we'll have to replace the initial ampersands of both entity references with another ampersand reference, as &amp;lt; and &amp;amp;lt;.

If we wanted to put this into a code element in XHTML, it might look like:

<code>&lt;example&gt;This document contains &amp;lt;, represented by an &amp;amp;lt; entity.&lt;/example&gt;</code>

In the browser, all of this would be presented as:

<example>This document contains &lt;, represented by an &amp;lt; entity.</example>

To make this work, you just need to apply the same techniques which were available in HTML and do it consistently.
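Doing this escaping by hand gets tedious, and most XML toolkits include a helper for it. As a sketch, Python's standard xml.sax.saxutils.escape replaces &, <, and > with their entity references (ampersands first, which is what produces the doubled-up &amp;amp;lt;). Parsing the result gives back the original XML document as text:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The XML document we want to show as an example.
xml_example = (
    "<example>This document contains &lt;, "
    "represented by an &amp;lt; entity.</example>"
)

# Escape it once and wrap it in a code element.
xhtml = "<code>" + escape(xml_example) + "</code>"
print(xhtml)

# An XML parser turns the references back into the original characters.
shown = ET.fromstring(xhtml).text
print(shown)
```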


22 September 2000 - Easily Manipulated Content

A lot of Web developers take a glance at XHTML 1.0 and see a lot more work, new rules to adhere to, and new compatibility problems to deal with. It's true that XHTML is more work in the short term, but beyond that initial work lies promise that goes well beyond the limited vision of 'Dynamic HTML' that currently defines the outer limits of HTML user interface design.

While XHTML may seem like a recipe created by neat freaks, its insistence on orderly and predictable structures has enormous benefits. As Dynamic HTML developers have learned over the years, it's much easier (and far more predictable) to manipulate document structures when the boundaries of those structures are clear.

A lot of Web developers don't want to manipulate their documents. They see their HTML documents as carefully crafted and set in stone, with no need for scripting or other supplemental poking. After all the work that goes into layout, further manipulation seems like the last thing anyone would want. It adds to the maintenance cost of a site, it requires different groups of developers to coordinate their efforts, and it often produces results that satisfy none of the participants.

Web development isn't just about producing documents, though. Web sites send information to customers, some of whom keep track of that information or process it further. Search engines are an obvious example of this processing, but tools that can take HTML and format it for print or agents that seek out information automatically are just a few of the additional possibilities that become easier to implement when document structures are coherent and clear.

XHTML also opens new vistas in document storage, management, and reuse. The clean structures make it easy to reference content by its location in a document, simplifying cross-references. Sophisticated document storage systems can give different pieces of XHTML documents to different editors for simultaneous (but non-conflicting) editing. Sites that need to present their information in other formats, like Wireless Markup Language (WML), can define transformations of XHTML documents without having to run everything through a database.

Maybe 'no pain, no gain' is a difficult thing to sell to Web developers who won't see much benefit from XHTML today, but the possibilities XHTML opens are real. XML developers are already taking advantage of many of these techniques, paving the way with tools and experience the Web development community can use.


20 September 2000 - A Closer Look at Tidy

HTML Tidy (often called Tidy) is an HTML and XHTML cleanup tool created by one of HTML's leading lights, Dave Raggett of the W3C. Originally written to help developers create valid HTML, it now also helps developers create valid XHTML.

Tidy is a command-line utility that provides a number of options for cleaning up HTML and XHTML, as well as a few extra features like slide creation based on heading levels. It cleans up start and end tags, quotes attributes, adds end tags where appropriate, and sorts out some common cases of badly structured HTML, like lists without containers and horizontal rules stuck inside of headlines. (I've made that last mistake for years.)

Tidy can also clean up formatting-oriented code, replacing it with Cascading Style Sheets when appropriate. It can work around some flavors of non-HTML markup, including Active Server Pages and PHP. Most important for XHTML, it offers an '-asxml' option that makes Tidy generate XHTML.

Tidy also provides a configuration file that makes it easy to set up Tidy once and not have to use command line options repeatedly. Using these options, you can make XHTML that's ready for non-validating XML parsers, complete with numeric character references replacing the named entities XML parsers may or may not process.
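To make the idea of swapping named entities for numeric character references concrete, here is a small Python sketch. It is not how Tidy does it, just an illustration built on the standard library's html.entities table; the function name numeric_entities is made up. XML's five built-in entities are left alone, since every XML parser understands them.

```python
import re
from html.entities import name2codepoint

# Entities every XML parser knows without reading a DTD.
XML_BUILT_INS = {"amp", "lt", "gt", "apos", "quot"}

def numeric_entities(markup):
    """Replace named entities such as &nbsp; with numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILT_INS or name not in name2codepoint:
            return match.group(0)  # leave built-ins and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

converted = numeric_entities("caf&eacute;&nbsp;&copy; A &amp; B")
print(converted)
```

A non-validating parser that never reads the XHTML DTD has no idea what &eacute; means, but it can always resolve &#233;.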

Some HTML is just too broken for Tidy to handle. While Tidy will do its best, and report issues it's not certain it handled properly, some issues will raise errors rather than warnings, requiring manual intervention.

Tidy is written in C and available under an open source license, and compiled versions are available for a large number of platforms. A complete list of platform-specific binaries, including cases where Tidy has been built into HTML development environments, is available.

A Java port is also available.

Dave Raggett does ask that developers who want to say thanks for Tidy send him a postcard from their home area - the mailing address is on the main Tidy page.


19 September 2000 - Using HTML editors for XHTML

While many developers are used to HTML editing systems having their own style of output, XHTML requires a somewhat higher degree of control over that output. During the transition period from HTML to XHTML, developers will need to monitor their usage of HTML editors closely.

More and more HTML editors are producing cleaner code - including start and end tags consistently, offering users a choice between Strict, Transitional, Frameset, and 'anything goes', and supporting key technologies like Cascading Style Sheets (CSS).

These improvements make it much easier to produce consistent HTML which can be converted to XHTML without loss, but there hasn't really been a rush to release new versions of software to support XHTML, especially as a core function. Some tools, like Evrsoft's 1st Page 2000, include the W3C's Tidy as an auxiliary function, but still focus on HTML internally rather than XHTML.

Tidy and tools like it are going to provide a bridge between HTML and XHTML for a long while. Even as more software companies integrate XHTML support with their products, users probably won't move as quickly.

Web developers can take advantage of Tidy where it is built into software in order to skip a step, but will probably have to keep a copy of Tidy around for handling information coming from sources they don't control.

Although one of the early complaints about XHTML was that it makes life for hand-coders more difficult, requiring them to keep track of balanced start and end tags manually, hand-coders may actually have an easier time producing clean XHTML than those using HTML editors, at least for the near future.

Developers looking for a pure XHTML solution may want to look at Mozquito Factory, an XHTML editor with a focus on creating HTML-compliant smart forms.


15 September 2000 - Style sheets from HTML to XML and back

As the W3C works to prune HTML of its formatting-oriented past, the tool of choice for formatting is becoming Cascading Style Sheets (CSS). While weak (though slowly improving) implementations have held CSS back, the W3C is pressing forward with CSS, extending it to give designers finer control over the look and feel of their pages. At the same time, the XML world has developed a very different approach to style sheets in XSL, the Extensible Stylesheet Language, which may also have something to contribute to XHTML developers.

Cascading Style Sheets allow developers to assign formatting properties to particular elements and combinations of elements within documents. CSS makes it easy to tell Web browsers things like "make all h1 elements 24pt sans-serif type in bold purple" or "make all list items nested inside list items italic". CSS provides a set of rules for allowing these statements about formatting to interact, and for allowing multiple sets of statements to interact.

While complete CSS Level 1 implementations are only starting to appear (the latest versions of Mozilla, Opera, and Internet Explorer 5 offer substantial support), CSS is critical to the W3C's hopes of making the Strict flavor of XHTML dominate as XHTML moves forward into XHTML 1.1 and XHTML 2.0. Without real CSS implementations, designers aren't going to be able to create the pages customers demand without falling back on the 'legacy' formatting tools. In the case of frames, where the W3C's proposed replacement would rely on CSS positioning, this is especially critical.

The W3C is building CSS to support both sides of XHTML's heritage - both HTML and XML. On the XML side, however, a different approach to style may offer tools to XHTML developers. Extensible Stylesheet Language (XSL) doesn't just describe what different elements and element combinations should look like. Instead, it provides rules for transforming source documents into result documents, and a result document vocabulary which is strictly formatting-oriented, called formatting objects.

While formatting objects are still in development, the tool for moving from a source document into a result document is available today. Called XSLT (XSL Transformations), it may prove useful to developers who need to convert XML documents into XHTML. At the same time, it could be applied to XHTML documents, converting them into formatting objects or other XML vocabularies. While XSLT isn't as familiar as the Document Object Model (DOM), it has a growing user base and more and more XSLT material is becoming available every day.


14 September 2000 - The risks of using XML tools with XHTML

Apart from the difficulties developers may encounter in learning how to apply tools built with XML in mind, there are a few potential tricks brought on by the XML approach that might throw XHTML developers.

While XML includes all of the parts most HTML developers are familiar with - elements, attributes, comments, and the DOCTYPE declaration - it also includes some extra parts like stronger use of the DTD, processing instructions, and different priorities for preserving information.

XML 1.0 allows, and in some ways expects, that parsers will transmit only a 'finished' version of a document to an application. This 'finished' version may come without comments, will have entities fully expanded (to single characters in the XHTML case), and may include default values for unspecified attributes, but no DOCTYPE information.

Applications that save a document back out as an XML document may be saving a document that contains the same information, but in a very different form from what was supplied. XML processors may also not know to leave a space after the element name in an empty tag (like <br />), as this workaround is specific to the needs of HTML browsers, not XML processing.
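A small experiment makes the point. The sketch below pushes a document through Python's standard xml.etree.ElementTree and back out; other toolkits will differ in the details, so treat this as one example of the behavior rather than a rule. With the parser's default settings, the DOCTYPE declaration and the comment are both gone from the saved copy, and the character reference &#38; comes back as &amp;.

```python
import xml.etree.ElementTree as ET

original = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    "<html><head><title>Round trip</title></head>"
    "<body><!-- editorial note --><p>Fish &#38; chips</p></body></html>"
)

# Parse, then immediately serialize again.
root = ET.fromstring(original)
saved = ET.tostring(root, encoding="unicode")

print(saved)
```

The information a browser displays is the same, but anything that depended on the DOCTYPE or the comment has quietly disappeared.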

As more and more XHTML developers start to take advantage of the toolkits already built for processing and storing XML, the life cycle of an XHTML document will become more important. While Web developers have grown used to (and sometimes weary of) the changes that HTML editors will make in documents, the prospect of changes happening in other environments - from Save As... in a browser to incoming and outgoing information in an XML-based data storage system - may not be so appealing.

Even within a browser context, treating XHTML as XML may require some extra work. Parsers have enough latitude that some of them (non-validating parsers) can ignore external resources, like the DTDs used by XHTML to assign default values to attributes and to define character entities. While some applications (notably Mozilla) are making an effort to address these issues in software, many applications, especially home-grown applications, may not. All of this is written into the XML 1.0 specification, but many of these details aren't well-understood.

These issues don't appear in every XML processor, but they appear in many, without ever violating even the spirit of XML 1.0. 'Round-tripping' a document through a parser and back can be a tricky business, and there is already at least one guide to what a 'safe' subset of XML might look like. While Common XML may be useful for developers creating XML applications from scratch, the recommendations in that document don't fit with the tools XHTML has already used. (They may serve as useful warnings, however.)

XHTML developers who want to use XML tools will have to carefully examine the tools to ensure that they support all of the features XHTML demands for strict conformance. This may mean extra work in making XML tools XHTML-safe, as the 4xt.org group has done with XT, a popular XSLT processor, or it may mean adding a layer of code that makes sure DOCTYPE declarations appear in their proper place and that empty tags are presented properly.
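Such a layer of code can be quite thin. The following Python sketch is one hypothetical version: the function name make_browser_safe and the choice of the Strict DOCTYPE as a default are assumptions, and the regular expression is deliberately naive (it would also touch a literal "/>" appearing in text content), so it is a starting point rather than a finished tool.

```python
import re

XHTML_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)

def make_browser_safe(serialized, doctype=XHTML_STRICT_DOCTYPE):
    """Patch up XML-tool output so it looks like hand-written XHTML."""
    # Put a space before "/>" when the tool wrote "<br/>" instead of "<br />".
    fixed = re.sub(r"(?<!\s)/>", " />", serialized)
    # Restore the DOCTYPE declaration if the tool dropped it,
    # keeping any XML declaration first in the document.
    if "<!DOCTYPE" not in fixed:
        if fixed.startswith("<?xml"):
            end = fixed.index("?>") + 2
            fixed = fixed[:end] + "\n" + doctype + fixed[end:]
        else:
            fixed = doctype + "\n" + fixed
    return fixed

out = make_browser_safe("<html><body><p>a<br/>b</p></body></html>")
print(out)
```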

Although XML and XML toolkits have an enormous amount to offer XHTML developers, XHTML developers need to make sure that the toolkits fit their needs completely.

For more general information on XML interoperability issues, see http://www.simonstl.com/articles/interop/.

And yes, I am the editor of that Common XML document, but it derives from the work of the SML-DEV mailing list.


13 September 2000 - Using SAX for lightweight XHTML processing

While the Document Object Model (DOM) is very useful for kinds of processing that require complete access to the entire document at the same time, there are many cases where documents are being created, filtered, or transformed but developers never need access to the entire document at once.

Recognizing the need for a lightweight API, David Megginson and the XML-Dev mailing list created the Simple API for XML (SAX), a Java event-based interface that uses an XML parser to 'read' a document to an application, announcing events like the start of an element, the appearance of text, and the end of an element.

Initially devised as a standard for communications between XML parsers and the applications using XML parsers, the SAX standard quickly spread and developed new variations. By creating programs that both accepted SAX events from the parser and transmitted SAX events to the application, developers could create 'filters' that extracted information from documents, restructured information, or deleted it entirely. By creating small applications that only accepted SAX events and wrote them back out as XML documents, developers could then use SAX events to create documents.
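Here is what such a filter can look like in Python, whose standard library includes a SAX implementation. XMLFilterBase and XMLGenerator are real classes in xml.sax.saxutils; the DropElementFilter class and the strip_element helper are invented for this sketch. The filter sits between the parser and the writer, passing every event along except the start and end of one named element.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class DropElementFilter(XMLFilterBase):
    """Pass SAX events through, dropping the tags of one element name.

    The element's contents are kept; only its start and end events vanish.
    """
    def __init__(self, parent, name):
        super().__init__(parent)
        self._name = name

    def startElement(self, name, attrs):
        if name != self._name:
            super().startElement(name, attrs)

    def endElement(self, name):
        if name != self._name:
            super().endElement(name)

def strip_element(xhtml_text, name):
    out = io.StringIO()
    reader = DropElementFilter(xml.sax.make_parser(), name)
    # XMLGenerator turns the events it receives back into markup.
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xhtml_text.encode("utf-8")))
    return out.getvalue()

cleaned = strip_element('<p><font size="2">Hi <b>there</b></font></p>', "font")
print(cleaned)
```

At no point does the whole document sit in memory as a tree; each event is handled and forgotten.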

SAX has developed into a general-purpose API for handling XML-based information, including XHTML. It has spread beyond its Java roots to Python, Perl, and C++. Microsoft's latest releases of their MSXML parser include a SAX2 implementation that can be used from C++, Visual Basic, and scripting languages.

Developers who want to create XHTML from within programs now have an alternative to writing large flows of text. SAX allows developers to receive and describe documents as discrete events, passing off syntax reading (parsing) and syntax writing (XML output) to other tools, all while avoiding the overhead of large document trees in memory. You still need to make sure that elements begin and end in the proper order, but it becomes easier to abstract documents a little further from the actual markup.
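As a sketch of describing a document as events, the Python example below fires SAX events at the standard library's XMLGenerator, which handles the angle brackets and escaping. The build_page function and its arguments are hypothetical. The program is still responsible for calling the end events in the right order.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

NO_ATTRS = AttributesImpl({})

def build_page(title, paragraphs):
    """Write a tiny XHTML page by sending SAX events to a generator."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")

    def element(name, text):
        # One start event, one text event, one end event.
        gen.startElement(name, NO_ATTRS)
        gen.characters(text)
        gen.endElement(name)

    gen.startDocument()
    gen.startElement(
        "html", AttributesImpl({"xmlns": "http://www.w3.org/1999/xhtml"})
    )
    gen.startElement("head", NO_ATTRS)
    element("title", title)
    gen.endElement("head")
    gen.startElement("body", NO_ATTRS)
    for text in paragraphs:
        element("p", text)
    gen.endElement("body")
    gen.endElement("html")
    gen.endDocument()
    return out.getvalue()

page = build_page("Tips", ["1 < 2", "Fish & chips"])
print(page)
```

Note that the text is passed in raw; the generator takes care of writing < and & as entity references.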

SAX is now in its second generation, called SAX2. For more on SAX, see http://www.megginson.com/SAX/.


12 September 2000 - Taking the DOM beyond Dynamic HTML

Developers who are familiar with dynamic HTML - and in particular the W3C DOM - have a head start on creating and processing XML and XHTML. The Document Object Model (DOM) provides an abstract view of an XML, HTML, or XHTML document which can be manipulated using various scripting and programming languages.

Most dynamic HTML developers are familiar with JavaScript or VBScript, but the W3C DOM provides JavaScript/ECMAScript and Java bindings, along with a CORBA IDL which makes it easier to port the DOM to other environments.

While the DOM is missing a lot of parts (like a common interface for loading documents and saving them back out), it provides a foundation which developers can take from Web browsers and JavaScript to back-end servlets written in Java, Active Server Pages (ASP) written in VBScript, or COM implementations written using C or Delphi, to name just a few.

The DOM represents a document as a set of nodes - a document node which contains all of the rest of the nodes, and then element and attribute nodes for storing element and attribute content, text nodes for containing text, and comment nodes containing comments. Mapping an XML or XHTML document to a DOM is pretty straightforward, and generally much easier than mapping old HTML to a DOM.
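To see the node structure directly, the following Python sketch walks a small fragment with xml.dom.minidom from the standard library and lists what it finds. The outline function is made up for the illustration. Notice that attributes hang off their element rather than showing up among its child nodes.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

NODE_NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}

def outline(node, depth=0):
    """Return indented lines describing a node and everything below it."""
    kind = NODE_NAMES.get(node.nodeType, "other")
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are reached through the element, not through childNodes.
        attrs = sorted(node.attributes.keys())
        label = node.tagName + (" " + str(attrs) if attrs else "")
    else:
        label = repr(getattr(node, "data", ""))
    lines = ["  " * depth + kind + ": " + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

doc = parseString('<p class="intro">Hello <em>world</em><!-- hi --></p>')
lines = outline(doc.documentElement)
print("\n".join(lines))
```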

If you've developed dynamic HTML before, even if you were using an environment like Internet Explorer, which provides an enormous number of non-standard extensions to the DOM, you already have experience in working with this key abstraction of a document. The skills you have today for programming interfaces in browsers can be easily transferred to document creation, manipulation, and other processing including (of course) interface-building.

Many XML parsers, including Apache's Xerces, IBM's XML4J, Sun's Project X (now Apache's Crimson), and Microsoft's MSXML provide some level of support for the DOM. Recent generations of browsers, including Microsoft's Internet Explorer 5.x, Mozilla, and Netscape 6.x, provide solid or improving support for the DOM Level One.

Work on the DOM is continuing at the W3C. While Level One of the DOM has been complete for nearly two years, Level Two is currently a Candidate Recommendation and Level Three is just getting started. Each Level adds new material, rather than functioning as a version number.

The DOM is not without its critics, of course. Storing document trees in memory as interconnected objects can bring enormous overhead when large documents are being processed, and the DOM interface sometimes feels clunky to developers who find its 'document nature' counter-intuitive.

There are a number of alternatives to the DOM, including the Simple API for XML, a lightweight interface for processing XML (and XHTML documents) which doesn't create a document tree in memory. JDOM builds a tree, but provides an interface aimed specifically at Java programmers. There are a number of tools for data binding and mapping XML to object structures, though most of these go well beyond the needs of XHTML developers.


11 September 2000 - XML's perspective on information

For all of its limitations, HTML has done a remarkable job of presenting information in a form that both users and content developers can understand. Everything is a document, and described in terms of a document - text, headings, images, tables, etc. While the markup might be a little obscure sometimes, and can certainly get obfuscated, the general structure of documents made it fairly easy to figure out how to put information into an HTML document, and left an enormous amount of room to designers to present information in creative ways.

XML is a tool for creating documents, but these aren't necessarily documents in the HTML or paper senses of the word. XML documents are a series of characters that contains structured and labeled information, with a well-defined beginning and end. While these 'documents' can be used to convey traditional document-like structures, they can also convey programming object structures, database tables, lists of information, and nearly any other structure you can create with a computer.

XHTML takes advantage of XML's orderly structures and clear labels to clean up HTML a bit, but it's also preparing the way for a world in which developers - both application developers and Web document developers - can include their own structures and labels within an HTML framework, or even include HTML's structures and labels within whatever XML framework they come up with.

Developers moving to XHTML from HTML may want to take a closer look at the possibilities XML opens up, and consider how they might like to extend the familiar HTML vocabulary for their own projects. Creating a vocabulary doesn't magically create applications capable of processing that vocabulary, but it does make it possible to build more interesting applications that go beyond the traditional (perhaps eventually even legacy) 'Web browser'.


10 September 2000 - Comments and XHTML

Developers who have used comments in HTML as a place to hold information intended purely for human consumption will be happy to find that those comments work exactly the same way in XHTML that they worked in HTML. On the other hand, developers who have used comments to hide scripts and style sheets, or to pass information to programs, may find that it's time to update their documents.

XHTML comments look exactly like HTML comments:

<!--This is a comment. Heh heh. -->

Like HTML comments, XHTML comments can appear before the start tag of the html element, in text within the document, or after the end tag of the html element.

If the XML declaration is used, it should appear first in the document, and any comments should appear after it. If comments or whitespace appear before the XML declaration, it won't be recognized as the XML declaration, and a conforming XML parser will report an error rather than quietly skip it.
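One way to check the ordering rule is to hand both orderings to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is built on the expat parser; the helper name parses and the two sample documents are made up for illustration. Other parsers may word their complaints differently, but expat refuses the misordered document outright.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the document is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The XML declaration comes first, the comment after it.
declaration_first = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!-- comment after the declaration -->\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)

# A comment sits in front of the XML declaration.
comment_first = (
    b'<!-- comment before the declaration -->\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)

print(parses(declaration_first))  # True
print(parses(comment_first))      # False
```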

Developers who have been using comments to hide scripts and style sheets may encounter some problems. XML takes comments very seriously - XML parsers aren't even required to report the contents of comments to applications. That means that if you use the script-hiding technique shown below, your scripting code may simply disappear in certain XML-oriented applications:

<script type="text/javascript">
<!--
...
if (i<12) {
}
...
//-->
</script>

While the comments will keep Netscape 2.0 and earlier browsers from misinterpreting the < in the if statement as markup, XML parsers may discard the script contents entirely.
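To watch a script vanish, here is a small sketch using Python's standard xml.etree.ElementTree module as a stand-in for an 'XML-oriented application'. Its default tree builder simply drops comments, which is exactly the behavior XML permits. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

hidden_script = """<script type="text/javascript">
<!--
if (i<12) {
}
//-->
</script>"""

script = ET.fromstring(hidden_script)

# The default tree builder discards comments, so only the whitespace
# around the comment is left inside the script element.
remaining = "".join(script.itertext()).strip()
print(repr(remaining))  # ''

# Writing the element back out shows the code is gone for good.
print(ET.tostring(script, encoding="unicode"))
```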

Although XML parsers aren't commonly used in Web browsers (yet, at any rate), you may find yourself taking advantage of more and more XML tools, like content management systems and transformation engines, that weren't built for XHTML in particular - they just read it as XML.

XML offers a solution to this situation: CDATA sections. They allow you to mark text that shouldn't be parsed, and can be used inside script elements with script comments to keep them from interfering with the scripting engine:

<script type="text/javascript">
//<![CDATA[
...
if (i<12) {
}
...
//]]>
</script>

CDATA sections begin with <![CDATA[ and end with ]]>. If your script's code includes a ]]>, you'll need to break it up a little - ] ]> or ]] >, as seems appropriate. Otherwise, XML parsers will interpret the CDATA section as ending at the first ]]> they find, corrupting your script (by stripping the ]]>) and then leaving the second ]]> in the text or reporting a parsing error.

You can use the characters <, >, and & anywhere you like inside a CDATA section, so they're quite useful for more than script-hiding.
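The same kind of check shows CDATA sections doing their job. This is a sketch using Python's standard xml.etree.ElementTree module; the sample script and the variable names are invented for the example. The CDATA version keeps its code as ordinary text, while the same code without any protection isn't even well-formed XML.

```python
import xml.etree.ElementTree as ET

cdata_script = """<script type="text/javascript">
//<![CDATA[
if (i<12 && j>3) {
}
//]]>
</script>"""

bare_script = """<script type="text/javascript">
if (i<12 && j>3) {
}
</script>"""

# The parser hands the CDATA contents to the application as plain text.
script = ET.fromstring(cdata_script)
print(script.text)

# Without the CDATA section, the < and & characters are read as markup.
try:
    ET.fromstring(bare_script)
    bare_script_parses = True
except ET.ParseError:
    bare_script_parses = False
print(bare_script_parses)  # False
```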

Other tools that use comments, like server-side includes, may want to consider shifting to a processing instruction syntax (<? instead of <!--, and ?> instead of -->) if their templates need to pass through XML-oriented environments.


8 September 2000 - XHTML and the DOCTYPE declaration

Every XHTML 1.0 document is required to have a Document Type Declaration (DOCTYPE declaration) that indicates which of the three XHTML Document Type Definitions (DTDs) is used by the document. Each DTD - Strict, Transitional, and Frameset - has a DOCTYPE declaration that must appear to identify which type of XHTML is in use.

The DOCTYPE declaration must appear before the html element, but after the XML Declaration, if one appears. This means that

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
....

is legal, but the following two examples are not:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<!--this example is illegal.  The XML declaration is no longer at the start
of the document, so a conforming XML parser will reject it.-->
....

and:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--this example is illegal, and will not load-->
....
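If you'd rather let a parser settle the ordering question, the three arrangements can be tested directly. The sketch below uses Python's standard xml.etree.ElementTree module (built on expat); the constant names and the parses helper are illustrative, and the html element is left empty to keep things short. With expat, at least, both misordered documents are rejected.

```python
import xml.etree.ElementTree as ET

XML_DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n'
DOCTYPE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    b'   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
)
HTML_START = b'<html xmlns="http://www.w3.org/1999/xhtml">\n'
HTML_END = b'</html>'


def parses(document: bytes) -> bool:
    """Return True if the document is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# XML declaration, then DOCTYPE, then the html element.
legal = XML_DECLARATION + DOCTYPE + HTML_START + HTML_END
# DOCTYPE in front of the XML declaration.
doctype_first = DOCTYPE + XML_DECLARATION + HTML_START + HTML_END
# DOCTYPE after the html start tag.
doctype_inside_html = XML_DECLARATION + HTML_START + DOCTYPE + HTML_END

print(parses(legal))                # True
print(parses(doctype_first))        # False
print(parses(doctype_inside_html))  # False
```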

XHTML 1.0 provides three Document Type Declarations, one each for the Strict, Transitional, and Frameset DTDs. The contents of the Document Type Declaration must match the version in the specification - even matching case - except for the URL at the end. This URL must point to a copy of the XHTML 1.0 DTD, either the copy at the W3C or another copy that will be accessible to XML parsers processing the document.

For example, the XHTML 1.0 specification presents the Document Type Declaration for the Strict DTD as:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "DTD/xhtml1-strict.dtd">

Since the XHTML 1.0 specification is stored at http://www.w3.org/TR/xhtml1, the relative URL is acceptable in this context. However, since most documents created using XHTML won't be stored on the W3C's servers, developers should use:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Similarly, the DOCTYPE declaration for the Transitional DTD should be:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

And the DOCTYPE declaration for the Frameset DTD should be:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

If you examine the DTDs, this information is provided in the opening comments. Each DTD has a public identifier, which provides a network-independent way for software to identify the DTD, as well as a system identifier, the URL for reaching a copy of the DTD.

Whitespace in the DOCTYPE Declaration outside of the quoted contents will be normalized by the parser, so you can put this declaration on a single line or multiple lines as you find convenient.
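To see the public and system identifiers the way software sees them, you can ask a DOM implementation for them. This sketch uses Python's standard xml.dom.minidom module; the identifiers helper and the two sample documents are made up for the example. It also shows that the single-line and multi-line layouts of the declaration come out the same.

```python
from xml.dom import minidom

single_line = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)

multi_line = (
    '<!DOCTYPE html\n'
    '   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)


def identifiers(document: str):
    """Return (root element name, public identifier, system identifier)."""
    doctype = minidom.parseString(document).doctype
    return doctype.name, doctype.publicId, doctype.systemId


print(identifiers(single_line))
print(identifiers(single_line) == identifiers(multi_line))  # True
```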


7 September 2000 - Three flavors of XHTML 1.0

XHTML 1.0 defines its task as "A Reformulation of HTML 4 in XML 1.0", and in achieving that it continues the precedent set by HTML 4 of having three different definitions of the language. All three definitions can be used to create valid XHTML, as each has its own Document Type Definition (DTD).

The Transitional DTD is probably the closest to HTML as commonly practiced on the Web. It includes a full range of formatting-oriented markup and supports the target attribute for linking between frames.

The Frameset DTD provides the markup needed to build frame-based sites, like the frameset, frame, and noframes elements. The Frameset DTD is intended for documents containing frames, not documents which appear inside of frames but don't contain frames themselves. Otherwise, it is very much like the Transitional DTD.

The Strict DTD represents XHTML the way the W3C would like to see it. Deprecated elements (like isindex) have been removed, formatting elements and attributes (like the font element and the align attribute) stripped, and all support for frames (including the target attribute) removed. Formatting and presentation are largely left to Cascading Style Sheets (CSS).

Moving forward into XHTML 1.1, it looks like the W3C is going to use the Strict DTD as its foundation, though they've gone to the trouble of creating modules representing the features the Strict DTD lacks. When XHTML 1.1 comes out, this may be a powerful motivation for learning more about XHTML Modularization.

An even simpler version of XHTML, XHTML Basic, strips down the XHTML 1.1 vocabulary to a minimum level for communicating in environments where full support for XHTML may not be available. This could include cellphones, PDAs, embedded browsers, or even simple XML programs that need to reuse simple textual markup from XHTML.

Developers can choose which version of XHTML to use on a document-by-document basis. Most sites converting from legacy HTML will probably find it easiest to move to the Transitional DTD. Sites using frames will likely have to use the Frameset and Transitional DTDs. Developers who want a head start moving toward XHTML 1.1 - and can live without support for frames - can use the Strict DTD.


6 September 2000 - Namespaces in XHTML 1.0

XHTML 1.0 makes very simple use of XML's namespace facilities, using them only to label the elements within an XHTML document as XHTML. Despite some initial controversy, XHTML 1.0's use of namespaces now seems settled, stable, and simple to use.

While the DOCTYPE declaration has long served to identify HTML and XHTML documents, providing element-level identification is becoming more important as the W3C develops new ways to include multiple vocabularies in the same document. As Scalable Vector Graphics (SVG), Synchronized Multimedia Integration Language (SMIL), Mathematical Markup Language (MathML), and other task-specific formats emerge, the W3C would like to allow developers to integrate them with XHTML.

To avoid the 'tag soup' that characterized HTML's growth until recently, the W3C has come up with a mechanism for identifying vocabularies uniquely. The Namespaces in XML Recommendation allows developers to associate Uniform Resource Identifiers (URIs) with element and attribute names. Programs which understand XML Namespaces can then work with a combination of the base element name and the URI with which it is associated.

There are two ways to associate a URI with an element name. The element may have a prefix in front of the element name, with a colon between the prefix and the element name:

xhtml:p          xhtml:img           xhtml:table

That prefix is then mapped to a URI. The other option avoids the use of prefixes, and provides a URI mapping for all elements without a prefix, associating them with the 'default namespace'. XHTML 1.0 uses this approach, sparing XHTML developers a lot of typing which isn't yet necessary.

The namespace URI for XHTML 1.0 documents (whatever DTD they may use) is http://www.w3.org/1999/xhtml. The namespace declaration, which is made using an attribute, appears in the html element of all conforming XHTML 1.0 documents:

<html xmlns="http://www.w3.org/1999/xhtml">

This declaration tells XML parsers and XHTML processors to associate the namespace URI http://www.w3.org/1999/xhtml with all non-prefixed elements contained by the html element.

As long as the namespace declaration shown above appears in your html elements, your documents will have met XHTML 1.0's namespace requirements. (You still need to meet the other requirements, of course!) The namespace declaration made here will apply to all of the elements contained by the html element, unless another element redefines the default namespace.
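Namespace-aware software works with the combination of URI and element name, whichever way the document spells it. As a sketch, Python's standard xml.etree.ElementTree module reports each element as {URI}name, so the default-namespace form XHTML 1.0 uses and the prefixed form come out identical. The sample documents and variable names here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The default-namespace form that XHTML 1.0 uses.
default_form = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>Hello</p></body></html>'
)

# The same structure using an explicit prefix on every element.
prefixed_form = ET.fromstring(
    '<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    '<xhtml:body><xhtml:p>Hello</xhtml:p></xhtml:body></xhtml:html>'
)

default_tags = [element.tag for element in default_form.iter()]
prefixed_tags = [element.tag for element in prefixed_form.iter()]

print(default_tags)
# ['{http://www.w3.org/1999/xhtml}html',
#  '{http://www.w3.org/1999/xhtml}body',
#  '{http://www.w3.org/1999/xhtml}p']
print(default_tags == prefixed_tags)  # True
```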

Multiple namespaces shouldn't appear in conforming XHTML 1.0 documents. However, XHTML 1.0 includes a section describing how mixing elements from different namespaces might work, though such documents are not strictly conforming XHTML 1.0 documents.

Later tips will cover namespaces in more detail.


5 September 2000 - Quote those attribute values

HTML browsers were always quite forgiving about whether or not attribute values were quoted. The only times that attribute values really needed to be quoted were cases when the values contained whitespace. Otherwise, the browser would guess which parts of a tag were attribute values based on whitespace.

Markup like this was permitted, and remains common:

<IMG SRC=mypic.gif HEIGHT=20 WIDTH=30 ALT="This is my picture">

XML took a much stricter approach to syntax in general, requiring that all attribute values be quoted. Enforcing this requirement makes it much simpler for parsers to figure out which content in a tag is an attribute, and avoids the potential for chaos brought on by the possibility of quotes or equals signs inside of attribute content. XHTML enforces the same requirement.

XHTML requires that the element above look like:

<img src="mypic.gif" height="20" width="30" alt="This is my picture" />

or:

<img src='mypic.gif' height='20' width='30' alt='This is my picture' />

XHTML, like XML, permits developers to choose single or double quotes as attribute delimiters. These can vary from attribute to attribute, but the type of quote used to mark the beginning of an attribute value must also be used to mark its end.

Thus, you can use markup like:

<img src="mypic.gif" height='20' width="30" alt='This is my picture' />

but you can't use the following, where the quote that opens each value doesn't match the one meant to close it:

<img src="mypic.gif' height='20" width="30' alt='This is my picture" />
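Here is a sketch of how an XML parser treats each of these variations, using Python's standard xml.etree.ElementTree module; the variable names and the attributes helper are illustrative. The mismatched example is worth a close look: because each stray single quote happens to fall between a pair of double quotes, the parser doesn't complain about that particular line. It just reads two long, useless attribute values instead of the four you meant.

```python
import xml.etree.ElementTree as ET


def attributes(markup: str):
    """Return the element's attributes as a dict, or None if parsing fails."""
    try:
        return dict(ET.fromstring(markup).attrib)
    except ET.ParseError:
        return None


unquoted = '<img src=mypic.gif height=20 width=30 alt="This is my picture" />'
double_quoted = '<img src="mypic.gif" height="20" width="30" alt="This is my picture" />'
single_quoted = "<img src='mypic.gif' height='20' width='30' alt='This is my picture' />"
mixed = """<img src="mypic.gif" height='20' width="30" alt='This is my picture' />"""
mismatched = """<img src="mypic.gif' height='20" width="30' alt='This is my picture" />"""

print(attributes(unquoted))  # None - unquoted values are not well-formed
print(attributes(double_quoted) == attributes(single_quoted) == attributes(mixed))  # True
print(attributes(mismatched))
# {'src': "mypic.gif' height='20", 'width': "30' alt='This is my picture"}
```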


4 September 2000 - Moving to lower case

One of the biggest complaints about XHTML is that all markup - element names, attribute names, and even some attribute values - must be lower case. Upper case and mixed case will both generate validation errors.

For some developers, this writer included, this has meant a fairly drastic change in hand-coding style, and the transition hasn't always been smooth. For tool developers, it can be even more of a problem, requiring picking through code to find all the markup and convert it to consistent lower case.

why

Despite the cost, there are some solid reasons why the XHTML Working Group had to choose a particular case for markup and stick to it.

The W3C Working Group that created XML considered internationalization a critical issue for building infrastructure on the Internet. XML moved away from the focus on ASCII or ISO-8859-1 (Latin-1) that SGML and HTML had often had, and built itself on Unicode.

Unicode provides enough character positions for nearly every language in the world, along with facilities for extensions and private character areas that should keep it from hitting a wall in the foreseeable future. Effectively, Unicode gives developers a way to mix English, Chinese, Basque, Hindi, Korean, Vietnamese, Russian, Japanese, Arabic, Urdu, and many many more languages in a document without having to split the document into smaller pieces or use strange escape sequences.

One consequence of using Unicode for markup - and in particular of allowing non-Latin characters to be used in element and attribute names - is that there are many languages that don't recognize conventions like upper and lower case. Requiring that XML processors perform case-folding brings a new level of complexity to the parsing operation. Since XML parsers were supposed to be reasonably simple and even small, this could have complicated matters enormously.

As a result, the XML Working Group decided that XML markup would be entirely case-sensitive, making br and BR two entirely different elements. It's much simpler and leaves far less room for conflicting interpretations of the same document.
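A short sketch makes the case-sensitivity concrete. It uses Python's standard xml.etree.ElementTree module, and the sample fragments are invented for the example: br and BR come back as different element names, and an end tag whose case doesn't match its start tag is an error.

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring("<p>one<br/>two</p>")
upper = ET.fromstring("<p>one<BR/>two</p>")

# The parser performs no case-folding: the names are reported as written.
print(lower[0].tag, upper[0].tag)    # br BR
print(lower[0].tag == upper[0].tag)  # False

# A start tag and end tag that differ only in case do not match.
try:
    ET.fromstring("<p>text</P>")
    mixed_case_tags_parse = True
except ET.ParseError:
    mixed_case_tags_parse = False
print(mixed_case_tags_parse)  # False
```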

It's not entirely clear why the XHTML working group chose to use lower case for XHTML markup rather than upper case, but it is clear that they would have faced complaints from partisans of the losing side in either case.

how to deal with it

For XHTML developers, the impact of this decision is pretty simple: all markup must be in lower case. This means that all element names, attribute names, and some attribute values must be in lower case.

For element names, this means img instead of IMG, blockquote instead of BLOCKQUOTE, table instead of TABLE, and so on.

For attribute names, this means href instead of HREF, onclick instead of onClick, maxlength instead of MAXLENGTH, and so on. The entire attribute name must be in lower case, even when it is a combination of two words, like most of the event handling attributes.

For attribute values, case sometimes matters and sometimes doesn't. Some attributes, like the alt attribute of the img element, are designed to accept free-form text descriptions. Similarly, domain names within URLs are not case-sensitive, so you can still write "http://SIMONSTL.com" instead of "http://simonstl.com" if you like - unless you're using the URI in a namespace declaration (like xmlns).

However, in cases where you select a value from a range of choices, you must use lower case. To create a text input on a form, you might use:

<input type="text" name="email" maxlength="255" size="20" />

but you can't use:

<input type="TEXT" name="email" maxlength="255" size="20" />

Similarly, the method attribute now takes get or post, not GET or POST.

Generally speaking, if the value of an attribute is a choice that directly changes the way your XHTML is presented, it needs to be lower case. If it's pointing to an external resource, providing alternate text, or providing a numeric (hex) value, case still doesn't matter. When in doubt, use lower case.
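For anyone picking through legacy markup, one possible starting point is Python's standard html.parser module, which already lower-cases element and attribute names as it reads forgiving HTML. The sketch below is not a finished converter. The class name StartTagNormalizer and the function normalize_start_tags are hypothetical, it only handles start tags, and it leaves attribute values alone, so enumerated values like TEXT still need attention by hand or by a smarter tool.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr


class StartTagNormalizer(HTMLParser):
    """Collect start tags rewritten with lower-case names and quoted values."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lower-cased the element and attribute names.
        parts = [tag]
        for name, value in attrs:
            # Assumption: an attribute written without a value is expanded
            # to name="name" so that it still has a quoted value.
            text = name if value is None else value
            parts.append(f"{name}={quoteattr(text)}")
        self.tags.append("<" + " ".join(parts) + ">")


def normalize_start_tags(html: str):
    """Return the start tags found in html, in normalized form."""
    normalizer = StartTagNormalizer()
    normalizer.feed(html)
    normalizer.close()
    return normalizer.tags


print(normalize_start_tags('<A HREF=index.html onClick="go()">'))
# ['<a href="index.html" onclick="go()">']
print(normalize_start_tags("<INPUT TYPE=TEXT NAME=email>"))
# ['<input type="TEXT" name="email">']
```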

The new pieces XHTML adds to HTML - like the XML declaration at the start of the document - need to be in lower case as well. (The DOCTYPE declaration should retain its familiar mixed case.)

The contents of comments and elements may of course use whatever case you need to properly convey your documents' message.


2 September 2000 - Empty elements and empty tags

HTML has always had a number of elements which don't include textual content directly, including area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta, and param.

In HTML, all of these elements were typically represented with just a start tag. Browsers understood that these elements didn't have textual content, so they never worried about finding an end tag for the element.

XML doesn't assume that browsers or other applications know anything about the vocabulary, so it doesn't allow document authors to assume their applications are very smart. All elements must have a clearly defined beginning and end.

An empty element may be represented in XML syntax by a start tag immediately followed by an end tag:

<br></br>

Alternatively, XML provides a shortcut syntax, called an empty tag, that puts a slash (/) right before the closing pointy bracket (>):

<br/>

XHTML uses these empty tags for its empty elements, but it does so with a slight tweak to get around incompatibilities with a wide range of older browsers. Many browsers have problems with <br></br> (producing an extra line break) and with <br/> (they either don't recognize it as a line break or display the slash immediately afterward).

To get around these difficulties, the XHTML 1.0 Recommendation suggests that all empty tags include a space before the slash:

<br />

The same approach can be used with empty tags containing attributes:

<img src="mypic.gif" alt="picture of me" />
<input type="text" name="email" maxlength="255" size="20" />

This relatively simple workaround gives XHTML developers the best of both worlds: proper display in HTML browsers, and clean structures that can run through XML parsers and into XML tools. Empty tags are both a shortcut for developers and a workaround that helps XHTML fit into the existing HTML world.
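From the XML side, all three spellings of an empty element are the same thing. Here is a sketch using Python's standard xml.etree.ElementTree module, with made-up sample fragments and an illustrative parses helper. Each form parses to an identical element, ElementTree happens to write empty elements back out with the space before the slash, and a lone HTML-style start tag is rejected.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


forms = [
    "<p>one<br></br>two</p>",  # start tag immediately followed by end tag
    "<p>one<br/>two</p>",      # empty tag
    "<p>one<br />two</p>",     # empty tag with a space before the slash
]

# Parse each form and write it back out; a set collapses identical results.
serialized = {ET.tostring(ET.fromstring(form), encoding="unicode") for form in forms}
print(serialized)  # {'<p>one<br />two</p>'}

# The HTML habit of a start tag with no end is not well-formed XML.
print(parses("<p>one<br>two</p>"))  # False
```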


1 September 2000 - Think in elements, not tags

When I first learned HTML, it was all about using tags to mark up and format text. Start tags were the main weapon, and end tags were useful for turning off whatever I'd done with a start tag.

I rarely used </body> and </html> - browsers didn't care, so why should I? Similarly, I never marked the start of paragraphs, since the browser could accept <p> as the equivalent of a paragraph mark.

XHTML requires a different approach, taken from XML. Tags are just markers for the beginnings and ends of elements, and these must be marked explicitly. It's the elements that matter, not just the tags. Elements may contain other elements, but they can't contain a portion of another element. All nesting must be clean and explicit.

For example, this markup is illegal in XHTML (and XML):

<b>This is bold. <i>This is bold italic.</b> This is italic.</i>

Because the b and i elements overlap, this markup isn't proper XML or XHTML. To make it work, you need to write:

<b>This is bold. </b><i><b>This is bold italic.</b> This is italic.</i>

or

<b>This is bold. <i>This is bold italic.</i></b> <i>This is italic.</i>

Why? XHTML expects elements to be containers, not just formatting markers in a stream. This container-based approach makes it much easier to process and store information based on its structure, and is critical to XHTML's eventual goal of letting developers mix and match new vocabularies with 'classic HTML'.
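The container rule is easy to test against a parser. In this sketch, which uses Python's standard xml.etree.ElementTree module, the three fragments from above are wrapped in a p element so that each has a single root; the wrapper, the helper name text_of, and the variable names are additions for illustration. The overlapping version is rejected, while both cleanly nested versions parse and carry the same text.

```python
import xml.etree.ElementTree as ET


def text_of(markup: str):
    """Return the text content of the markup, or None if it is not well-formed."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        return None


overlapping = "<p><b>This is bold. <i>This is bold italic.</b> This is italic.</i></p>"
nested_one = "<p><b>This is bold. </b><i><b>This is bold italic.</b> This is italic.</i></p>"
nested_two = "<p><b>This is bold. <i>This is bold italic.</i></b> <i>This is italic.</i></p>"

print(text_of(overlapping))  # None
print(text_of(nested_one))   # This is bold. This is bold italic. This is italic.
print(text_of(nested_one) == text_of(nested_two))  # True
```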


For more information about this list, visit the main page.

Copyright 2000 by Simon St.Laurent