Gorille - Up Close and Personal with XML and Unicode Characters

Gorille is a small Java package designed to let developers of various kinds of XML processors test the content and names of XML structures in their XML documents. While Gorille ships with test files for both XML 1.0 and the draft XML 1.1, you can create your own configuration files as well.

Gorille uses an XML format to specify lists of characters according to either XML 1.0 conventions (with its BaseChar, Ideographic, CombiningChar, Digit, and Extender productions) or XML 1.1 conventions (NameStartChar, NameChar). Both forms permit specification of the Char and S production for content characters and whitespace. I've included sample lists for both XML 1.0 and XML 1.1, as well as an ASCII-only version of XML 1.0. (Gorille 0.4 added functionality to compile these lists into code, avoiding the loading process at startup.)

Gorille performs checking of Name, Names, QName, NMTOKEN, and NMTOKENS, as well as character checking for any of the productions listed above. This checking is performed by XML parsers as documents are parsed, but Gorille may be useful for checking XML documents generated by programs or to restrict documents to subsets of the characters allowed by XML.
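As a rough illustration of what Name and NMTOKEN checking involves, here is a minimal sketch built on the NameStartChar and NameChar ranges from the XML 1.1 Recommendation. It is not Gorille's own API: the class and method names are hypothetical, the ranges are hard-coded rather than loaded from a character list file, and the draft XML 1.1 that Gorille targets may differ in detail. It also relies on String.codePointAt, which requires Java 5 or later.

```java
/** Illustrative only: hypothetical names, not Gorille's actual classes. */
final class XmlNameSketch {
    // Inclusive [low, high] code point ranges for NameStartChar (XML 1.1).
    private static final int[][] NAME_START = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Ranges that NameChar allows in addition to NameStartChar.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int cp, int[][] ranges) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isNameStartChar(int cp) {
        return inRanges(cp, NAME_START);
    }

    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || inRanges(cp, NAME_EXTRA);
    }

    /** NMTOKEN: one or more NameChar. */
    static boolean isNmtoken(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);   // joins a valid surrogate pair into one code point
            if (!isNameChar(cp)) {
                return false;            // unpaired surrogates fall outside every range
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Name: a NameStartChar followed by zero or more NameChar. */
    static boolean isName(String s) {
        return !s.isEmpty() && isNameStartChar(s.codePointAt(0)) && isNmtoken(s);
    }
}
```

With this sketch, XmlNameSketch.isName("1doc") is false because a digit cannot start a Name, while XmlNameSketch.isNmtoken("1doc") is true. Restricting documents to a subset of characters would amount to shrinking the range tables.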

Gorille relies completely on Java's built-in support for Unicode strings and characters, though it doesn't use any of the Unicode property information Java provides (in java.lang.Character and java.lang.Character.UnicodeBlock). Starting in version 0.3, Gorille provides support for the Surrogates Area (13.4) of Unicode (U+D800-U+DFFF) and for characters at U+10000 and above, which are represented by surrogate pairs (3.7). Java itself doesn't recognize these characters as such, but does permit their inclusion in strings as UTF-16 code units.
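To make the surrogate arithmetic concrete, the following sketch combines a high and a low surrogate into a single code point using only char comparisons and shifts, so it does not depend on any surrogate support in the Java library. The class and method names are hypothetical and are not taken from Gorille's source; the formula itself is the standard UTF-16 decoding rule.

```java
/** Illustrative only: hypothetical helper, not part of Gorille. */
final class SurrogateSketch {
    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    /** Combines a valid high/low pair into a code point in U+10000..U+10FFFF. */
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /**
     * Returns the code point starting at index i, or -1 if the char at i is
     * an unpaired surrogate. A low surrogate is always reported as -1 here,
     * so callers should advance by two chars after reading a pair.
     */
    static int codePointAt(String s, int i) {
        char c = s.charAt(i);
        if (isHighSurrogate(c)) {
            if (i + 1 < s.length() && isLowSurrogate(s.charAt(i + 1))) {
                return combine(c, s.charAt(i + 1));
            }
            return -1;
        }
        if (isLowSurrogate(c)) {
            return -1;
        }
        return c;
    }
}
```

For example, the pair D834 DD1E decodes to U+1D11E, and the pair D800 DC00 decodes to U+10000, the first character beyond the range a single Java char can hold.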

Gorille does permit some rather perverse modifications of the productions - you could, for instance, require that all content be in control characters while all names be ideographic - but my hope is that developers will use it in reasonable ways which don't create arbitrary explosions as programs reject bad information.

The Gorille package includes a SAX Filter which tests values. Gorille will eventually be used to provide name- and content-checking for MOE. A Java FilterReader for preprocessing content before it reaches a parser is another option, though work on that has not yet begun.
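For readers unfamiliar with SAX filters, the sketch below shows the general shape of a filter that tests names and content as events pass through. It is not Gorille's filter: AsciiOnlyFilter and its accepts helper are hypothetical, and the rule it enforces (printable ASCII plus tab, line feed, and carriage return) is simply an assumed example of restricting documents to a character subset. It uses only the standard org.xml.sax and javax.xml.parsers APIs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Illustrative only: hypothetical filter, not the one shipped with Gorille. */
class AsciiOnlyFilter extends XMLFilterImpl {

    // Assumed policy for this sketch: printable ASCII and XML whitespace only.
    private static boolean allowed(char c) {
        return c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c <= 0x7E);
    }

    private static void check(String what, String text) throws SAXException {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!allowed(c)) {
                throw new SAXException("Disallowed character U+"
                        + Integer.toHexString(c).toUpperCase() + " in " + what);
            }
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        check("element name", qName);
        for (int i = 0; i < atts.getLength(); i++) {
            check("attribute name", atts.getQName(i));
            check("attribute value", atts.getValue(i));
        }
        super.startElement(uri, localName, qName, atts); // pass the event downstream
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        check("character content", new String(ch, start, length));
        super.characters(ch, start, length);
    }

    /**
     * Parses the given XML text through the filter. Returns false for any
     * SAXException, which includes ordinary well-formedness errors as well
     * as the character violations detected above.
     */
    static boolean accepts(String xml) {
        try {
            AsciiOnlyFilter filter = new AsciiOnlyFilter();
            filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
            filter.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a real pipeline the filter would sit between the parser and an application's ContentHandler, set with setContentHandler, so that only documents passing the checks reach the application.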

At the moment, I'm putting most of my attention into the Ripper class, which reports XML documents as context and character events. While Ripper is not an XML parser, it is designed as a foundation for processing XML at the character level and possibly for the creation of more proper XML parsers.

Gorille is currently in alpha. The core functionality seems complete, but there's still potential for improvement, expansion, and as always, better documentation. (Including RDDL documents for the character list and test files!) The Ripper code in particular is liable to change substantially.

Gorille is distributed under the Mozilla Public License 1.1. For more information, see the javadoc.

A download is available.