Learning iXML

Steven Pemberton often runs tutorials for learning iXML at markup conferences and, although it wasn’t on the schedule, he very kindly tacked one onto the end of a course I was attending on XForms, during the 2024 XML Summer School, in Oxford.

If you’re unable to attend one of Steven’s tutorials, Norm Tovey-Walsh has created an online tutorial, Writing Invisible XML grammars. The Invisible XML Specification is published at https://invisiblexml.org/1.0/, and I recommend paying particular attention to the Complete Grammar section so that you’re aware of the pre-defined rules. It’s also a great example of what an iXML grammar looks like, and immediately after it is an example of the XML that would be generated if you applied the iXML grammar to itself. The spec is developed and maintained by the W3C Invisible Markup Community Group, who have an email discussion group that anyone can message and a public archive that you can search (in case your question has been raised and discussed before). There is a list of grammars on https://invisiblexml.org/ but, in general, I’ve struggled to find publicly available iXML grammars.

For my own first attempt at writing an iXML grammar, I chose the DTD document type (DOCTYPE) declaration “string” as the text I wanted to parse. Although DTDs have technically been superseded by more modern schema languages, they’re still fairly commonplace. The DOCTYPE declaration can be used to associate an XML document with an external Document Type Definition (DTD). I chose external because it’s much simpler than internal and the use-cases I had in mind all related to external definitions.

When working with XML documents, it’s sometimes useful or necessary to know exactly what the content model is. In human mode, I can open the document and read the schema association information; if that’s not there, we’re into a whole other problem that’s out of scope for this particular use case. Programmatically, it’s not so easy, as that information is discarded when a DOM object is created.

Another reason for choosing the DOCTYPE declaration was that the rules for writing one are explicitly defined in the specification for XML [XML1-0], so I wouldn’t have the additional overhead of trying to infer them from sample documents.

Once I started drafting my iXML grammar, I soon wanted to test it out and “eyeball” what was coming out. We hadn’t been running iXML transformations locally on our laptops during the tutorial (because it was unplanned) so I wasn’t already set up for it. Happily, Google led me to CoffeePot, an Invisible XML processor developed and maintained by Norm Tovey-Walsh [CPOT]. It’s distributed as a jar file, can easily be run from the command line and has a useful manual. For a list of alternative implementations, see invisiblexml.org.

Next, I needed a sample input document. This is when I discovered that it wasn’t possible to use iXML itself to find and isolate the DOCTYPE declaration within the XML document; unlike a regular expression, an iXML grammar must match the entirety of the input document. At this point, as I was simply looking for sample input to experiment with, I should have just manually copied and pasted the DOCTYPE declaration string into a file of its own. However, I had it stuck in my head that my real-world use case was to programmatically parse the string and convert it to XML. So I created an ANT macro to load the XML document as text and use a regular expression to find and extract the DOCTYPE declaration. I could have made my regular expression more complex in order to identify the component parts of the DOCTYPE declaration, but that would have defeated the point of the learning exercise. Also, one of the advantages of an iXML grammar over a regular expression is readability. Understanding even a slightly complex regular expression written by someone else (or by myself a while ago) can be challenging, whereas an iXML grammar is commonly expressed as named components, so it’s easier to identify the logical “building blocks” and how they relate to each other.
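To make the extraction step concrete, here is a minimal sketch of the same idea in Python rather than ANT. It is not the macro described above: the pattern, the function name extract_doctype and the sample document are all assumptions made for illustration. It only handles external declarations (no internal subset) and doesn’t guard against the text “<!DOCTYPE” appearing inside a comment earlier in the document.

```python
import re

# Hypothetical pattern (an assumption, not the ANT macro's actual regex).
# After "<!DOCTYPE" it accepts quoted literals, or any character that is not
# a quote, "[" or ">", up to the closing ">". Because "[" is excluded, a
# declaration with an internal subset is deliberately not matched.
DOCTYPE_RE = re.compile(r"""<!DOCTYPE\s+(?:"[^"]*"|'[^']*'|[^"'\[>])*>""")


def extract_doctype(xml_text):
    """Return the first external DOCTYPE declaration found, or None."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


# Illustrative XHTML document, held as plain text rather than parsed as XML.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>t</title></head><body/></html>\n'
)

if __name__ == "__main__":
    print(extract_doctype(sample))
```

The regular expression is used here only to locate the declaration within a larger text, which is the part an iXML grammar can’t do because it has to match the whole input. Breaking the extracted string into its component parts is left to the grammar.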

Figure 3. Single-line text file containing a DOCTYPE declaration for XHTML


Figure 3 shows an example of a DOCTYPE declaration string after it's been extracted from an XHTML document and saved separately in a text file.

Now I was ready to test.