Generating the generator

Thus I fell back on old habits, and began writing what I thought would be a short routine to write a moderately long regular expression. Because I had used Perl for this in the past, and because Perl is interpreted (and thus an easy language with which to perform rapid cycles of tweak-and-test), I wrote this program in Perl. This, I now believe, was a mistake.

The immediate output of the program was to be a regular expression, that is, a string, which I imagined I would generate once (after building and debugging the generation routine) and copy into an appropriate schema. Thus a string manipulation language like Perl seemed appropriate. However, in my zeal I forgot a universal truth about writing programs, even simple ones: they need to be tested and debugged, repeatedly and thoroughly. In this case each round of testing required that the string be copied from standard output into a schema against which some test data could be validated. (Remember that I could not use Perl to test the regular expression directly against test data, because I was generating not a Perl-flavored regular expression but a W3C-flavored one.) Thus, in order to save time, it made sense for the program either to insert the regular expression into the test schema for me, or to write a complete test schema (including the regular expression) anew each time it was run. While the former technique is perhaps more desirable from the point of view of separation of concerns, the latter is much easier to write and is preferable insofar as it keeps all the concerns (as it were) in one file.

My preferred schema languages are RELAX NG and ISO Schematron,[38] either of which can be used to test the value of tei:rendition/@selector against a W3C-flavored regular expression. Thus I soon modified the generation program so that instead of writing just a regular expression to standard output, it wrote either a small but complete RELAX NG schema or a small but complete XSLT program, each of which was designed to test only the value of selector against the current version of the generated regular expression.
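As a rough illustration only, a generator of this kind might wrap the regular expression in a minimal RELAX NG schema along the following lines. This is a sketch in Python, not the Perl program described here and not the schema design detailed below. The function name build_test_schema, the placeholder pattern, and the assumption that the test document's root element is tei:rendition are all hypothetical.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
TEI = "http://www.tei-c.org/ns/1.0"
XSD_TYPES = "http://www.w3.org/2001/XMLSchema-datatypes"


def build_test_schema(regex: str) -> str:
    """Return a small RELAX NG schema (XML syntax) that checks
    tei:rendition/@selector against `regex` via the XSD pattern facet.

    Hypothetical sketch: assumes the test document's root is tei:rendition.
    """
    ET.register_namespace("rng", RNG)
    grammar = ET.Element(f"{{{RNG}}}grammar", {"datatypeLibrary": XSD_TYPES})
    start = ET.SubElement(grammar, f"{{{RNG}}}start")
    rendition = ET.SubElement(
        start, f"{{{RNG}}}element", {"name": "rendition", "ns": TEI}
    )
    selector = ET.SubElement(
        rendition, f"{{{RNG}}}attribute", {"name": "selector"}
    )
    data = ET.SubElement(selector, f"{{{RNG}}}data", {"type": "string"})
    param = ET.SubElement(data, f"{{{RNG}}}param", {"name": "pattern"})
    # The regular expression is stored as text; the serializer escapes it.
    param.text = regex
    # Allow arbitrary text content inside the rendition element.
    ET.SubElement(rendition, f"{{{RNG}}}text")
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    # Placeholder pattern, standing in for the generated regular expression.
    print(build_test_schema("[a-z]+"))
```

Writing the whole schema on every run, as here, removes the manual copy-and-paste step from each round of testing; the output could be redirected to a file and handed to any RELAX NG validator that supports the XSD datatype library.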

The reason for generating XSLT output instead of ISO Schematron output was purely pragmatic. The Schematron processor I use works by converting the Schematron to an XSLT intermediate (using XSLT), and then transforming the test document using the intermediate XSLT. By writing XSLT directly from the generation program, I could save a conversion step during each test and still use the same engine to execute the regular expression.

Details about the design of the output RELAX NG schema and XSLT program follow. But their mere existence explains why my use of Perl was a mistake. Both of the desired output formats were XML, and for me XSLT is the best language to use for generating XML as output.[39] (Even those who do not think of XSLT as the best language for writing XML will admit that it is far better than Perl.)
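The difference between writing XML as characters and writing it as a tree can be sketched briefly. The example below uses Python rather than Perl or XSLT, purely for illustration; the function names and the sample pattern are hypothetical. It pastes a regular expression containing & and < into an element by string concatenation, then does the same through a tree API whose serializer handles the escaping.

```python
import xml.etree.ElementTree as ET


def by_concatenation(regex: str) -> str:
    # Characters are pasted in as-is; nothing escapes '&' or '<'.
    return '<param name="pattern">' + regex + "</param>"


def by_tree(regex: str) -> str:
    # The element is built as a node; escaping happens at serialization.
    param = ET.Element("param", {"name": "pattern"})
    param.text = regex
    return ET.tostring(param, encoding="unicode")


def is_well_formed(xml_text: str) -> bool:
    # True if an XML parser accepts the text, False otherwise.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    sample = "a&b|<c"  # hypothetical pattern containing markup characters
    print(is_well_formed(by_concatenation(sample)))
    print(is_well_formed(by_tree(sample)))
```

The concatenated version is only as well-formed as the programmer is careful, whereas the tree-based version cannot emit an unescaped ampersand or an unclosed element in the first place.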



[38] That is, my preferred schema languages other than TEI PureODD.

[39] I believe it is very advantageous to use a language, like XSLT, whose output is a tree serialized as XML (rather than a sequence of characters, some of which are pointy brackets), and which thus cannot make most well-formedness errors. Without such a language, simple well-formedness errors creep in constantly. Even in the 297 XML files that make up the XML version of the W3C CSS3 Selectors Test Suite Index, which appear to me to have been generated by a program, I found four files with one well-formedness error each (<br> without an end-tag in all four cases).