FreqX is a tool that produces reports about elements and attributes found in a body or corpus of XML documents. Some uses of FreqX have included:
Researching elements that could be dropped from a new version of a vocabulary;
Investigating whether there were values of role or class attributes that were used frequently enough to suggest a new element to represent the concept;
Investigating a large body of documents (thousands or tens of thousands) as part of maintaining or writing transformations or other software;
Producing pretty reports for conference papers or client reports (the importance of this should not be underestimated).
This paper discusses some of the challenges that one encounters when writing such a tool, and some of the (often arbitrary) design decisions taken.
Some of the challenges included:
Supporting multiple ways to specify which documents to process;
Handling documents that contain parse errors;
Running in a reasonable time;
Not running out of memory;
Coping with documents requiring different DTD files for the same PUBLIC identifier.
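To make two of these challenges concrete (surviving parse errors and keeping memory use bounded), here is a minimal sketch of the general idea. It is not FreqX's code: FreqX is written in XSLT, and this illustration uses Python's standard library instead. The function name `count_names` and the sample corpus are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_names(documents):
    """Tally element and attribute names over (label, xml_text) pairs.

    Hypothetical helper for illustration only. Returns a tuple
    (element_counts, attribute_counts, failures), where failures lists
    the labels of documents that could not be parsed.
    """
    elements = Counter()
    attributes = Counter()
    failures = []
    for label, xml_text in documents:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            # Record the failure and carry on with the rest of the corpus.
            failures.append(label)
            continue
        # Only the running totals outlive each document, so memory use
        # depends on the number of distinct names, not on corpus size.
        for node in root.iter():
            elements[node.tag] += 1
            for name in node.attrib:
                attributes[(node.tag, name)] += 1
    return elements, attributes, failures


# A made-up two-document corpus; the second document is not well-formed.
corpus = [
    ("a.xml", '<doc><p role="note">x</p><p>y</p></doc>'),
    ("b.xml", '<doc><p class="lead">z</p>'),
]
elements, attributes, failures = count_names(corpus)
```

Because `documents` can be any iterable, a caller could pass a generator that reads one file at a time, so that only a single parsed document is held in memory at once.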
Initial FreqX development was sponsored by Mulberry Technologies. They wanted the tool written in XSLT since that was their primary language, and since it’s also the author’s, this was a good fit.
Subsequent development was funded by Delightful Computing, with help from Gerrit Imsieke of Le-Tex Publishing and others.