Parsing Errors

Saxon extends the fn:uri-collection() function so that it can descend recursively into a directory (folder) structure and return all files whose names match a pattern. FreqX uses this as one of its mechanisms for obtaining a list of files to analyze. But not every input file will be DTD-valid, and some may not even be well-formed XML, regardless of file name.
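
As an illustration of the directory scan described above, here is a minimal XSLT 3.0 sketch. It is not FreqX's own code: the parameter names input-dir and pattern and the directory path are hypothetical. The select and recurse query parameters are Saxon-specific extensions to the collection URI.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical parameters; these are not FreqX's actual names. -->
  <xsl:param name="input-dir" as="xs:string" select="'file:///path/to/corpus/'"/>
  <xsl:param name="pattern" as="xs:string" select="'*.xml'"/>

  <xsl:template name="xsl:initial-template">
    <!-- select= filters by file name; recurse=yes descends into subdirectories. -->
    <xsl:variable name="uris" as="xs:anyURI*"
        select="uri-collection($input-dir || '?select=' || $pattern || ';recurse=yes')"/>
    <!-- Emit one URI per line. Nothing is parsed at this stage. -->
    <xsl:value-of select="$uris" separator="&#10;"/>
  </xsl:template>

</xsl:stylesheet>
```

Because fn:uri-collection() returns only URIs, a file that is not well-formed causes no error here; any failure surfaces later, when the file is actually opened.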

FreqX wraps each call that opens a file (using the fn:doc() function) inside xsl:try and xsl:catch, so that a parse error does not terminate processing. It then records the errors. Because there may be a great many of them, only the first fifty are recorded by default; these appear in an expandable summary/details section of the HTML report.
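
The try/catch pattern described above might look roughly like the following XSLT 3.0 sketch. This is an assumption about the general shape, not FreqX's actual stylesheet: the parameters uris and max-errors, the example file paths, and the element names report, parsed and parse-error are all hypothetical.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="xs err">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameters; these are not FreqX's actual names. -->
  <xsl:param name="uris" as="xs:string*"
      select="('file:///path/to/a.xml', 'file:///path/to/b.xml')"/>
  <xsl:param name="max-errors" as="xs:integer" select="50"/>

  <xsl:template name="xsl:initial-template">
    <!-- Try to open every file; yield one element per success or failure. -->
    <xsl:variable name="results" as="element()*">
      <xsl:for-each select="$uris">
        <xsl:variable name="uri" select="."/>
        <xsl:try>
          <!-- doc() raises a dynamic error if the file cannot be parsed. -->
          <parsed uri="{$uri}" root="{name(doc($uri)/*)}"/>
          <xsl:catch errors="*">
            <!-- $err:code and $err:description are available inside xsl:catch. -->
            <parse-error uri="{$uri}" code="{$err:code}">
              <xsl:value-of select="$err:description"/>
            </parse-error>
          </xsl:catch>
        </xsl:try>
      </xsl:for-each>
    </xsl:variable>

    <!-- Count everything, but keep only the first $max-errors error records. -->
    <report parsed="{count($results[self::parsed])}"
            failed="{count($results[self::parse-error])}">
      <xsl:sequence
          select="$results[self::parse-error][position() le $max-errors]"/>
    </report>
  </xsl:template>

</xsl:stylesheet>
```

In this sketch the cap applies only to the recorded details, so the failed count still reflects every file that could not be parsed.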

The most common parse error is failure to find a DTD. The FreqX wrapper script takes optional filenames for a Java catalog resolver class and an XML catalog file. However, Saxon supports only one such catalog file in any given run, and a large corpus might well contain documents that use different, incompatible versions of a DTD under the same SYSTEM or PUBLIC identifier.

FreqX can be run multiple times, and the counts combined. A future version might be able to process only the failed documents from a previous run, so that one can more easily combine results from runs with different XML catalog files.