Gerrit Imsieke ran Saxon in profiling mode and discovered that, as mentioned in the previous section, merging the list of seen attributes was very slow. Not merging them saved a lot of time but produced wrong answers.
This is when the author started to suspect that the entire input was being kept in memory. The Saxon uri-collection() function was being called with a stable=no parameter, which the documentation seemed to suggest would mean the documents would not need to be kept in memory.
It appears, based on testing, that in fact stable=no simply means that calling uri-collection() multiple times in the same run might not always return the same set of document URIs (filenames). It does not relax the guarantee that the documents themselves are stable, and hence does not stop the documents from being kept in memory.
In the end what worked was processing each input document in an external stylesheet, called using the XPath 3.1 fn:transform() function. Since each invocation of XSLT was separate, it seemed that memory was not retained between invocations, and FreqX ran much faster.
<xsl:sequence select="
    transform(
      map {
        'stylesheet-location' : $process-file-xsl,
        'initial-template' : QName((), 'initial-template'),
        'stylesheet-params' : map {
          QName('http://www.delightfulcomputing.com/', 'freqx-control-doc') : $freqx-control-doc,
          QName('http://www.delightfulcomputing.com/', 'freqx-input-uri') : $this?name
        }
      }
    )?output/*" />
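The receiving side of that call might look like the following minimal sketch. It is not the FreqX stylesheet itself: the dc prefix, the summary and element result elements, and the per-element counting are all hypothetical, chosen only to show how the two namespaced parameters and the named initial template line up with the map passed to fn:transform().

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://www.delightfulcomputing.com/"
    exclude-result-prefixes="dc">

  <!-- Both parameters arrive through 'stylesheet-params' in the caller.
       Their types are left open here because the original does not state them. -->
  <xsl:param name="dc:freqx-control-doc"/>
  <xsl:param name="dc:freqx-input-uri"/>

  <!-- Matches 'initial-template' : QName((), 'initial-template') in the caller. -->
  <xsl:template name="initial-template">
    <!-- The document is loaded inside this transformation, not by the caller. -->
    <xsl:variable name="input" select="doc($dc:freqx-input-uri)"/>
    <!-- Hypothetical result: one summary element per input document. -->
    <summary uri="{$dc:freqx-input-uri}">
      <xsl:for-each-group select="$input//*" group-by="name()">
        <element name="{current-grouping-key()}"
                 count="{count(current-group())}"/>
      </xsl:for-each-group>
    </summary>
  </xsl:template>

</xsl:stylesheet>
```

Because no base-output-uri is supplied, the principal result is returned under the key output as a document node, which is why the caller selects ?output/* to obtain the element children.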
Performance can to some extent be measured using the built-in profiling in Saxon; an alternative is to transform the XSLT stylesheet to add xsl:message instructions at the start and end of each template of interest, and then to analyze the timestamps in the log file.
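The instrumenting transformation described here could be sketched as follows. This is an illustration under assumptions rather than the tool used for FreqX: it instruments every template instead of selected ones, and the ENTER and LEAVE labels are invented for the example. The messages carry no time themselves, so timestamps still have to be attached to the log by some other means.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy the input stylesheet unchanged except where a rule below applies. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Wrap the body of every template in a pair of xsl:message instructions. -->
  <xsl:template match="xsl:template">
    <!-- Use the template name if there is one, otherwise its match pattern. -->
    <xsl:variable name="label" select="string((@name, @match)[1])"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- xsl:context-item and xsl:param must remain first in the body. -->
      <xsl:apply-templates select="xsl:context-item | xsl:param"/>
      <!-- xsl:element creates the instruction as data instead of running it. -->
      <xsl:element name="xsl:message">
        <xsl:text>ENTER </xsl:text>
        <xsl:value-of select="$label"/>
      </xsl:element>
      <xsl:apply-templates
          select="node() except (xsl:context-item | xsl:param)"/>
      <xsl:element name="xsl:message">
        <xsl:text>LEAVE </xsl:text>
        <xsl:value-of select="$label"/>
      </xsl:element>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Two caveats apply to this sketch. A match pattern containing curly braces would be misread if the instrumented stylesheet enables expand-text, and an XSLT processor is not obliged to evaluate instructions in document order, so the messages are best read as approximate markers.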
Since XPath functions are deterministic, functions such as fn:current-time() always return the same value within a single XSLT episode. Therefore the time for profiling must be reported either with a Java native method call, or by using an external tool such as the combination of ts, from the Linux moreutils package, and unbuffer, which is distributed with Expect. The ts command adds a timestamp to each line of its input. However, program standard output is buffered for efficiency and is delivered in clumps when the buffer is full. The unbuffer command can be used to prevent that buffering and so get accurate timestamps.
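The first option, a native Java call, might be sketched as below. It assumes a Saxon edition that supports reflexive Java extension functions (Saxon-PE or Saxon-EE, not Saxon-HE); the System prefix, the message labels, and the placeholder workload are illustrative only.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:System="java:java.lang.System"
    exclude-result-prefixes="System">

  <!-- Assumption: reflexive Java calls are available (Saxon-PE or Saxon-EE). -->
  <xsl:template name="xsl:initial-template">
    <!-- Unlike fn:current-time(), this value changes between calls. -->
    <xsl:message select="'ENTER main', System:currentTimeMillis()"/>
    <result>
      <!-- Placeholder workload standing in for the template being measured. -->
      <xsl:value-of select="sum(1 to 1000000)"/>
    </result>
    <xsl:message select="'LEAVE main', System:currentTimeMillis()"/>
  </xsl:template>

</xsl:stylesheet>
```

Subtracting the two logged values gives elapsed milliseconds for the template body, subject to the processor actually evaluating the instructions in the order written.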
Timings can also be obtained by running the external stylesheet separately, outside the fn:transform() call. Memory can be measured at the same time, for example on Linux or Unix systems with /usr/bin/time (time without the path is a built-in in many shells, including bash, and gives less information).
People running timings need to keep overall system activity in mind, as well as overall system memory usage.
In the case of FreqX, memory rose to over sixty gigabytes on a test collection, and it became clear it was keeping all of the documents in memory.
Changing FreqX to use fn:transform() on the result of doc($filename) did not help.
Passing $filename as a parameter to the external stylesheet did help: runtime was reduced dramatically, as was memory usage.
A further refinement considered was to keep namespace URIs as integer keys into a map instead of as strings, but the additional complexity of passing that map back from the external stylesheet did not seem justified; the author may return to this in the future.