Implementation

An early version of FreqX used a recursive template to process one input file at a time. This quickly ran out of memory and was replaced with the XSLT 3 xsl:iterate instruction, which handled many more documents before it, too, ran out of memory; we shall return to this topic in a later section.
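In outline, the xsl:iterate version looks something like this (a sketch only: the collection URI, Saxon's ?select= collection syntax, and the $totals accumulator are assumptions, not the actual FreqX parameters):

<!-- Iterate over the input files, threading running totals through
     a parameter instead of recursing, so no call stack builds up. -->
<xsl:iterate select="uri-collection('input/?select=*.xml')">
  <xsl:param name="totals" select="()"/>
  <xsl:on-completion select="$totals"/>
  <xsl:next-iteration>
    <xsl:with-param name="totals"
      select="($totals, doc(.)//* ! name())"/>
  </xsl:next-iteration>
</xsl:iterate>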

The first strategy was to produce an XPath map from each input document, containing all distinct element names found, together with all attribute names and attribute values, and then to merge the maps at each iteration.
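In outline it looked something like the following (a sketch with hypothetical names, assuming the usual xmlns:map="http://www.w3.org/2005/xpath-functions/map" binding and the stylesheet's dc namespace):

<!-- One map of element-name counts per input document. -->
<xsl:variable name="doc-counts" as="map(xs:string, xs:integer)"
  select="map:merge(
    for $n in distinct-values($doc//*/name())
    return map { $n : count($doc//*[name() = $n]) }
  )"/>

<!-- Merge two count maps, summing the counts for shared keys. -->
<xsl:function name="dc:merge-counts" as="map(xs:string, xs:integer)">
  <xsl:param name="a" as="map(xs:string, xs:integer)"/>
  <xsl:param name="b" as="map(xs:string, xs:integer)"/>
  <xsl:sequence select="
    map:merge(
      distinct-values((map:keys($a), map:keys($b)))
        ! map { . : (($a(.), 0)[1] + ($b(.), 0)[1]) }
    )"/>
</xsl:function>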

This strategy turned out to use too much memory. A rewrite using elements instead of maps to represent the data was faster and used less memory.

It’s entirely likely that the problem was not that maps were slower than elements in Saxon 9. Rather, it is easy to include a fragment of an input document in a map; unlike a fragment copied into element content, such a fragment remains a live node in a document tree, so the entire document must be kept in memory to support possible XPath navigation away from the node. This is especially easy to do by mistake with attribute nodes.
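For example (a constructed illustration, not FreqX code), the first of these two entries stores a live attribute node, keeping its whole source tree reachable, while the second stores only an atomized string:

<!-- Keeps @class as a live node: the whole input tree stays in memory. -->
<xsl:variable name="leaky"    select="map { 'value' : @class }"/>

<!-- string() grounds the value, so the source tree can be reclaimed. -->
<xsl:variable name="grounded" select="map { 'value' : string(@class) }"/>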

One way to check for this is to run the stylesheet in XSLT streaming mode, because any template that then tries to return data from an input stream will raise an error. Nodes must instead be “grounded”, for example using the XSLT 3 fn:copy-of() function. However, converting a transformation to work correctly in streaming mode is generally non-trivial.
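For example, with a streamable mode, returning a node from the input directly is a streamability error, whereas a grounded copy is accepted (a sketch with a hypothetical item element, not FreqX code):

<xsl:mode streamable="yes"/>

<xsl:template match="item">
  <!-- <xsl:sequence select="."/> would be rejected here: it returns
       a live node from the stream. -->
  <!-- copy-of(.) returns a grounded copy, which is allowed. -->
  <xsl:sequence select="copy-of(.)"/>
</xsl:template>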

The current implementation makes an e element for each element in the input and an at element for each attribute seen. At the end of processing each input file, these are reduced to distinct values and converted into count elements that represent the totals seen so far.
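That pass might look something like the following (a sketch with hypothetical details, including the harvest mode and $harvested variable; the real FreqX markup may differ):

<!-- One e element per element and one at element per attribute;
     the {.} in value atomizes, so no live input nodes are kept. -->
<xsl:template match="*" mode="harvest">
  <e name="{name()}"/>
  <xsl:for-each select="@*">
    <at name="{name()}" value="{.}"/>
  </xsl:for-each>
  <xsl:apply-templates select="*" mode="harvest"/>
</xsl:template>

<!-- At the end of each file, collapse the e elements into counts. -->
<xsl:for-each-group select="$harvested[self::e]" group-by="@name">
  <count name="{current-grouping-key()}" n="{count(current-group())}"/>
</xsl:for-each-group>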

The process of combining at elements representing the set of attribute values seen so far turns out to be slow. It might be that a map would be faster than the current method; since profiling showed the merge to be slow, a workaround was adopted: merge attribute values only after every hundred input documents, and at the end of processing:

<xsl:with-param name="attribute-values-seen"
  select="
    (: merge into the running totals on the last two documents
       and on (roughly) every hundredth one :)
    if ((position() ge last() - 1) or (position() mod 100) eq 99)
    then
      dc:merge-attribute-values(
        $attribute-values-seen,
        $this-data-set[local-name() = 'at']
      )
    else (
      (: otherwise just deduplicate this document's values
         and append them to the accumulated sequence :)
      $attribute-values-seen,
      dc:merge-attribute-values(
        (),
        $this-data-set[local-name() = 'at']
      )
    )
  " />

The dc:merge-attribute-values() function makes a count element for each distinct attribute name/value combination; there can easily be millions of these, and as the sequence of pairs already discovered grows large, the process slows down considerably. So there is a trade-off between extra memory for duplicated attribute/value count elements and CPU time: running dc:merge-attribute-values() only at the end of processing can use a lot of memory, but running it for every document is slow.
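A function of roughly this shape (a reconstruction for illustration, not the actual FreqX source) shows why: every call re-groups the entire accumulated sequence:

<xsl:function name="dc:merge-attribute-values" as="element(count)*">
  <xsl:param name="old" as="element()*"/>
  <xsl:param name="new" as="element()*"/>
  <!-- Re-groups everything seen so far; with millions of distinct
       name/value pairs this grouping dominates the running time. -->
  <xsl:for-each-group select="$old, $new"
      group-by="@name || '=' || @value">
    <count name="{@name}" value="{@value}"
           n="{sum(current-group() ! number((@n, 1)[1]))}"/>
  </xsl:for-each-group>
</xsl:function>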

Although the same consideration applies to element names and to attribute names, there are generally many fewer of these.