The pipeline development was not without drawbacks. The first (and natural) approach to processing the documents in the input folder would be:
<p:for-each>
  <p:xslt>
    <p:with-input port="stylesheet" href="the-stylesheet.xsl" />
  </p:xslt>
</p:for-each>
However, if you process around 15,000 source documents, the stylesheet document is loaded around 15,000 times as well. Since this is needlessly wasteful, we changed the pipeline to:
<p:load href="the-stylesheet.xsl" name="stylesheet" />
<!-- ... -->
<p:for-each>
  <p:xslt>
    <p:with-input port="stylesheet" pipe="@stylesheet" />
  </p:xslt>
</p:for-each>
This performs somewhat better, but makes the pipeline more difficult to read. Moreover, it only partially resolves the inefficiency, because the stylesheet still has to be compiled for every iteration of the <p:for-each>. The same applies to XML schemas and Schematron schemas: they have to be prepared for use every time, although they do not change in our case. Using an XML catalog does not help here, because it only caches the document itself. The most effective solution would be to cache the ready-to-use stylesheet, schema, etc. However, simply caching them as default processor behaviour is not feasible in XProc 3.0: according to the specifications, it is perfectly possible to rewrite documents, stylesheets, etc. within a <p:for-each> so that a later iteration depends on documents produced in an earlier one.
A processor could perform some optimisation here by checking whether a given document is written inside a <p:for-each>. However, since catalog resolution etc. frequently takes place in XProc, there is no simple and reliable way to do this. From our perspective, this problem needs further investigation. One solution could be an (extension) attribute on <p:with-input> that allows a pipeline author to declare a stylesheet, schema, etc. to be cacheable. Whether this is implemented in XProc at the language level or takes the form of a vendor-specific extension is a discussion for the future.
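To make the idea concrete, the following is a minimal sketch of what such a declaration might look like. The attribute name ex:cacheable and its namespace URI are purely hypothetical and are not defined by XProc 3.0 or, to our knowledge, by any processor; the step name main and the port declarations are likewise assumptions added only to make the fragment a complete pipeline.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ex="http://example.org/ns/xproc-ext"
                name="main" version="3.0">

  <!-- The source documents to transform (assumed port layout) -->
  <p:input port="source" sequence="true" />
  <p:output port="result" sequence="true" />

  <!-- Load the stylesheet once, outside the loop -->
  <p:load href="the-stylesheet.xsl" name="stylesheet" />

  <p:for-each>
    <!-- Iterate over the source documents, not over the loaded stylesheet -->
    <p:with-input pipe="source@main" />
    <p:xslt>
      <!-- HYPOTHETICAL: ex:cacheable is an illustrative extension attribute.
           It would tell a supporting processor that the stylesheet does not
           change between iterations, so its compiled form may be reused. -->
      <p:with-input port="stylesheet" pipe="@stylesheet" ex:cacheable="true" />
    </p:xslt>
  </p:for-each>

</p:declare-step>
```

Because XProc 3.0 requires a processor to ignore extension attributes it does not implement, a pipeline written this way would remain portable: a processor without such a feature would simply compile the stylesheet in every iteration, as before.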