Reprise: Non-XML documents in XProc 1.0

First, a short reminder of what XProc 1.0 is, or I should say was, because I believe pipeline authors will appreciate the new possibilities of XProc 3.0 so much that they will switch to the new language version as soon as possible. One way to describe XProc 1.0 is to say that it is a pipeline language for designing XML workflows in XML. This definition rests on the fact that only XML documents are allowed to flow between the steps that make up a pipeline. This feature of XProc 1.0 is most visible in an error message one gets to see from time to time when working with XProc 1.0:

It is a dynamic error (XD0001) if a non-XML resource is produced on a step output or arrives on a step input.

Please note that this is dynamic error number 1. Although this error message is sometimes annoying, it clearly states the nature of XProc: it was invented and developed to specify a sequence of operations to be performed on a collection of XML input documents. Of course the creators of XProc knew that not every document is an XML document, so the original specification contains a section on "Non-XML documents". This section is about thirteen lines long and distinguishes between almost-XML documents (HTML) and non-XML documents. The former can be turned into XML. For the latter, a mechanism is presented that lets them flow quietly through the pipeline: the non-XML document is either converted into text or base64-encoded, and then wrapped in an element node inside a document node, which means it can now be processed like any other XML document.
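To make the wrapping mechanism concrete, here is a minimal sketch in Python. It is not how any XProc processor is implemented; it only imitates the idea. The function names `wrap_non_xml` and `unwrap` are hypothetical, the `c:data` wrapper with `content-type` and `encoding` attributes is assumed as the wrapper element, and treating every `text/*` payload as UTF-8 is a simplification made for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace of XProc's "c:" vocabulary, used here for the wrapper element.
C_NS = "http://www.w3.org/ns/xproc-step"
ET.register_namespace("c", C_NS)


def wrap_non_xml(payload: bytes, content_type: str) -> ET.Element:
    """Wrap a non-XML payload in a c:data element (hypothetical helper).

    Text media types are stored as character data; everything else is
    base64-encoded and flagged with encoding="base64".
    """
    wrapper = ET.Element(f"{{{C_NS}}}data", {"content-type": content_type})
    if content_type.startswith("text/"):
        # Assumption for this sketch: text payloads are UTF-8.
        wrapper.text = payload.decode("utf-8")
    else:
        wrapper.set("encoding", "base64")
        wrapper.text = base64.b64encode(payload).decode("ascii")
    return wrapper


def unwrap(wrapper: ET.Element) -> bytes:
    """Recover the original bytes from a wrapper built by wrap_non_xml."""
    if wrapper.get("encoding") == "base64":
        return base64.b64decode(wrapper.text or "")
    return (wrapper.text or "").encode("utf-8")


if __name__ == "__main__":
    jpeg_header = b"\xff\xd8\xff\xe0"  # first bytes of a JPEG file
    wrapped = wrap_non_xml(jpeg_header, "image/jpeg")
    print(ET.tostring(wrapped, encoding="unicode"))
```

The wrapped result is an ordinary element, so it can travel through a pipeline, but nothing useful can be done with the image itself until it is unwrapped again.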

This was of course an elegant way to deal with non-XML documents in XProc, but it also meant that these documents could only "go with the flow". XProc itself defines no steps for these documents, so the only interesting things you could do with them were to send them to a web service or to store them on your disk, provided they were not base64-encoded or your XProc processor implemented the optional feature of decoding base64-encoded documents before storing them. Of course you could store them anyway, but storing a base64-encoded JPEG wrapped in an XML element does not count as interesting, does it?

As this situation became unpleasant, different mechanisms were invented to work around it. Steps were invented to deal with ZIP archives, because we all know that XML documents sometimes come packed in a ZIP or need to be packed into one. But ZIP archives could not flow through a pipeline themselves, and although they may contain flowable XML documents, they typically also encapsulate a lot of content that could not flow through a pipeline. This is particularly problematic if you are going to pack a ZIP and have a lot of XML documents to put in, but also need some images and/or a text document.

To invent a workaround, the step designers probably looked at Ant and did what Ant does: read from and/or write to the file system. So unpacking a ZIP reads a file from the file system and creates a lot of files with the unpacked content in another place in the file system. Some steps just read from the file system but produce an XML document that flows in the pipeline, e.g. XML Calabash's extension step cx:metadata-extractor, which reads an image and produces an XML document containing the image's metadata. Other steps take XML documents as input but write their non-XML result to the file system, like XProc's optional step p:xsl-formatter, which typically produces PDF.

Some of these steps are really handy, some are mere workarounds. But from a purist perspective they are all workarounds, because XProc is designed to be a language in which documents flow between ports, and reading from and/or writing to the file system clashes with that style. But even if you do not share a purist approach to language design, there are some problems with this approach: