Introduction

XProc 1.0 [1] is clearly an XML-centric language for designing workflows; in fact, it is mostly an XML-only workflow language. To quote the XProc 1.0 specification:

Although some steps can read and write non-XML resources, what flows between steps through input ports and output ports are exclusively XML documents or sequences of XML documents.

This XML-only approach proved to be fine for a lot of tasks, but as it turns out, even workflows dealing mostly with XML documents often need to deal with non-XML data as well. Just think of an ePUB: it mostly contains XHTML documents, but it also has some JPEGs with illustrations and a manifest file which is pure text, and the ePUB itself is essentially a special kind of ZIP document.

As such workflows show up quite often in real life, the ability to deal with non-XML documents was a high-priority requirement when developing the next version of XProc, which is called "XProc 3.0".[16]

In this paper I would like to give an introduction to workflows for non-XML documents in XProc 3.0. To do this as practically as possible, I decided to lay out a typical workflow that has to deal with non-XML documents and to show how this could be done in XProc 3.0. Of course the workflow is a bit of a made-up story, because it was chosen for the purpose of demonstration. But it will show some basic structures for dealing with non-XML documents in XProc 3.0 and can serve as a blueprint for real-life projects.

The workflow discussed here is this: We have a bunch of ePUBs in a folder somewhere, and we have been asked to design a workflow which analyses the content of each ePUB and creates an RDF metadata description and an inventory in JSON. The inventory has to be sent to one of our inventory servers which, for whatever reason, happens to understand only JSON. I will explain the details of this workflow later, but please keep in mind that it involves dealing with a lot of non-XML documents such as ZIP (the ePUB itself), plain text, graphics in JPEG, RDF and, last but not least, JSON.
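To make the inventory part of this workflow a bit more tangible before turning to XProc 3.0, here is a minimal sketch in Python (deliberately not XProc, since the XProc 3.0 steps are still in flux). It opens an ePUB as a ZIP archive and produces a JSON inventory of its entries. The function name `epub_inventory`, the `MEDIA_TYPES` table and the shape of the JSON are assumptions made purely for illustration; they are not part of the workflow specified in this paper, and a real pipeline would read the media types from the ePUB's package document rather than guess them from file extensions.

```python
import io
import json
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-media-type table, for illustration only.
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".txt": "text/plain",
}


def epub_inventory(epub_bytes: bytes) -> str:
    """Return a JSON inventory of the files inside an ePUB (ZIP) archive."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content of their own
            suffix = PurePosixPath(info.filename).suffix.lower()
            entries.append({
                "name": info.filename,
                # Unknown extensions fall back to a generic binary type.
                "media-type": MEDIA_TYPES.get(suffix, "application/octet-stream"),
                "size": info.file_size,
            })
    counts = Counter(entry["media-type"] for entry in entries)
    return json.dumps({"entries": entries, "by-media-type": dict(counts)}, indent=2)
```

Calling `epub_inventory(open(path, "rb").read())` for each file in the folder would yield one JSON document per ePUB. The point of the sketch is only that a single archive already mixes ZIP, XHTML, JPEG, plain text and JSON, which is exactly the mixture an XML-only document model cannot pass between steps.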

This paper is divided into four parts, the third part being the central one:

A short caveat before we start: At the time of writing this article (May 2018), XProc 3.0 is still work in progress. While the document model can surely be seen as stable, the steps mentioned later are probably not. Though their basic outline is very unlikely to change, details may change as the discussion goes along. In particular, the signatures of the newly introduced steps and the dynamic errors raised by these steps might be subject to change in the process of standardising XProc 3.0. As a result, this paper is clearly not suitable as a tutorial on XProc 3.0, but is intended as a first look at the new possibilities in this language. Before you start to develop your own pipelines, please see http://xproc.org for the latest version of XProc 3.0.



[16] XProc 3.0 is currently being developed by the XProc Next Community Group (https://www.w3.org/community/xproc-next/). The most recent version of the editor's draft is published at https://spec.xproc.org.