Introduction



XProc 1.0 [1] is clearly an XML-centric language for designing workflows; in fact, it is mostly an XML-only workflow language. To quote the XProc 1.0 specification:

Although some steps can read and write non-XML resources, what flows between steps through input ports and output ports are exclusively XML documents or sequences of XML documents.

This XML-only approach proved to be fine for many tasks, but as it turns out, even workflows dealing mostly with XML documents need to handle non-XML data as well. Just think of an ePUB: it mostly contains XHTML documents, but it also has some JPEG illustrations and a manifest file which is pure text, and it is itself essentially a special kind of ZIP archive.

As such workflows show up quite often in real life, the ability to deal with non-XML documents was a high-priority requirement when developing the next version of XProc, which is called "XProc 3.0".^[16]

In this paper I would like to give an introduction to workflows for non-XML documents in XProc 3.0. To do this as practically as possible, I decided to lay out a typical workflow that has to deal with non-XML documents and to show how this could be done in XProc 3.0. Of course, the workflow is somewhat made up, because it was chosen for the purpose of demonstration. But it shows some basic structures for dealing with non-XML documents in XProc 3.0 and can serve as a blueprint for real-life projects.

The workflow discussed here is this: We have a bunch of ePUBs in a folder somewhere, and we have been asked to design a workflow which analyses the content of each ePUB and creates an RDF metadata description and an inventory in JSON. The inventory has to be sent to one of our inventory servers which, for whatever reason, happens to understand only JSON. I will explain the details of this workflow later, but please keep in mind that it involves dealing with a lot of non-XML documents such as ZIP (the ePUB itself), plain text, graphics in JPEG, RDF and, last but not least, JSON.
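To make the shape of the inventory task concrete before turning to XProc, here is a minimal sketch in Python. It is not XProc and not the solution developed in this paper; it only illustrates the kind of data involved. The function names, the JSON layout and the small media-type table are assumptions made for this illustration, and the sample ePUB is a tiny in-memory stand-in, not a valid publication.

```python
import io
import json
import posixpath
import zipfile

# Hypothetical media-type table for this illustration only; a real
# workflow would need a more complete mapping.
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".txt": "text/plain",
}


def epub_inventory(epub_bytes: bytes) -> str:
    """Return a JSON inventory of the files inside an ePUB (a ZIP archive)."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            extension = posixpath.splitext(info.filename)[1].lower()
            entries.append({
                "name": info.filename,
                "size": info.file_size,
                # Fall back to a generic binary type for unknown extensions.
                "media-type": MEDIA_TYPES.get(extension, "application/octet-stream"),
            })
    return json.dumps({"entries": entries}, indent=2)


def make_sample_epub() -> bytes:
    """Build a tiny ZIP in memory that stands in for an ePUB."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("OEBPS/chapter1.xhtml",
                         "<html xmlns='http://www.w3.org/1999/xhtml'/>")
        archive.writestr("OEBPS/images/cover.jpg", b"\xff\xd8\xff")
    return buffer.getvalue()


if __name__ == "__main__":
    print(epub_inventory(make_sample_epub()))
```

Even this small sketch touches a ZIP container, XHTML, a JPEG and JSON, which is exactly the mix of XML and non-XML documents the workflow has to handle.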

This paper is divided into four parts, the third part being the central one:

1. We will start with a short reminder of how one could deal with non-XML documents in XProc 1.0. As I said before, XProc 1.0 is clearly an XML-centric language, but there are some possibilities to deal with non-XML documents. A short recap will hopefully help to understand the new features of XProc 3.0.
2. In order to cope with non-XML documents, some fundamental changes had to be made in the transition from XProc 1.0 to the new XProc 3.0. The most important point here was to change the concept of a document that flows from one step to another. While in XProc 1.0 a document is always a well-formed XML document, for XProc 3.0 we needed a new document model which is able to cover both XML and non-XML documents. In the second section of this paper we will take a short look at this new concept of a document as a foundation for understanding how non-XML workflows can be created in XProc 3.0.
3. The core of this paper is the third section, where we will discuss in some detail how to design the sketched workflow for ePUBs. Here you will see the new document model in action and get to know some of the new XProc steps, which take advantage of this model and enable you to write workflows for non-XML documents.
4. In the last section you will find a short summary and some conclusions concerning the suitability of XProc 3.0 for non-XML workflows.

A short caveat before we start: At the time of writing this article (May 2018), XProc 3.0 is still a work in progress. While the document model can surely be seen as stable, the steps mentioned later are probably not. Though their basic outline is very unlikely to change, details may change as the discussion goes along. In particular, the signatures of the newly introduced steps and the dynamic errors raised by these steps might be subject to change in the process of standardising XProc 3.0. As a result, this paper is clearly not suitable as a tutorial on XProc 3.0, but is intended as a first look at the new possibilities in this language. Before you start to develop your own pipelines, please see http://xproc.org for the latest version of XProc 3.0.

^[16] XProc 3.0 is currently being developed by the XProc Next Community Group (https://www.w3.org/community/xproc-next/). The most recent version of the editor's draft is published at https://spec.xproc.org.