XProc 3.0's new concept of a document

Before we look at the new document model introduced with XProc 3.0 in detail, let us first look at the challenges the XProc Working Group had to face in order to allow processing of non-XML documents. In XProc 1.0 everything that could possibly flow from the output port of one step to the input port of another step, was a well-formed XML document. And every step defined in the step library, if it has an input port at all, it has an XML input port. This situation changes dramatically, once it was decided, that not only XML-documents could flow, but a wide variety of documents of different flavors. Then we have to differentiate between steps which could handle a certain kind of document, and steps, that have to raise an error because they could not possible do something useful with the document kind in question: Adding an attribute with a certain name and a certain value clearly only makes sense, when the document in question is an XML document. We do not even know what it could possibly mean to add an attribute to a text document or an image. Of course getting the image dimensions only makes sense of an image, but not for a text document or an XML document.

The other side of the same coin is a situation, when a pipeline author by accident exposes an image document to a step, that expects an XML document. Parsing an image document with an XML parser will most certainly lead into an error and the pipeline breaks. To be able to debug this kind of situation it is most useful not to get the information, that something is wrong with the XML document, as the parser will tell us, but to state, that a non-XML document appeared somewhere, where only an XML document is allowed. Thirdly, from the perspective of an XProc processor, it is important to know the kind of document it is dealing with, so it can choose an internal data model, that is suitable for the document type in question and for the operations defined for documents of this type.

The practical upshot of this reasoning is, that we need two things: We need to know what type a specific document has and we need a label for XProc steps, to say what kind(s) of documents they can deal with or allow on their specific input ports.

In order to cope with these two requirements, the XProc Working Group had to develop a completely new understanding of what flows on an XProc pipeline from step to step. What flows is still called "a document", but a document is now a pair consisting of a representation and the document properties. The representation is a processor specific data structure which is used to refer to the actual document content. The document properties are pairs of keys and values containing metadata of the content. The type or kind of the document flowing between the steps is the most important metadata and it is associated with the key content-type. XProc 3.0 uses the well known media type notation like application/xml, text/plain, image/jpeg and so on, to distinguish different types of documents.

Steps in XProc 3.0 now also use the media type notation to declare, which kind of documents are expected on a specific port. If a document that matches the step's specification arrives, everything is fine: The step can perform the expected operation and new documents (pairs of representation and document-properties) are produced on the step's output ports. But if an incoming document does not match the content-types expected by a step on a specific port, an specific error is raised by the processor, telling e.g. that step add-attribute expects a document with one of the media types application/xml, text/xml or application/*+xml but a document with content-type text/plain was found. Which content-types are expected on which ports is of course determined by the inner logic of the step: Steps like p:identity can obviously deal with any kind of document because it only passes through the documents appearing on this input port. The same is true for p:count which counts the number of documents on an input port and does not have to know, what kind of documents they are. But most steps known from the XProc 1.0 step library typically require a document with an XML media type to appear on its input ports, because adding an attribute, replacing an element, renaming a namespace etc. only makes sense for XML documents.

The new concept of a “document” in XProc 3.0 consisting of a representation of a content of a certain type and the document properties as its metadata as we will see nicely solves the problem of opening up the well known XProc conception to non-XML documents. Regardless of the type of the documents, they are produced by a step and exposed on one of its output ports. The XProc processor then sends them to another step's input ports and will raise an error, if the document's type does not belong to the types of documents the step in question is able to deal with. The processor is able to do this, because the document properties flow with the documents between the ports and so a step knows what type of document is coming in.

But what about the pipeline author, how is she able to access the document properties? You can easily imagine situations where a pipeline author might want to make a decision based on the type of the document, for example because the output of a certain step should be sent to step A if it is an image, but to step B if it is of some other type. To make this kind of processing possible, the document properties of a document are exposed to the pipeline author via a bunch of XPath functions making mostly use of the map type introduced with XPath 3.1 [2]. More precisely the document properties are represented as map(xs:QName, item()) and you can access them as a map by using the XProc extension function p:document-properties($doc). Most of the time pipeline authors will not be interested in the full map, but want to retrieve a specific value. This can be done using p:document-property($doc, $key). And finally there is a function called p:document-properties-document($doc) which returns an XML document for the document properties of the document in question. In this document each key of the map becomes an element and the value of the key becomes the element's values. In this way pipeline authors can retrieve and evaluate the document properties of a document using just familiar XPath expressions and do not need to use the new expressions introduced for maps and arrays.

Now that we have seen how the document properties of a document in XProc 3.0 are accessed the question may come up, how to control or set the document properties in a pipeline. Typically in most cases you do not have to do this, because the XProc processor is responsible for this: If a step declares the documents on an output port to be, say of type text/plain, the document properties of the resulting documents are set by the processor accordingly. But sometimes obviously pipeline author need to control the document properties themselves, e.g. when you are loading a document and do not trust your file system to get the mime type of the document right. Another use case for setting document properties in a pipeline is when you create a document inline and want to tell the processor explicitly which document type the document has to have. For those cases XProc 3.0 provides an additional option named document-properties which takes a map with document's metadata as its values. Finally, if you need to add additional metadata to a document created by one step before it goes into another step, there is a new step called p:p:set-properties, which can be used to overwrite existing metadata or add additional data to the document properties.

Having talked about the new concept of a document being a pair of a representation and its document properties and having discussed the document properties to some extent, the next thing we have to cover is the representation part of the document pair. As said above, the representation is a data structure used by the processor to refer to the document's content. Which representation a processor has to use in order to deal with the content of the document is defined based on the document's media type. The current version of the specs (May-2018) calls out four different types of representations: Obviously there are XML documents identified by an XML media type, which is “application/xml text/xml application/*+xml”. Secondly the specs mention text documents which are identified by media type “text/*”. Thirdly there are JSON documents which have media type “application/json” and forth there are the so called binary documents which are identified by any media type not mentioned yet. Implementers are obviously free to implement additional document types identified by media types, as long as they do not conflict with the ones mentioned before.

As we learned before, the XProc 3.0 specification defines a representation for documents of a specific media type, where "representation" means a data structure used by an XProc processor to refer to the actual document content. And this brings us to another important change, the XProc working group decided to make for XProc 3.0, because they say in the specs:

Representations of XML documents are instances of the XQuery 1.0 and XPath 2.0 Data Model (XDM).

Therefore XProc 3.0 uses the same data model as other XML related technologies like XQuery or XSLT. And this is a strong change in the concept of what an XML document is. In XProc 1.0 the concept of a well-formed XML document was used, i.e. every document had exactly one element node, optionally preceded and/or followed by comment nodes, processing instructions and all whitespace text nodes. Well-formed XML documents are fine, but the stipulation, that every XML document has to be well-formed all the times is very burdensome when you try to make even very slight modifications. One thing I ran in a lot of times for one specific project, was the problem to add a processing-instruction as first node of the preamble, so the browser could recognize the XForms in the produced documents. This might sound like an easy task for someone familiar with XQuery or XSLT, but in XProc 1.0 it was tricky. The natural choice is of course p:insert where one matches the root element of the document, and tell the processor to insert the processing instruction before the root element. But you can not do this, because p:insert inserts documents into other documents, not nodes. So you can not insert a processing instruction, but have to insert either a processing instruction followed by a dummy element node or wrap the processing instruction into an element node in order to fulfill the “well-formed documents only” rule. But obviously you can not insert this document before the current element node of the document, because a well-formed document can not have two top level elements. So here is one way of doing this:

<p:wrap-sequence wrapper="dummy" />
<p:insert match="/dummy" position="first-child">
  <p:input port="insertion">
    <p:inline>
      <dummy2><?pi target?></dummy2>
    </p:inline>
  </p:input>
</p:insert>
<p:unwrap match="/dummy | /dummy/dummy2"/>

But in XProc 3.0 where an XML document is to be implemented according the XDM [3] concept, one can do it in the most natural way:

<p:insert match="/" position="first-child">
  <p:with-input port="insertion">
    <?pi target?>
  </with-input>
<p:insertion>

The second document type, which is newly introduced with XProc 3.0, is a text document. Text documents are characterized by a media type that matches the scheme text/*, with the exception of text/xml, which is an XML document. Constructing a new text document is as easy as constructing an XML document:

<p:identity>
  <p:with-input>
    <p:inline content-type="text/plain"
      >This is a new text document</inline>
  </p:with-input>
</p:identity>

The XProc processor will produce a new text document on the output port of p:identity. This document will consist of a document node with just one text node child which holds the text. Doing the representation of text documents in this way has the obvious advantage that it fits perfectly with the use of XPath as an expression language in XProc. Suppose you have a sequence of text documents and for some reason you want to treat text document differently whose second word is “is”. Since the text documents in XProc are a special kind of a document as defined in XDM, you can do it as easy as this:

<p:choose>
  <p:when test="tokenize(.,'/s')[2]='is'">
    <!-- ... -->
  </p:when>
  <!-- ... -->
 </p:choose>

To represent text documents as a text node wrapped into a document node also allows us to use them in p:wrap-sequence to wrap an element node around the text and thereby produce an XML document. Of course you can also use p:insert to insert the text node of a text document as a child of an already existing element node. And finally, if you select from an XML document and the resulting nodes are all text nodes, the XProc processor will create a new text document for you. We will see more applications for text documents when we come to discuss our example workflow in more detail, but for now we can record, that text documents in XProc 3.0 are pretty well integrated into the XML universe.

Next up is JSON. Integrating JSON into the XML world was a high priority during the last years and great work has been done to archieve this goal with the new standards of XDM 3.1 and XPath 3.1. If you take a look at XSLT 3.0 you will find that working with JSON feels almost as natural as working with XML. Based on the cited works, JSON is now also integrated into XProc 3.0. As you might expect from the preceding discussion, it is called a “JSON document”. The document properties of a JSON document have a content-type entry that contains a JSON media type like “application/json”. As JSON is a text based format, you can easily construct JSON documents within XProc pipelines:

<p:identity>
   <p:with-input>
     <p:inline content-type="application/json"
       >{"topic":["XProc", "3.0"]}<p:inline>
   </p:with-input>
 </p:identity>

As you might expect, if you are familiar with the treatment of JSON in XPath, the XProc processor will use the function fn:parse-json() on the string supplied and produce an XDM representation of this JSON document. In the given case it will obviously be a map item with one entry mapping a string to an array item containing two strings.

Now this representation is perfectly in accordance with what you might expect if you come from an XPath, XQuery or XSLT background, however it does not quite fit with the XProc concept of documents flowing between input and output ports. And this is because the representation of the JSON document is not an instance of an XDM node, but a map item (or an array item or an atomic value in other cases). And neither a map item nor an array or an atomic value can be the child of a document node per XDM. If you recall the definition of an XProc document you can now understand, why it is not defined as a pair of document properties and a node (which is true for XML and text documents), but as a pair of document properties and a representation. The representation for some document types might be an XDM node, but as for JSON documents it is not.

Let us take a closer look how this concept of a document fits into XProc and XPath. First of all, our JSON document produced on p:identity flows out of the step on an output port which is typically connected to the input port of some other step. As said above, what happens then depends on whether the receiving step accepts a JSON document on the respective input port. For example if it is a p:store the document will be written to some destination as you might expect. But if the receiving step is e.g. a p:add-attribute a dynamic error will occur, because a JSON document is not allowed on the input port of this step. But this is nothing special for JSON documents but applies to all documents in XProc. If, for example, an XML document appears on an input port that only allows JSON documents to flow in, a dynamic error is raised too.

As you can see, JSON documents are first class citizens in XProc 3.0 when it comes to the question of what can flow between steps. But if you are familiar with XProc, you might recall, that documents do not only flow between steps, but can also appear as context items when it comes to evaluating XPath expressions. Here JSON documents do not fit quite as well, because their content is not represented as something which is an XDM node and therefore an XPath expression like "/" can not expose the content to XPath. For "/" the XProc processor is required to construct an empty document node, so p:document-properties('/') will return the document properties of JSON documents as well. To overcome this problem is obviously very easy if you imagine an XProc defined XPath function, which takes the document node associated with the JSON document as a parameter and returns the same representation of the JSON document as XPath's fn:json-doc() would. As of May 2018 you will not find such a function in the specifications for XProc 3.0, but I am pretty sure the community group now taking care of XProc's development will find some way to bridge the gap between JSON documents and XPath expressions.

Finally XProc 3.0 defines a fourth document type called “binary document”. A binary document is actually anything which is not either an XML document, or a text document or a JSON document, or, more precisely which has a media type not associated with these three document types. This document type sums up such different kinds of data as ZIP-archives, all kinds of images, PDF-documents and every thing else we have on our file systems or receive from web services. As for JSON documents the XProc processor is required to construct an empty document node, so p:document-properties('/') will return the document properties associated with this document. How a binary document is represented internally by an XProc processor is implementation defined. And it is obvious that not all binary documents will be internally represented in the same way by an advanced XProc processor: Smaller documents will probably be held in memory for fast access, but if you think about very large documents (as a video or an audio file), some optimization will be necessary. One strategy is to store those files away in a temporary folder and let just references to these files flow between the steps. Only if a step actually needs to access the document's content, the file or parts of it are loaded by the XProc processor.

Because the representation of binary documents has to be implementation defined, XProc 3.0 currently defines no way to access the document's content within an XPath expression. One can easily imagine an XProc defined XPath function returning the document's content as xs:base64Binary or as xs:hexBinary. But the main problem here is, that in most cases you do not want the whole document content, but are only interested in a smart portion of it. For this reason an implementation returning the whole content, which may be very large, and then use XPath expressions to identify the small range the pipeline author is really interested in, would be very inefficient. This problem is not impossible to solve, but the XProc community has not agreed on a solution yet. One way to solve it would to determine the content's size, either as part of the metadata in the document-properties or as a function taking the binary document as argument. This might be complemented by an XProc defined XPath function which allows to select a part of the document's content. The specification of the EXPath binary module could certainly be a role model for solving this problem.[Kosek:Lumley:Binary:Module:1.0]

Together with the new document model, XProc 3.0 introduces a new step to convert or cast the different document types into each other: p:cast-content-type. This step takes an arbitrary document on its input port and the content-type this document should be casted to and returns a casted document on the output port (or throws an error if the XProc processor is not able to perform the requested casting). This abstract characterization is necessary, because this step is a kind of “Jack of all trades” of document processing in XProc. The easiest task this step can perform, is to cast from one XML media type to another, say from “application/xslt+xml” to “application/xml” or vice versa. Here the actual document representation does not need to be changed in any way, just the value of key content-type in the document properties needs to be changed. Casting from a non-XML document type to an XML document type will produce an XML document by wrapping the representation of the non-XML document into a c:data-element. This type of casting is well known from XProc 1.0, where the element p:data on an input port was responsible for converting non-XML to XML.

The step p:cast-content-type can also perform the opposite casting from an XML document with a c:data-element with encoded data as a child to the respective document type. All other conversions between media types are currently implementation defined. In this area some more work needs to be done, for example when it comes to cast a JSON document to an XML document. The current version (May-2018) of the XProc 3.0 specification defines that is has to result in a c:data-document with a base64-encoded representation of the JSON content. In XPath and XQuery Functions and Operators 3.1[2] we find a mapping from JSON to XML which is used in the two functions fn:json-to-xml() and fn:xml-to-json(). Making use of this mapping when it comes to casting a JSON document to an XML document and vice versa in XProc 3.0 is certainly an idea that should be discussed. In this line of thought we might also have a mapping from text document with media type “text/csv” to an XML document and vice versa. But some of the possible casting tasks to be performed by p:cast-content-type could definitively be scary. Let me just mention the case when an XML document with media type “image/svg+xml” should be casted to “image/jpeg”.

This much on the new concepts (and steps) introduced by XProc 3.0 to escape the XML-only limitation and to allow the design of XProc workflows for non-XML documents. Let us now come back to the use case shortly introduced at the beginning of this paper and discover the practical aspects of non-XML workflows in XProc 3.0.