Before we look at the new document model introduced with XProc 3.0 in detail, let us first consider the challenges the XProc Working Group had to face in order to allow the processing of non-XML documents. In XProc 1.0, everything that could possibly flow from the output port of one step to the input port of another step was a well-formed XML document. And every step defined in the step library that has an input port at all has an XML input port. This situation changed dramatically once it was decided that not only XML documents could flow, but a wide variety of documents of different flavors. Now we have to differentiate between steps that can handle a certain kind of document and steps that have to raise an error because they cannot possibly do anything useful with the document kind in question: Adding an attribute with a certain name and a certain value clearly only makes sense when the document in question is an XML document. We do not even know what it could possibly mean to add an attribute to a text document or an image. Conversely, getting the image dimensions only makes sense for an image, but not for a text document or an XML document.
The other side of the same coin is a situation where a pipeline author accidentally exposes an image document to a step that expects an XML document. Parsing an image document with an XML parser will most certainly lead to an error, and the pipeline breaks. To be able to debug this kind of situation, it is far more useful not to be told that something is wrong with the XML document, as the parser will report, but to be told that a non-XML document appeared somewhere only an XML document is allowed. Thirdly, from the perspective of an XProc processor, it is important to know the kind of document it is dealing with, so it can choose an internal data model that is suitable for the document type in question and for the operations defined for documents of this type.
The practical upshot of this reasoning is that we need two things: We need to know what type a specific document has, and we need a label for XProc steps to say what kind(s) of documents they can deal with or allow on their specific input ports.
In order to cope with these two requirements, the XProc Working Group had to develop a completely new understanding of what flows in an XProc pipeline from step to step. What flows is still called “a document”, but a document is now a pair consisting of a representation and the document properties. The representation is a processor-specific data structure which is used to refer to the actual document content. The document properties are pairs of keys and values containing metadata about the content. The type or kind of the document flowing between the steps is the most important piece of metadata, and it is associated with the key content-type. XProc 3.0 uses the well-known media type notation like application/xml, text/plain, image/jpeg and so on to distinguish different types of documents.
Steps in XProc 3.0 now also use the media type notation to declare which kinds of documents are expected on a specific port. If a document that matches the step's specification arrives, everything is fine: The step can perform the expected operation, and new documents (pairs of representation and document properties) are produced on the step's output ports. But if an incoming document does not match the content types expected by a step on a specific port, a specific error is raised by the processor, telling us e.g. that step p:add-attribute expects a document with one of the media types application/xml, text/xml or application/*+xml, but a document with content-type text/plain was found. Which content types are expected on which ports is of course determined by the inner logic of the step: A step like p:identity can obviously deal with any kind of document because it only passes through the documents appearing on its input port. The same is true for p:count, which counts the number of documents on an input port and does not have to know what kind of documents they are. But most steps known from the XProc 1.0 step library typically require a document with an XML media type to appear on their input ports, because adding an attribute, replacing an element, renaming a namespace etc. only makes sense for XML documents.
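To make this concrete, here is a minimal sketch of such a mismatch, assuming an XProc 3.0 processor per the May-2018 draft (the attribute values are made up for illustration, and the exact error message wording is up to the implementation):

```xml
<p:add-attribute match="/*" attribute-name="status" attribute-value="done">
  <p:with-input port="source">
    <!-- a text document arrives where only XML media types are allowed -->
    <p:inline content-type="text/plain">Just some text</p:inline>
  </p:with-input>
</p:add-attribute>
```

Instead of a confusing XML parser failure, the processor can now report directly that a document with content-type text/plain appeared on a port that only accepts XML media types.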
As we will see, the new concept of a “document” in XProc 3.0, consisting of a representation of content of a certain type and the document properties as its metadata, nicely solves the problem of opening up the well-known XProc conception to non-XML documents. Regardless of their type, documents are produced by a step and exposed on one of its output ports. The XProc processor then sends them to another step's input ports and raises an error if the document's type does not belong to the types of documents the step in question is able to deal with. The processor is able to do this because the document properties flow with the documents between the ports, so a step knows what type of document is coming in.
But what about the pipeline author: How is she able to access the document properties? You can easily imagine situations where a pipeline author might want to make a decision based on the type of the document, for example because the output of a certain step should be sent to step A if it is an image, but to step B if it is of some other type. To make this kind of processing possible, the document properties of a document are exposed to the pipeline author via a collection of XPath functions, mostly making use of the map type introduced with XPath 3.1 [2]. More precisely, the document properties are represented as map(xs:QName, item()), and you can access them as a map by using the XProc extension function p:document-properties($doc). Most of the time pipeline authors will not be interested in the full map but want to retrieve a specific value. This can be done using p:document-property($doc, $key). And finally there is a function called p:document-properties-document($doc) which returns an XML document for the document properties of the document in question. In this document each key of the map becomes an element, and the value of the key becomes the element's value. In this way pipeline authors can retrieve and evaluate the document properties of a document using just familiar XPath expressions and do not need to use the new expressions introduced for maps and arrays.
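The image-versus-other routing just described can be sketched with p:document-property. This is a sketch only: the step names ex:step-A and ex:step-B are placeholders for whatever processing the pipeline author has in mind, and the syntax follows the May-2018 draft:

```xml
<p:choose>
  <p:when test="starts-with(p:document-property(., 'content-type'), 'image/')">
    <!-- image documents go to step A -->
    <ex:step-A/>
  </p:when>
  <p:otherwise>
    <!-- all other document types go to step B -->
    <ex:step-B/>
  </p:otherwise>
</p:choose>
```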
Now that we have seen how the document properties of a document in XProc 3.0 are accessed, the question may come up how to control or set the document properties in a pipeline. In most cases you do not have to do this, because the XProc processor is responsible for it: If a step declares the documents on an output port to be, say, of type text/plain, the document properties of the resulting documents are set by the processor accordingly. But sometimes pipeline authors obviously need to control the document properties themselves, e.g. when you are loading a document and do not trust your file system to get the mime type of the document right. Another use case for setting document properties in a pipeline is when you create a document inline and want to tell the processor explicitly which document type the document has to have. For those cases XProc 3.0 provides an additional option named document-properties which takes a map with the document's metadata as its value. Finally, if you need to add additional metadata to a document created by one step before it goes into another step, there is a new step called p:set-properties, which can be used to overwrite existing metadata or add additional data to the document properties.
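The last case could be sketched like this, assuming the step and option names of the May-2018 draft (the base-uri key and its value are made up for illustration):

```xml
<!-- overwrite or add entries in the incoming document's properties;
     entries not mentioned in the map are left untouched -->
<p:set-properties properties="map{ 'base-uri' : 'http://example.com/out.xml' }"/>
```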
Having talked about the new concept of a document being a pair of a representation and its document properties, and having discussed the document properties to some extent, the next thing we have to cover is the representation part of the document pair. As said above, the representation is a data structure used by the processor to refer to the document's content. Which representation a processor has to use in order to deal with the content of the document is defined based on the document's media type. The current version of the specs (May-2018) calls out four different types of representations: Obviously there are XML documents, identified by an XML media type, i.e. “application/xml”, “text/xml” or “application/*+xml”. Secondly, the specs mention text documents, which are identified by media type “text/*”. Thirdly there are JSON documents, which have media type “application/json”, and fourthly there are the so-called binary documents, which are identified by any media type not mentioned yet. Implementers are obviously free to implement additional document types identified by media types, as long as they do not conflict with the ones mentioned before.
As we learned before, the XProc 3.0 specification defines a representation for documents of a specific media type, where “representation” means “a data structure used by an XProc processor to refer to the actual document content”. And this brings us to another important change the XProc Working Group decided to make for XProc 3.0, because they say in the specs:
Representations of XML documents are instances of the XQuery 1.0 and XPath 2.0 Data Model (XDM).
Therefore XProc 3.0 uses the same data model as other XML-related technologies like XQuery or XSLT. And this is a significant change in the concept of what an XML document is.
In XProc 1.0 the concept of a well-formed XML document was used, i.e. every document had exactly one element node, optionally preceded and/or followed by comment nodes, processing instructions and whitespace-only text nodes. Well-formed XML documents are fine, but the stipulation that every XML document has to be well-formed at all times is very burdensome when you try to make even very slight modifications. One problem I ran into a lot of times in one specific project was adding a processing instruction as the first node of the prolog, so the browser could recognize the XForms in the produced documents. This might sound like an easy task for someone familiar with XQuery or XSLT, but in XProc 1.0 it was tricky. The natural choice is of course p:insert, where one matches the root element of the document and tells the processor to insert the processing instruction before the root element. But you cannot do this, because p:insert inserts documents into other documents, not nodes.
So you cannot insert a processing instruction, but have to insert either a processing instruction followed by a dummy element node, or wrap the processing instruction into an element node in order to fulfill the “well-formed documents only” rule. But obviously you cannot insert this document before the current element node of the document, because a well-formed document cannot have two top-level elements. So here is one way of doing this:
<p:wrap-sequence wrapper="dummy"/>
<p:insert match="/dummy" position="first-child">
  <p:input port="insertion">
    <p:inline>
      <dummy2><?pi target?></dummy2>
    </p:inline>
  </p:input>
</p:insert>
<p:unwrap match="/dummy | /dummy/dummy2"/>
But in XProc 3.0, where an XML document is implemented according to the XDM [3] concept, one can do it in the most natural way:
<p:insert match="/" position="first-child">
  <p:with-input port="insertion">
    <?pi target?>
  </p:with-input>
</p:insert>
The second document type, newly introduced with XProc 3.0, is a text document. Text documents are characterized by a media type matching the pattern text/*, with the exception of text/xml, which denotes an XML document. Constructing a new text document is as easy as constructing an XML document:
<p:identity>
  <p:with-input>
    <p:inline content-type="text/plain">This is a new text document</p:inline>
  </p:with-input>
</p:identity>
The XProc processor will produce a new text document on the output port of p:identity. This document will consist of a document node with just one text node child which holds the text. Representing text documents in this way has the obvious advantage that it fits perfectly with the use of XPath as an expression language in XProc. Suppose you have a sequence of text documents and for some reason you want to treat differently those text documents whose second word is “is”. Since text documents in XProc are a special kind of document as defined in XDM, you can do it as easily as this:
<p:choose>
  <p:when test="tokenize(., '\s')[2] = 'is'">
    <!-- ... -->
  </p:when>
  <!-- ... -->
</p:choose>
Representing text documents as a text node wrapped in a document node also allows us to use them in p:wrap-sequence to wrap an element node around the text and thereby produce an XML document. Of course you can also use p:insert to insert the text node of a text document as a child of an already existing element node. And finally, if you select from an XML document and the resulting nodes are all text nodes, the XProc processor will create a new text document for you. We will see more applications for text documents when we come to discuss our example workflow in more detail, but for now we can note that text documents in XProc 3.0 are pretty well integrated into the XML universe.
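The first of these techniques can be sketched as follows (a sketch only; the wrapper name text and the inline content are arbitrary choices for illustration):

```xml
<p:identity>
  <p:with-input>
    <p:inline content-type="text/plain">Hello world</p:inline>
  </p:with-input>
</p:identity>
<!-- wraps the text document's single text node into an element,
     turning it into the XML document <text>Hello world</text> -->
<p:wrap-sequence wrapper="text"/>
```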
Next up is JSON. Integrating JSON into the XML world has been a high priority during the last years, and great work has been done to achieve this goal with the new standards XDM 3.1 and XPath 3.1. If you take a look at XSLT 3.0, you will find that working with JSON feels almost as natural as working with XML. Based on the cited works, JSON is now also integrated into XProc 3.0. As you might expect from the preceding discussion, it is called a “JSON document”. The document properties of a JSON document have a content-type entry that contains a JSON media type like “application/json”. As JSON is a text-based format, you can easily construct JSON documents within XProc pipelines:
<p:identity>
  <p:with-input>
    <p:inline content-type="application/json">{"topic": ["XProc", "3.0"]}</p:inline>
  </p:with-input>
</p:identity>
As you might expect if you are familiar with the treatment of JSON in XPath, the XProc processor will use the function fn:parse-json() on the supplied string and produce an XDM representation of this JSON document. In the given case it will obviously be a map item with one entry mapping a string to an array item containing two strings.
Now this representation is perfectly in accordance with what you might expect if you come from an XPath, XQuery or XSLT background; however, it does not quite fit with the XProc concept of documents flowing between input and output ports. This is because the representation of the JSON document is not an instance of an XDM node, but a map item (or an array item or an atomic value in other cases). And neither a map item nor an array nor an atomic value can be the child of a document node per XDM. If you recall the definition of an XProc document, you can now understand why it is not defined as a pair of document properties and a node (which would be true for XML and text documents), but as a pair of document properties and a representation. The representation for some document types might be an XDM node, but for JSON documents it is not.
Let us take a closer look at how this concept of a document fits into XProc and XPath. First of all, our JSON document produced on p:identity flows out of the step on an output port which is typically connected to the input port of some other step. As said above, what happens then depends on whether the receiving step accepts a JSON document on the respective input port. For example, if it is a p:store, the document will be written to some destination as you might expect. But if the receiving step is e.g. a p:add-attribute, a dynamic error will occur, because a JSON document is not allowed on the input port of this step. This is nothing special to JSON documents but applies to all documents in XProc. If, for example, an XML document appears on an input port that only allows JSON documents to flow in, a dynamic error is raised too.
As you can see, JSON documents are first-class citizens in XProc 3.0 when it comes to the question of what can flow between steps. But if you are familiar with XProc, you might recall that documents do not only flow between steps, but can also appear as context items when it comes to evaluating XPath expressions. Here JSON documents do not fit quite as well, because their content is not represented as an XDM node, and therefore an XPath expression like "/" cannot expose the content to XPath. For "/" the XProc processor is required to construct an empty document node, so p:document-properties('/') will return the document properties of JSON documents as well. Overcoming this problem would obviously be easy if you imagine an XProc-defined XPath function which takes the document node associated with the JSON document as a parameter and returns the same representation of the JSON document as XPath's fn:json-doc() would. As of May 2018 you will not find such a function in the specifications for XProc 3.0, but I am pretty sure the community group now taking care of XProc's development will find some way to bridge the gap between JSON documents and XPath expressions.
Finally, XProc 3.0 defines a fourth document type called “binary document”. A binary document is actually anything which is neither an XML document nor a text document nor a JSON document or, more precisely, anything which has a media type not associated with these three document types. This document type covers such different kinds of data as ZIP archives, all kinds of images, PDF documents and everything else we have on our file systems or receive from web services. As with JSON documents, the XProc processor is required to construct an empty document node, so p:document-properties('/') will return the document properties associated with this document. How a binary document is represented internally by an XProc processor is implementation-defined. And it is obvious that not all binary documents will be internally represented in the same way by an advanced XProc processor: Smaller documents will probably be held in memory for fast access, but if you think about very large documents (such as a video or an audio file), some optimization will be necessary. One strategy is to store those files away in a temporary folder and let just references to these files flow between the steps. Only if a step actually needs to access the document's content is the file, or parts of it, loaded by the XProc processor.
Because the representation of binary documents has to be implementation-defined, XProc 3.0 currently defines no way to access the document's content within an XPath expression. One can easily imagine an XProc-defined XPath function returning the document's content as xs:base64Binary or as xs:hexBinary. But the main problem here is that in most cases you do not want the whole document content, but are only interested in a small portion of it. For this reason, an implementation returning the whole content, which may be very large, and then using XPath expressions to identify the small range the pipeline author is really interested in would be very inefficient. This problem is not impossible to solve, but the XProc community has not agreed on a solution yet. One way to solve it would be to determine the content's size, either as part of the metadata in the document properties or via a function taking the binary document as its argument. This might be complemented by an XProc-defined XPath function which allows selecting a part of the document's content. The specification of the EXPath binary module [Kosek:Lumley:Binary:Module:1.0] could certainly be a role model for solving this problem.
Together with the new document model, XProc 3.0 introduces a new step to convert or cast the different document types into each other: p:cast-content-type. This step takes an arbitrary document on its input port and the content-type this document should be cast to, and returns a cast document on the output port (or throws an error if the XProc processor is not able to perform the requested casting). This abstract characterization is necessary because this step is a kind of “Jack of all trades” of document processing in XProc. The easiest task this step can perform is to cast from one XML media type to another, say from “application/xslt+xml” to “application/xml” or vice versa. Here the actual document representation does not need to be changed in any way; just the value of key content-type in the document properties needs to be changed. Casting from a non-XML document type to an XML document type will produce an XML document by wrapping the representation of the non-XML document into a c:data element. This type of casting is well known from XProc 1.0, where the element p:data on an input port was responsible for converting non-XML to XML.
The step p:cast-content-type can also perform the opposite casting, from an XML document with a c:data element with encoded data as a child to the respective document type. All other conversions between media types are currently implementation-defined. In this area some more work needs to be done, for example when it comes to casting a JSON document to an XML document. The current version (May-2018) of the XProc 3.0 specification defines that it has to result in a c:data document with a base64-encoded representation of the JSON content. In XPath and XQuery Functions and Operators 3.1 [2] we find a mapping from JSON to XML which is used in the two functions fn:json-to-xml() and fn:xml-to-json(). Making use of this mapping when it comes to casting a JSON document to an XML document and vice versa in XProc 3.0 is certainly an idea that should be discussed. In this line of thought we might also have a mapping from a text document with media type “text/csv” to an XML document and vice versa. But some of the possible casting tasks to be performed by p:cast-content-type could definitely be scary. Let me just mention the case where an XML document with media type “image/svg+xml” should be cast to “image/jpeg”.
So much for the new concepts (and steps) introduced by XProc 3.0 to escape the XML-only limitation and to allow the design of XProc workflows for non-XML documents. Let us now come back to the use case briefly introduced at the beginning of this paper and explore the practical aspects of non-XML workflows in XProc 3.0.