The non-XML workflow to be developed here was briefly introduced above, but now let us look into the details and see how we could realize it in XProc 3.0. As you might recall, the workflow deals with ePUBs stored somewhere on our file system. Our workflow should create some RDF metadata about each ePUB's content and create an inventory which has to be sent to a JSON-only web service. As said above, this workflow is a made-up story to explore the new possibilities of XProc 3.0. It is not a real-life project, but it could serve as a blueprint for one.
First let us sum up what kinds of non-XML documents are involved in our workflow: First of course we have ePUBs, which are essentially ZIP documents with a defined structure. Then we have to produce some metadata according to the RDF model, which might be represented as XML (as in RDF/XML or RDFa). But as we deal with non-XML document types, we will of course use one of the text-based serializations of RDF, namely Turtle. The source for our metadata generation will be the Dublin Core metadata expressed in the ePUB's root file. Our ePUB will typically also contain a lot of image files, and someone wants to know which images are used in which ePUB. So we will have to create an inventory of all the images (JPEG, PNG and GIF) in the ePUBs together with their width and height. This inventory has to be in JSON because we need to send it to an inventory server which only understands JSON. So we have quite a zoo of different non-XML document formats to deal with in our pipeline.
Let us start with the outermost step which has just the task of finding all ePUBs in a given folder:[17]
<p:declare-step version="3.0">
  <p:option name="epub-folder" as="xs:anyURI" required="true" />

  <p:directory-list path="{$epub-folder}"
    include-filter=".*\.epub" recursive="true" />

  <p:for-each>
    <p:with-input select="//c:file" />

    <epp:analyze-epub>
      <p:with-option name="href"
        select="concat($epub-folder, /c:file/@name)" />
    </epp:analyze-epub>
  </p:for-each>

</p:declare-step>
For those readers with little or no experience in XProc, let me just say that the step p:directory-list will produce a content listing of the directory specified by path, in our case containing only directory entries which match the regular expression given with include-filter. The step produces an XML document with a c:directory root element containing c:file or c:directory elements. Since we are only interested in (ePUB) files, the p:for-each will select all the respective elements. The treatment of the ePUB is actually done in the user-defined step epp:analyze-epub, which is called with the ePUB's absolute URI as its option value.
Readers familiar with XProc will discover some of the new features of XProc 3.0 in this example: We now have typed values for options and variables (expressed by the attribute as on the p:option declaration). While in XProc 1.0 all values of options or variables were either strings or untyped atomics, they can now have any value (including documents, nodes, maps and arrays), and the XProc processor has to make sure they only have a value of the declared type. The second point you might have discovered are the curly braces, and if you are familiar with XSLT you might have guessed that they are attribute value templates. If so, you are right. XProc 3.0 introduces attribute value templates and text value templates (as known from XSLT 3.0). AVTs prove to be very handy for writing shorter pipelines, because now you can use the attribute shortcut for options even if they contain XPath expressions. The long form with p:with-option will only be necessary if your XPath expression refers to the context item, because the context item for AVTs is undefined. This is why we have to use the explicit form when supplying the value for href on epp:analyze-epub.
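To make the distinction concrete, here is a minimal sketch (reusing the elements from the pipeline above) contrasting the attribute shortcut with the explicit p:with-option form:

<!-- Attribute shortcut with an AVT: fine, because the expression only
     refers to the variable $epub-folder, not to the context item. -->
<p:directory-list path="{$epub-folder}" />

<!-- Explicit form: needed as soon as the expression refers to the
     context item (here the c:file element flowing into the step). -->
<epp:analyze-epub>
  <p:with-option name="href"
    select="concat($epub-folder, /c:file/@name)" />
</epp:analyze-epub>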
And finally you may have discovered that p:directory-list has a new attribute (recursive), so you can decide whether you want a listing for the top-level directory only or for any contained directories as well.[18] This pipeline may show some interesting new features, but when it comes to non-XML documents it is quite boring. So let us move on and see how we can deal with non-XML documents.
The first non-XML format we have to deal with is of course ePUB, which is a special kind of ZIP archive. Uncompressing and compressing this kind of archive format is already essential to many traditional XProc workflows, because it is not only needed for ePUBs but is also the underlying format for “docx”. Within the framework of XProc 1.0, two different approaches have been developed to deal with archives: The step pxp:unzip, proposed by the EXProc community, allows you to extract one document, which will appear on the step's output port: If the document has an XML content type, the document itself appears, but for every other document a c:data document is returned which contains the base64-encoded representation of the selected ZIP entry's content. This approach is totally in line with XProc's basic concept, but obviously has a lot of limitations. The second approach to dealing with ZIP archives is represented by the step tr:unzip, which is part of the transpect framework developed by le-tex publishing services.[19] This step extracts a complete ZIP archive (or a single entry) to a specified destination folder in the file system. Here the XML-only limitation is circumvented by writing to the file system instead of exposing the base64-encoded content on a result port. But it obviously breaks away from the concept of documents flowing between steps on ports.
In XProc 3.0 we can now have the best of the two approaches: Thanks to the extension of the document model, XML and non-XML documents can now flow on the output ports of a new uncompress step, which might have the following signature:[20]
<p:declare-step type="xpc:uncompress"> <p:input port="source" content-types="*/*" /> <p:output port="manifest" sequence="true"/> <p:output port="result" content-types="*/*" sequence="true" primary="true"/> <p:option name="include-filter" as="xs:string+" /> <p:option name="exclude-filter" as="xs:string+" /> <p:option name="method" as="xs:token" select="'zip'" /> </p:declare-step>
The ZIP archive flows into this step on the port source, which intentionally accepts all content types. The first reason is that ZIP archives can appear with many different media types, some of which do not even contain the suffix “zip”. The second reason is that this step is designed as a kind of Swiss army knife for different kinds of archive formats. The documents contained in the archive flow out on the port result, which is the primary output port. For a typical archive this will be a sequence of different document types, where each document is a pair of a representation and its document properties. The options include-filter and exclude-filter can be used to control which entries from the archive appear on the output port. Like the options with the same names on XProc's standard step p:directory-list, they are interpreted as regular expressions used to match the names of the archive's entries. Unlike its predecessors, the step now makes use of XProc's new alignment with the XDM type universe: We can supply a sequence of regular expressions, and an archive entry is returned on the output port if its name is matched by at least one of the regular expressions in include-filter and by none of the regular expressions in exclude-filter. Obviously this gives you a very powerful mechanism to control which entries are extracted from the archive and which are not.
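For example, a call that extracts all images from an archive flowing in on the source port, while skipping anything below a (purely hypothetical) thumbnails folder, might look like this; keep in mind that the step's exact signature is still under discussion:

<!-- Sketch only: an entry is returned if its name matches at least one
     include pattern and none of the exclude patterns. -->
<xpc:uncompress>
  <p:with-option name="include-filter"
    select="('.*\.jpg', '.*\.png', '.*\.gif')" />
  <p:with-option name="exclude-filter"
    select="'.*/thumbnails/.*'" />
</xpc:uncompress>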
Now let us put this step into action in the workflow we have to design. We are not interested in all archive entries, but only in the image files (since we have to create an inventory of them) and the root file, since it contains the metadata we are after. For brevity I will skip the problem of identifying the ePUB's root file by inspecting the entry “META-INF/container.xml”. We will simply assume that the root file is found in a document named “package.opf”. Also for brevity, the following pipeline will read the ePUB twice, once to extract the root file and a second time to extract the graphic files. In a real-life project one would probably open the ePUB only once for efficiency reasons and then split the resulting sequence into the root file and the graphic files. Here is what our step epp:analyze-epub might look like:
<p:declare-step type="epp:analyze-epub"> <p:option name="href" as="xs:anyURI" required="true" /> <xpc:uncompress include-filter=".*/package\.opf"> <p:with-input href="{$href}" /> </xpc:uncompress> <epp:extract-metadata /> <xpc:uncompress> <p:with-input href="{$href}" /> <p:with-option name="include-filter" select="('.*\.jpg', '.*\.png', '.*\.gif')" /> </xpc:uncompress> <epp:create-inventory /> </p:declare-step>
If you look at this short pipeline, I think you will recognize how natural it now is to work with non-XML documents in XProc 3.0. We extract an XML document named “package.opf” from the ePUB and let it flow into the step epp:extract-metadata, and we extract the relevant graphic files from the ePUB and let a sequence of non-XML documents flow into the step epp:create-inventory.
Now let us turn to the conversion of the Dublin Core metadata contained in opf:metadata of the ePUB's root file into the Turtle serialization of an RDF graph. Looking at the ePUB's root file, we will find the metadata as the following example shows:
<opf:metadata xmlns:opf="http://www.idpf.org/2007/opf"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>urn:isbn:978-80-906259-2-1</dc:identifier>
  <dc:title>XML Prague 2017</dc:title>
  <dc:language>en</dc:language>
  <dc:creator>Jiří Kosek</dc:creator>
  <dc:date>2017</dc:date>
</opf:metadata>
From this format we need to generate a Turtle serialization which should look like this:
<urn:isbn:978-80-906259-2-1> dc:title "XML Prague 2017" ;
  dc:language "en" ;
  dc:creator "Jiří Kosek" ;
  dc:date "2017" .
I think the first intuition of many readers will be to use XSLT for this conversion. As an XProc author I would definitely agree with this intuition and write an XSLT stylesheet called by XProc's p:xslt to invoke the transformation. With XProc 3.0 this is possible because the text document created by XSLT is now a first-class citizen of a pipeline, and there is no longer any need to wrap it in an element node to make it an XML document.
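As a rough sketch (the stylesheet URI is made up, and the final XProc 3.0 signature of p:xslt is still being worked on), invoking such a transformation could be as simple as:

<!-- The XSLT 3.0 stylesheet uses <xsl:output method="text"/>; its result
     appears on the result port of p:xslt as a text document, which can
     flow on to the next step without any wrapping. -->
<p:xslt>
  <p:with-input port="stylesheet" href="opf-metadata-to-turtle.xsl" />
</p:xslt>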
But as this paper deals with non-XML documents in XProc 3.0, let us see how this could be done without invoking an XSLT transformation. The following fragment shows how one might do it:
<p:variable name="id" select="/opf:metadata/dc:identifier" /> <p:for-each> <p:with-input select="/opf:metadata/dc:*[not(name(.) = 'dc:identifier')]" /> <p:variable name="entry" as="document-node()" select="." /> <p:identity> <p:with-input> <p:inline content-type="text/turtle" >{$entry/*/name()} "{$entry/*/text()}"</p:inline> </p:with-input> </p:identity> </p:for-each> <xpc:aggregate-text separator=" ; 
" /> <xpc:add-text text="<{$id}> 
" position="before" /> <xpc:add-text text="." position="after" />
Here a text value template (known from XSLT) is used to create a text document for every element in the Dublin Core namespace. We have to use the variable entry (which holds a document node), because, as for AVTs, the context item is undefined for TVTs. What appears on the output port of p:for-each is a sequence of text documents, each containing the predicate and the object of a statement. The step xpc:aggregate-text then takes this sequence of text documents and creates one single text document. Between two adjacent text documents a semicolon and a line break are inserted. And finally the two invocations of xpc:add-text put the ePUB's identifier in front of the text and a full stop behind it, so that we have a valid Turtle statement.
As you can see, text-based formats like Turtle can very easily be created using the new features introduced with XProc 3.0. Currently the standard step library does not contain any steps dealing especially with text documents. The two steps used in the previous example are pretty good candidates, but I am not sure they will make it into the final library. The reason is that both steps can be written in XProc itself using the string functions provided by XPath. So it might be a question of principle whether to include such steps in the standard library, which would make pipeline authoring more convenient, or to ask authors to import their own XProc implementation into their pipelines every time they need this functionality.
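To illustrate the point, here is a minimal sketch of how xpc:add-text could be written in XProc itself; the option names and exact semantics are of course only assumptions on my part:

<p:declare-step type="xpc:add-text">
  <p:input port="source" content-types="text" />
  <p:output port="result" content-types="text" />
  <p:option name="text" as="xs:string" required="true" />
  <p:option name="position" as="xs:token" select="'before'" />

  <!-- Bind the incoming text document to a variable, because the
       context item is undefined inside a text value template. -->
  <p:variable name="content" select="string(.)" />

  <p:identity>
    <p:with-input>
      <p:inline content-type="text/plain"
        >{if ($position = 'before') then $text else ''}{$content}{if ($position = 'after') then $text else ''}</p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>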
When it comes to RDF itself as a theoretical concept, there is currently no support in XProc 3.0. Of course RDF/XML and RDFa are supported, as they are XML documents, and text-based serialization formats of a graph can be handled as we have just seen. But an RDF graph as a theoretical concept, as opposed to its various representations, is currently not one of XProc's document types. The XProc Next Community Group has mentioned RDF several times in its discussions at the various meetings, but has not really tackled the topic yet.[21] There might be good reasons to extend the current document model by RDF graphs. The RDF document type would be independent of any serialization form, and there could be steps to parse a serialization form into an abstract graph and to serialize the graph. Additionally we could have steps that add triples to a graph or remove a specific triple etc. Going further, one could wish for a step to validate the graph with SHACL or another step to query the graph with SPARQL. In his paper for XML Prague 2018, Hans-Jürgen Rennau [5] argues that XML and RDF are complementary concepts and that “an integration of RDF and XML technologies is a very promising goal.” Given our previous discussion, one might think of XProc as one of the places where this integration might take place. But as I said before, there is no decision on whether RDF graphs will become part of XProc's document model. And there might be doubts they will make it, because sometimes it's better to get things done than to get things perfect.
Let us go back to our workflow example. Having dealt with ZIP archives and ePUBs, text documents and Turtle, we now have to turn to the last open point of our workflow, which is to create a JSON inventory of the image files contained in the ePUB. Given that one of XProc's major use cases today is in publishing, the lack of support for images and image processing is surely striking. Pipeline authors had to step in here and write their own extension steps to do at least some rudimentary image processing.[22] But this is typically only an in-house solution, because you have to write these steps in the programming language the XProc processor is written in, and you have to make these steps known to the processor in some vendor-specific way. Pipelines using these steps are not interoperable with other XProc processors or other configurations of the same processor.[23]
As we saw above, XProc 3.0 changes this with the introduction of the new document model, which allows images to be loaded into a pipeline and to flow between steps. Currently there are no special steps dealing with images, but you can easily imagine steps that extract data from an image document or do some image processing, e.g. scaling. And finally it is now very easy to create an ePUB containing XML documents and images alike. The old workaround was to create an intermediate folder on the file system, store all XML documents in this folder, copy all the images there too, and then call a step to create an archive from the respective folder. With the new document model you will not need this workaround anymore, but can simply have a zipping step that takes all the (XML and non-XML) documents on its input port sequence and creates an archive from them on its output port.
As you might recall, the workflow we are designing involves some image processing, because we are required to create an inventory of all the image files contained in an ePUB, and this inventory has to contain the dimensions of the images as well. For this we need a step that takes an image document on its input port and produces an XML document containing the required information on its output port. The signature of such a step might look like this:
<p:declare-step type="xpc:image-profile"> <p:input port="source" content-types="image/jpeg image/png image/gif" /> <p:output port="result" /> </p:declare-step>
On the step's output port an XML document appears containing information about the image, which might look like this:
<c:image-profile>
  <c:image-property name="name" value="pic1.jpg" />
  <c:image-property name="mimetype" value="image/jpeg" />
  <c:image-property name="width" value="300" unit="px" />
  <c:image-property name="height" value="500" unit="px" />
  <!-- more properties to come here -->
</c:image-profile>
The XProc Next Community Group has not yet decided on the format of the resulting XML document. While for some applications the use of attributes seems to be convenient, other applications may prefer to have the properties as element names and the values as text children of these elements. There is no reason why such a step should not have an option allowing the pipeline author to choose between these and other possible formats. Actually, for the workflow discussed in this paper it would be very handy if the output were not restricted to different varieties of XML documents, but could also be a JSON document, as we have to send the graphics inventory to a JSON-only web service.
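Purely for illustration, the element-based variant mentioned above might look like this; neither format (nor these element names) has been settled on:

<c:image-profile>
  <c:name>pic1.jpg</c:name>
  <c:mimetype>image/jpeg</c:mimetype>
  <c:width unit="px">300</c:width>
  <c:height unit="px">500</c:height>
</c:image-profile>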
So we come to JSON as the last data format or document type we have to consider in our workflow. From what we have done so far, we could have an XML document containing all the c:image-profile documents for the image files of an ePUB.[24] And there are different ways to produce the lexical JSON we would like to send to a web service:
The first way to produce a JSON representation of this document has little to do with the newly introduced JSON document type in XProc, but uses text documents as a vehicle for lexical JSON. And of course it makes use of the XPath function fn:xml-to-json(), which takes an XML document in a specially designed vocabulary as an argument and returns a string conforming to the JSON grammar. Since we need a textual representation of the JSON document if we want to send it to a web service, the string result is fine for us. If we needed actual JSON, calling the function fn:parse-json() with the previous function call's result as a parameter would do the job. All we need to do to generate the JSON document is therefore (1) call an XSLT stylesheet that takes our source document and transforms it into the XML format expected by fn:xml-to-json() and (2) create the request document for p:http-request.
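To give an impression of that vocabulary (defined by XPath 3.1, with elements in the http://www.w3.org/2005/xpath-functions namespace), the entry for a single image might look like this:

<map xmlns="http://www.w3.org/2005/xpath-functions">
  <array key="pic1.jpg">
    <string>image/jpeg</string>
    <string>300</string>
    <string>500</string>
  </array>
</map>

Applying fn:xml-to-json() to this document yields the string {"pic1.jpg":["image/jpeg","300","500"]}.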
The second possible strategy to create a lexical JSON representation of our image inventory document is to create the text document directly with XSLT, without the intermediate step of creating an XML document in a format suitable for calling fn:xml-to-json(). This might be a plausible strategy too, but as XML-to-XML transformation with XSLT is an everyday job, one might be better off taking the first route and leaving the problem of transformation to JSON to the processor's built-in function.
Thirdly, we could create the lexical JSON document directly in XProc, as we did with Turtle in the example above. Lexical JSON and Turtle are both text-based formats, so using XProc 3.0's new text documents seems to be a practicable way.
Taking all these possibilities together, one might come up with the question whether the JSON document type introduced with XProc 3.0 has a meaningful purpose at all.[25] This impression is reinforced by the fact that the only step currently defined for JSON documents is p:load. One might expect that one processor or another may additionally support JSON documents in p:store (as non-XML serialization is an optional feature). As no other step is currently defined in XProc's standard library, one has either to rely on processor- or site-specific extension steps or (as I would expect) convert JSON to XML (fn:json-to-xml()) and back (fn:xml-to-json()).[26] This shortcoming may in part be due to the pending update of the step library. For example: XSLT 3.0 widened the concept of the “initial context node” to the new concept of the “initial match selection”, which includes not only a sequence of documents, but also a sequence of parentless items like (XDM) values and maps. This change in the underlying technology will most certainly be reflected in an updated signature of p:xslt.[27] And this updated signature might also allow JSON documents to flow into the input port of p:xslt. Along this line of thinking, p:xquery might be another step where JSON documents flow in (and out).
Another way to make JSON documents more useful to pipeline authors may be the introduction of JSON-specific steps into XProc 3.0's standard step library. Concerning our task of creating a JSON document from the sequence of XML documents with image information, a step that creates a JSON document containing a map and a step that joins a sequence of JSON documents with maps into one single JSON document would be helpful. Omitting the exact specification of such steps, a possible pipeline might look like this:
<p:for-each>
  <p:output port="json-info" content-type="application/json" />
  <p:variable name="props" select="[
      xs:string(//*[@name='mimetype']/@value),
      xs:string(//*[@name='width']/@value),
      xs:string(//*[@name='height']/@value) ]" />
  <xpc:json-document>
    <p:with-option name="value"
      select="map { xs:string(//*[@name='name']/@value) : $props }" />
  </xpc:json-document>
</p:for-each>

<xpc:aggregate-json-map />
The result here should be a JSON document containing a map, where in each map entry the key corresponds to the name of the image file and the value is an array containing the mime type, the width and the height as strings, in that order. Obviously this might be done in a more elegant way, but as we are concerned only with the basic concepts here, this example should be sufficient.
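For an ePUB with, say, two images, the aggregated result might therefore look like this (file names and values are of course made up):

{
  "pic1.jpg": ["image/jpeg", "300", "500"],
  "cover.png": ["image/png", "800", "1200"]
}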
Thinking along these lines, we might invent other JSON-specific steps which might be useful in pipelines. If we restrict ourselves to JSON documents that are maps, one might think about a step that adds one entry to the map:
<p:declare-step type="xpc:add-to-json-map"> <p:input port="source" content-types="application/json" /> <p:option name="key" as="xs:string" required="true" /> <p:option name="value" required="true" /> </p:declare-step>
And then of course we would also need a step to remove a key/value entry from a map e.g. by giving a key.
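A matching removal step, again purely hypothetical and mirroring the sketch above, might be declared like this:

<p:declare-step type="xpc:remove-from-json-map">
  <p:input port="source" content-types="application/json" />
  <p:output port="result" content-types="application/json" />
  <p:option name="key" as="xs:string" required="true" />
</p:declare-step>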
But the key problem with JSON documents, to me, is still their connection to XPath and XDM, which, as I said above, is currently missing in XProc 3.0: While we are able to express something like “remove the third child element of this node in this XML document” for XML documents, we are not able to say “remove the third key/value pair from the map in this JSON document”. We might invent some steps for JSON documents, but currently we are limited to mutations of the top-level map (or array), since we are not able to say something like “select entry four of the array that is associated with key a”.
The other problem, of course, is that JSON and JSON documents cannot be mapped to XDM instances on a 1:1 basis: A JSON document (at its “root”) may be a map, an array or an atomic value. Therefore the step xpc:add-to-json-map proposed above is quite naive, because it presupposes that the JSON document on the source port contains a map. But if the top-level object of this document is not a map, an error would most probably be raised. And this error could not be avoided by prior checking, because currently there is no way to ask whether the top-level object of a JSON document is a map or something else.
To sum up our discussion of JSON documents, I think it is fair to say that some more work has to be done to make them really useful to pipeline authors. This (preliminary) assessment is surely contrasted by the fact that we can already do a lot of useful things with JSON in XProc 3.0, because we now have maps and arrays for variables and options, and we have text documents for lexical JSON. Finally, with the XML representation of JSON introduced for XPath 3.1, we have a lossless way of representing JSON in XML; we can make use of all the XProc steps to manipulate the document and then put it back into lexical JSON in order to store it or send it to the web.
[17] To keep the following code readable, I will omit all namespace declarations.
[18] As of May 2018 you will not find this option in the specification, as we have not done much work on the standard step library yet. There was a request for this feature, which I think is very handy, but the name of the option and its exact behaviour cannot be taken for granted yet.
[19] See http://transpect.github.io/modules-unzip-extension.html.
[20] Again, the exact specification of this step is not formalized in the specification yet. It is pretty certain that such a step will be part of XProc 3.0's standard step library, but the exact signature (e.g. names and types of the options) and the step's exact behaviour are still under discussion.
[21] For XProc 1.0 there are some attempts to deal with RDF documents. The most prominent are, of course, the RDF extension steps for XMLCalabash.[4]
[22] See for example the steps image-identify and image-transform from le-tex's transpect framework.
[24] For brevity the XProc snippet is left out here: It is just a p:for-each iteration over the image documents delivered from uncompress, calling xpc:image-profile on each and doing a p:wrap-sequence on the sequence flowing out of the p:for-each.
[25] To avoid misunderstanding: We are talking about JSON documents as an XProc document type, not about maps and arrays as part of XDM. Having variables and options that can contain maps and arrays is useful without doubt. Replacing parameter ports with maps should count as a proof.
[26] See the above discussion on implementation-defined aspects of p:cast-content-type.
[27] To ensure compatibility with legacy pipelines or to allow the use of XSLT 2.0-only processors, one might also think of adding a new step for XSLT 3.0 transformation.