Improving quality-critical XML workflows with XProc 3.0 pipelines

Achim Berndzen

<xml-project />

Thorsten Rohm

Head of Content Architecture & Management
Thieme Compliance GmbH


Orchestrating complex XML pipelines has been a major topic of XML-related software development over the years. Comprehensive techniques have been developed to:

  1. Deliver high-quality results

  2. Ensure that the pipelines can be maintained

  3. Allow the pipelines to be debugged for straightforward troubleshooting

The quality demands for the workflow and the produced results can vary. For example, you may find a very maintainable pipeline producing documents with very low quality demands, e.g. the system producing the static website for your local sports club. On the other hand, you might find documents with very high quality demands produced by a pipeline that is not easy to maintain and debug. And, of course, the relationship between the quality of the documents and the maintainability of the pipeline producing them may change over time. Implementing new quality demands for the documents might have a negative impact on the pipeline's quality. Sometimes, in the course of developing a pipeline expected to produce documents with high quality demands, you might even decide to start over, because new quality demands for the documents threaten to impair the quality of your pipeline.

In this paper, we report on a joint project of our two companies. We had to add new features to a well-established workflow producing documents in the medical sector that come with very high quality demands. As the existing workflow already had some pain points, we decided to start over and refactor it. We even decided to change the basic orchestrating technology: since the existing workflow was based on a combination of Windows batch files calling different programs and some very elaborate XSLT stylesheets, we decided to use XProc 3.0 to orchestrate the workflow, thus doing away with as much shell scripting as possible while keeping the XSLT stylesheets to do the actual transformations.

As XProc 3.0 is a relatively new technology for orchestrating document workflows, we think our project might be of interest to people developing or maintaining pipelines for documents with high quality demands. We will first provide some background on the produced documents and their actual usage to explain the specific quality demands. This will be followed by an overview of the existing workflow and a discussion of its pain points and the new demands. We will then give an overview of the new XProc 3.0 pipeline developed in the project and discuss some aspects of the technology used. The paper[4] concludes with the lessons learned in our project and the key takeaways for the more general context of pipelines producing documents with high quality demands.

[4] We would like to thank the reviewers of our abstract for their very helpful comments. Special thanks go out to Geert Bormans, whose thoughtful remarks on the abstract helped to improve this paper significantly.

Table of Contents

Introduction and background
About Thieme Compliance GmbH and patient education leaflets
About <xml-project /> and XProc
Introduction to existing batches
Batch “fragengruppe_2_evidence”
Batch “fragengruppe_2_FHIR-Questionnaire”
Pain points of the existing batches
Lack of flexibility for inserting additional XSLT steps (in between)
No easy way to debug the intermediate results of each XSLT step
Too many tools means too many dependencies
New requirements for next version
Future-proof approach and improved maintainability by adding a separate orchestration layer
Increased quality through validation of XML sources using T0 XSD as well as validation of XML results using specific versions of T0 DTD
Increased quality by additional validation of XML results using Schematron
Summarised, formatted and easily comprehensible log files
Performance improvement by omitting unnecessary images from the Zip archive
Limiting processing to specific sources from the source folder
New system based on XProc 3.0
Smooth transition to XProc 3.0
MorganaXProc-IIIse worked well and could even be improved over the course of the project
Serialisation is now done by MorganaXProc and no longer by Saxon
Performance problems with FHIR XML schema
XProc pipeline optimisation by loading stylesheets only once at the beginning
Feature request for XProc: please add <p:validate-with-dtd>
