Rethinking document-to-document transformation

As a starting point, we introduce a simple example transformation.

Source document:

resources
. books
. . @lastUpdate?
. . book*
. . . @isbn
. . . title
. . . author*
. . . py?
            

Target document:

publications
. @updatedAt?
. publication*
. . @publicationYear?
. . isbn
. . title
. . creator*
. . . creatorRole
. . . creatorName
            

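The implementation of a transformation can use a push approach or a pull approach:

• Push - Processing is driven by the source document: the code reacts to source nodes as they are encountered.
• Pull - Processing is driven by the target document: the code follows the target structure and fetches the source data it needs.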

As a rule of thumb, the push approach is more appropriate when the structure of the target document depends strongly on the actual input data. This is typically the case with document-oriented XML, which consists mainly of human-readable text and is structured according to its actual content. The pull approach, on the other hand, tends to be much more straightforward with data-oriented XML, where the target document structure is determined mainly by the target document model rather than by an unpredictable text structure.
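For contrast with the pull code shown later, the push approach can be sketched in XQuery as a recursive function that dispatches on whatever source node it is handed. This is only an illustrative sketch: the function names local:push and local:push-children, the inline sample document and all its values are assumptions made here, and only part of the example transformation is covered.

```xquery
(: Push style: the output is produced as a reaction to source nodes. :)
declare function local:push($node as node()) as node()* {
  typeswitch ($node)
    case element(resources) return
      <publications>{ local:push-children($node) }</publications>
    case element(book) return
      <publication>{ local:push-children($node) }</publication>
    case element(title) return
      <title>{ string($node) }</title>
    case element(author) return
      <creator>
        <creatorRole>Author</creatorRole>
        <creatorName>{ string($node) }</creatorName>
      </creator>
    case element() return
      (: no rule of its own: keep walking the source tree :)
      local:push-children($node)
    default return ()
};

(: Hypothetical helper: pushes every child element in source order. :)
declare function local:push-children($node as node()) as node()* {
  for $child in $node/* return local:push($child)
};

(: Made-up sample input conforming to the source document model. :)
let $source :=
  <resources>
    <books>
      <book isbn="isbn-1">
        <title>First Title</title>
        <author>Author A</author>
        <py>2009</py>
      </book>
    </books>
  </resources>
return local:push($source)
```

In this sketch the order of the generated elements follows the order of the source nodes, not a target document model, which is why the approach suits content whose structure is not known in advance.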

In this discussion we focus on pull transformation, as we are interested in leveraging the information stored in the target document model. The following listing demonstrates the striking similarity between target document structure and code structure, which can be achieved using the pull approach.

Example 1. XQuery code implementing the example transformation as a pull transformation.

let $context := /* return

<publications>{
   let $value := $context/books/@lastUpdate return
   if (empty($value)) then () else
   attribute updatedAt {$value},
   for $context in $context/books/book return 
   <publication>{
      let $v := $context/py return
      if (empty($v)) then () else
      attribute publicationYear {$v},   
      <isbn>{$context/@isbn/string()}</isbn>,
      <title>{$context/title/string()}</title>,
      for $context in $context/author return  
      <creator>{
         <creatorRole>{'Author'}</creatorRole>,
         <creatorName>{$context/string()}</creatorName>
      }</creator>
   }</publication>
}</publications>
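The idiom for optional attributes, which Example 1 uses twice, can be tried in isolation. The following sketch assumes a made-up inline input with one book that has a py element and one that does not; the variable $books and all values are illustrative only.

```xquery
(: Made-up sample input: only the first book carries a <py> element. :)
let $books :=
  <books>
    <book isbn="isbn-1"><title>First Title</title><py>2009</py></book>
    <book isbn="isbn-2"><title>Second Title</title></book>
  </books>
for $context in $books/book return
<publication>{
   (: the same expression decides whether the attribute exists ... :)
   let $v := $context/py return
   if (empty($v)) then () else
   (: ... and supplies its value :)
   attribute publicationYear {$v},
   <title>{$context/title/string()}</title>
}</publication>
```

Under these assumptions the first publication element receives publicationYear="2009", while the second is constructed without that attribute.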

Such code is constructed according to a simple pattern:

  1. Instantiate the root element (write start tag; content will follow)

  2. For each attribute model on this element model:

    a. Decide whether to construct this attribute

    b. If yes: set the value

  3. For each child element model on this element model:

    a. Decide how many instances to construct (possibly none)

    b. For each instance:

      i. Instantiate element (write start tag; content will follow)

      ii. If element has simple content: set the value

      iii. If element has attributes/child elements: continue at (2)
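The steps of this pattern can be traced in a shortened version of Example 1. The sketch below assumes a made-up inline source document in place of the context item, omits the publicationYear attribute and the creator elements, and marks each step with a comment.

```xquery
(: Made-up sample input; all values are illustrative only. :)
let $context :=
  <resources>
    <books lastUpdate="2015-01-01">
      <book isbn="isbn-1"><title>First Title</title></book>
      <book isbn="isbn-2"><title>Second Title</title></book>
    </books>
  </resources>
return
(: Step 1: instantiate the root element :)
<publications>{
   (: Step 2: attribute model @updatedAt :)
   let $value := $context/books/@lastUpdate return
   if (empty($value)) then ()            (: decide whether to construct :)
   else attribute updatedAt {$value},    (: set the value :)
   (: Step 3: child element model <publication> :)
   for $context in $context/books/book return   (: decide how many instances :)
   <publication>{                               (: instantiate each instance :)
      (: <publication> has child elements, so the pattern repeats from step 2 :)
      <isbn>{$context/@isbn/string()}</isbn>,   (: simple content: set the value :)
      <title>{$context/title/string()}</title>  (: simple content: set the value :)
   }</publication>
}</publications>
```

For elements such as isbn and title, which the target model requires exactly once, the decision about the number of instances is trivial and leaves no trace in the code.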

The example illustrates that the code of a pull transformation can be regarded as a sum of two components: a scaffold which reflects the target document model and structures the code; and “nuggets” – small pieces of code pulling information from the source document. Note the frequent use of the variable $context, whose value is a shifting set of source nodes. It provides an appropriate starting point for navigating to the nodes of interest, for example:

   $context/title

Here, $context is always bound to the “right” <book> element, namely the <book> currently being processed. Such a context is essential for locating the appropriate source nodes.
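The shifting binding of $context can be observed in a minimal sketch, in which each nested for clause rebinds the variable to a deeper set of source nodes. The inline sample document and its values are made up for illustration, and the title attribute on the result is not part of the target document model; it only makes the current binding visible.

```xquery
(: Made-up sample input; all values are illustrative only. :)
let $context :=
  <resources>
    <books>
      <book isbn="isbn-1">
        <title>First Title</title>
        <author>Author A</author>
        <author>Author B</author>
      </book>
      <book isbn="isbn-2">
        <title>Second Title</title>
        <author>Author C</author>
      </book>
    </books>
  </resources>
return
   (: here $context is the <resources> element :)
   for $context in $context/books/book return
   (: here $context is the <book> currently processed :)
   <publication title="{$context/title}">{
      for $context in $context/author return
      (: here $context is the current <author> of that <book> :)
      <creatorName>{$context/string()}</creatorName>
   }</publication>
```

With this input, the first publication element contains the creator names Author A and Author B and the second contains only Author C, because each inner loop navigates from the <book> to which $context is bound at that point.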

The transformation can be regarded as a sequence of decisions, orchestrated by the target document structure. The main kinds of decision are:

• A? - Whether to construct an attribute (2a)
• #E? - How many instances of an element to construct (3a)
• A=? - Which value to use for an attribute (2b)
• E=? - Which value to use for a simple content element (3b.ii)

The following table compiles the code snippets of the example, stating the kind of decision each one implements and identifying the source and target nodes involved.

Table 1. Code snippets taken from the example of a simple pull transformation.

Transformation code        | Source nodes referenced by $context | Source nodes | Target nodes under construction | Decision kind
---------------------------|-------------------------------------|--------------|---------------------------------|--------------
$context/books/@lastUpdate | <resources>                         | @lastUpdate  | @updatedAt                      | A?
$context/books/@lastUpdate | <resources>                         | @lastUpdate  | @updatedAt                      | A=?
$context/books/book        | <resources>                         | <book>       | <publication>                   | #E?
$context/py                | <book>                              | <py>         | @publicationYear                | A?
$context/py                | <book>                              | <py>         | @publicationYear                | A=?
$context/@isbn/string()    | <book>                              | @isbn        | <isbn>                          | E=?
$context/title/string()    | <book>                              | <title>      | <title>                         | E=?
$context/author            | <book>                              | <author>     | <creator>                       | #E?
'Author'                   | -                                   | -            | <creatorRole>                   | E=?
$context/string()          | <author>                            | <author>     | <creatorName>                   | E=?

This example suggests considerable potential for code generation based on the target document model. Such code generation is best prepared by introducing a few abstractions.