Introduction and Background

Our focus has been on representing change to structured documents. Our original objective in this work was to find a way to represent change to structure. This turned out to provide a useful representation for overlapping hierarchy as well, with the advantage that other kinds of change can also be represented.

Jeni Tennison says in one of her excellent blog posts, "Overlap is arguably the main remaining problem area for markup technologists" [1]. She points out that this is an issue not only for academics looking at poetry and historical documents, but also for anyone managing change to structured documents. The example she cites is legislation that is amended over time, where the authors are not concerned about changes to structure; their primary interest is in the textual changes.

There are a number of different approaches to this problem, and some excellent reviews of their advantages and disadvantages [2] [3] [4] [5]. Our own goal is to represent changes to documents, such as versions of a document as it is amended over a period of time, and to represent them in a way that is easy to process. This reflects the classic advantage of XML, where content can be re-purposed to meet different needs. If the document can be re-purposed, then we need to be able to re-purpose changes also, and this means changes need to be represented in a way that is easy to process.

Ignoring for the moment changes to attributes, most changes can be represented by the addition and deletion of elements and their content. Additionally, we need to be able to mark segments of text that are either added or deleted. This approach allows us to represent any change, although not always in an optimal way. For example, in the extreme, the deletion of the 'old' document and addition of the 'new' document correctly represents the changes, but not in a very useful way. This leads to the observation that by duplicating content it is always possible to represent a change in a structured document. The problem is that we do not wish to duplicate content because this appears to the user as a change to the content, whereas in practice the only change may be to the structural markup around that content. This leads to the need to represent the addition, deletion, and overlapping of structural elements representing hierarchy.
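The contrast between a fine-grained representation and the extreme "delete everything, add everything" representation can be illustrated with a minimal sketch. It works on plain text only and uses Python's standard `difflib` to find added and deleted word runs. The function names `mark_text_changes`, `whole_document_change`, and `rebuild` are hypothetical, and this is not the comparison algorithm used in this work.

```python
from difflib import SequenceMatcher


def mark_text_changes(old, new):
    """Return (status, text) segments, where status is
    'unchanged', 'deleted' or 'added' (word-level granularity)."""
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            segments.append(("unchanged", " ".join(old_words[i1:i2])))
            continue
        # 'replace', 'delete' and 'insert' all reduce to a deleted
        # run, an added run, or both.
        if i1 < i2:
            segments.append(("deleted", " ".join(old_words[i1:i2])))
        if j1 < j2:
            segments.append(("added", " ".join(new_words[j1:j2])))
    return segments


def whole_document_change(old, new):
    """The extreme case: correct, but every word appears as changed."""
    return [("deleted", old), ("added", new)]


def rebuild(segments, which):
    """Recover the 'old' or 'new' text from a list of segments."""
    drop = "added" if which == "old" else "deleted"
    return " ".join(text for status, text in segments if status != drop)


if __name__ == "__main__":
    for status, text in mark_text_changes("the cat sat", "the dog sat"):
        print(status, repr(text))
```

Both representations allow either text to be rebuilt, but only the first keeps the unchanged words from appearing as changes, which is the property we want to preserve when the change is to structure rather than to content.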

The TEI format [6] has powerful, though complex, ways of representing different hierarchies, and also variants of text within a document. The goal is to provide rich semantic information about the document, representing all of this information in a single place. Using this semantically rich representation, it would be possible to generate all the different variants of the document, including variants of the text and variants of the hierarchy. When we are considering change, it is essentially all these different variants that we use as a starting point. Therefore, in this respect our goal is very similar to, but not quite the same as, the goal of the TEI format. As our starting point is a set of document variants, it is natural that we clearly identify each of these source variants in the single merged document. We therefore always make a very precise differentiation between two overlapping structures, because these are considered to have come from different source documents.

The inherent model that we adopt here, i.e. one that addresses the representation of variants of the whole document, is important because it does differ from a model where the desire is to represent variants in structure within a document. The latter model can lead to a very large number of whole document variants, and our model is not well suited to a large number of variants because the attribute values representing the variants become long and therefore difficult to manage. Our model addresses primarily overlap in the context of change to a document and is not intended as a solution to all overlap representation problems.

Although TEI has these mechanisms, most XML document formats, such as DITA [7] or DocBook [8], do not and would therefore benefit from a way of representing overlap. In these formats, overlap representation is needed in order to better represent change. There is a clear advantage to having a standard way to enhance an existing schema with change and overlap representation because structured document editing applications then need to understand only one way of handling this. Schmidt [9] suggests that a good way to manage documents that have overlapping hierarchy is to split them into separate documents and merge them as needed, though this idea does not seem to have gained a significant following.

There is another distinguishing feature of this solution. In other solutions for representing overlap, identifier attributes (which may or may not be strictly of type xml:id) are often used to indicate which fragments are part of the same element, but with this solution there is no such use of identifier attributes. The problem with using identifier attributes is that it is difficult to denote a fragment that is part of two separate hierarchies, because only one identifier attribute can be present on each element. The identifier attribute could contain a list of identifiers, but this makes the representation more difficult to process.

The representation described here is pure XML. As such, standard XML processing tools such as XSLT and XQuery can be used to process it. Each of the original document variants can be extracted: this was our primary goal and is an important feature. We have verified that it is quite simple in XSLT to extract a single version, and it is simple to determine the ancestors of a particular element or piece of text. We are currently researching alternative types of processing. One XSLT approach shows particular promise for processing n-way comparison results. This uses a template that employs sibling recursion and XSLT 3.0 maps; the maps keep track of the state of each tree using an extension to the principle of a common stack.
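As a rough illustration of what extracting a single version involves, the following Python sketch filters a merged document by version label. It assumes a hypothetical `versions` attribute holding a space-separated list of the variants an element belongs to; this attribute, the sample document, and the function names are assumptions made for the example and are not the markup defined in this paper. The sketch handles only added and deleted elements and deliberately ignores overlapping structure.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical merged document: 'versions' lists the variants
# in which each element is present.
MERGED = (
    '<doc>'
    '<para versions="A B">Kept.</para>'
    '<para versions="A">Only A.</para>'
    '<para versions="B">Only B.</para>'
    '</doc>'
)


def extract_version(merged_root, version):
    """Return a copy of the tree containing only one variant."""
    root = copy.deepcopy(merged_root)
    root.attrib.pop("versions", None)
    _prune(root, version)
    return root


def _prune(element, version):
    previous = None  # last child that was kept
    for child in list(element):
        label = child.get("versions")
        if label is not None and version not in label.split():
            # ElementTree stores the text after an element as its
            # 'tail'; move it so that it survives the removal.
            tail = child.tail or ""
            if previous is None:
                element.text = (element.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            element.remove(child)
        else:
            # Unlabelled elements are treated as present in all variants.
            child.attrib.pop("versions", None)
            _prune(child, version)
            previous = child


if __name__ == "__main__":
    merged = ET.fromstring(MERGED)
    print(ET.tostring(extract_version(merged, "A"), encoding="unicode"))
```

The same filter expressed in XSLT would be a modified identity transform with one extra template that suppresses elements not belonging to the requested variant.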

There are validation rules for this representation, which we express in Schematron. Validation against the original schema of the source documents would need to be done by extracting each version and validating it. In other words, we can assert that the representation is correct if the Schematron rules pass and if we can extract each of the original documents correctly, i.e. the extracted document is deep-equal to the original.
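The round-trip half of this check can be sketched in Python as follows. The `deep_equal` function here is a simplified stand-in that compares element names, attributes, text, and child order; it is not a full implementation of the XPath `deep-equal` function, and the names `round_trip_ok` and `extract` are hypothetical. The Schematron half of the check is not shown.

```python
import xml.etree.ElementTree as ET


def deep_equal(a, b):
    """Simplified structural equality for two ElementTree elements.

    Attribute order is ignored; text, tails and child order are not.
    Comments and processing instructions are dropped by the default
    parser and so play no part in the comparison.
    """
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "") != (b.text or ""):
        return False
    if len(a) != len(b):
        return False
    return all(
        (x.tail or "") == (y.tail or "") and deep_equal(x, y)
        for x, y in zip(a, b)
    )


def round_trip_ok(originals, extract):
    """Check that every original variant can be recovered.

    originals: dict mapping a version label to its original root element.
    extract:   caller-supplied function taking a version label and
               returning the root extracted from the merged document.
    """
    return all(
        deep_equal(extract(label), root) for label, root in originals.items()
    )


if __name__ == "__main__":
    first = ET.fromstring('<a x="1" y="2"><b>t</b>u</a>')
    second = ET.fromstring("<a y='2' x='1'><b>t</b>u</a>")
    print(deep_equal(first, second))
```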