Dominant Hierarchy



Methods for representing overlapping hierarchy often require one hierarchy to be designated as dominant, so that it is clear which tree structure 'overrides' the others. The representation proposed here needs no concept of a dominant hierarchy. We are at liberty to create whichever hierarchy reduces fragmentation as far as possible, and different algorithms can be adopted to generate different results. The format describes how to represent overlapping hierarchy; it does not dictate what the overlap should be. Another valid representation of the example above would therefore be as follows:

<book xmlns:dx="xx" dx="A,B">
    <p dx="A,B" dxTag="A">
        <seg dx="A,B" dxTag="A">
            <l dx="A,B" dxTagStart="B">Scorn not the sonnet;</l>
        </seg>
        <seg dx="A,B" dxTagStart="A">
            <l dx="A,B" dxTagEnd="B">critic, you have frowned, </l>
        </seg>
        <seg dx="A,B" dxTagEnd="A">
            <l dx="A,B" dxTagStart="B">Mindless of its just honours;</l>
        </seg>
        <seg dx="A,B" dxTag="A">
            <l dx="A,B" dxTagEnd="B">with this key </l>
            <l dx="A,B" dxTagStart="B">
                <dx:textGroup dx="A,B">
                    <dx:text dx="A">SHAKESPEARE</dx:text>
                    <dx:text dx="B">Shakespeare</dx:text>
                </dx:textGroup> 
                unlocked his heart;</l>
        </seg>
        <seg dx="A,B" dxTag="A">
            <l dx="A,B" dxTagEnd="B">the melody </l>
            <l dx="A,B" dxTag="B">Of this small lute gave ease to Petrarch's
                wound.</l>
        </seg>
    </p>
</book>
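Because neither hierarchy is dominant, either one should be recoverable from the combined form. The following Python sketch illustrates one way this might be done; it is not part of the format itself. It assumes that an element without a dxTag, dxTagStart or dxTagEnd attribute belongs to every document, that an element owned by another document is transparent (its content is kept but its tags are dropped), and that white space used only for indentation can be discarded. The short name dxTagMiddle is assumed by analogy with the other attribute names, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DX_NS = "xx"  # placeholder namespace URI used in the listing above
TAG_ATTRS = ("dxTag", "dxTagStart", "dxTagMiddle", "dxTagEnd")

SAMPLE = """<book xmlns:dx="xx" dx="A,B">
  <p dx="A,B" dxTag="A">
    <seg dx="A,B" dxTag="A">
      <l dx="A,B" dxTagStart="B">Scorn not the sonnet;</l>
    </seg>
    <seg dx="A,B" dxTagStart="A">
      <l dx="A,B" dxTagEnd="B">critic, you have frowned, </l>
    </seg>
    <seg dx="A,B" dxTagEnd="A">
      <l dx="A,B" dxTagStart="B">Mindless of its just honours;</l>
    </seg>
    <seg dx="A,B" dxTag="A">
      <l dx="A,B" dxTagEnd="B">with this key </l>
      <l dx="A,B" dxTagStart="B">
        <dx:textGroup dx="A,B">
          <dx:text dx="A">SHAKESPEARE</dx:text>
          <dx:text dx="B">Shakespeare</dx:text>
        </dx:textGroup>
        unlocked his heart;</l>
    </seg>
    <seg dx="A,B" dxTag="A">
      <l dx="A,B" dxTagEnd="B">the melody </l>
      <l dx="A,B" dxTag="B">Of this small lute gave ease to Petrarch's
        wound.</l>
    </seg>
  </p>
</book>"""


def squeeze(text):
    """Drop indentation-only text; collapse other white space runs (an assumption)."""
    if text is None or not text.strip():
        return ""
    return re.sub(r"\s+", " ", text)


def project(elem, doc, out):
    """Append to `out` the markup of `elem` as seen by document `doc`."""
    if doc not in elem.get("dx", doc).split(","):
        return  # this content does not occur in `doc`
    # Find which tag attribute (if any) is present and which document owns the tag.
    role, owner = next(((a, elem.get(a)) for a in TAG_ATTRS if elem.get(a)), (None, doc))
    wrapper = elem.tag.startswith("{" + DX_NS + "}")  # dx:textGroup and dx:text
    mine = owner == doc and not wrapper
    if mine and role in (None, "dxTag", "dxTagStart"):
        out.append(f"<{elem.tag}>")
    out.append(squeeze(elem.text))
    for child in elem:
        project(child, doc, out)
        out.append(squeeze(child.tail))
    if mine and role in (None, "dxTag", "dxTagEnd"):
        out.append(f"</{elem.tag}>")


def view(xml_text, doc):
    """Return the hierarchy of one document as a serialised string."""
    out = []
    project(ET.fromstring(xml_text), doc, out)
    return "".join(out)


print(view(SAMPLE, "A"))
print(view(SAMPLE, "B"))
```

Under these assumptions the A view contains a single p with four seg elements and the B view contains four l elements directly inside book. No fragment needs to know which hierarchy was chosen as the container.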

We can also take this a step further and look at the representation for what might be called full fragmentation, where each piece of text that has a different set of ancestors is put into its own fragment. The paragraph element could be treated in the same way, but ideally it is kept as a single element around all of the text, which gives a clearer and simpler representation.

<book xmlns:dx="xx" dx="A,B">
    <p dx="A,B" dxTag="A">
        <seg dx="A,B" dxTag="A">
            <l dx="A,B" dxTagStart="B">Scorn not the sonnet;</l>
        </seg>
        <seg dx="A,B" dxTagStart="A">
            <l dx="A,B" dxTagEnd="B">critic, you have frowned, </l>
        </seg>
        <seg dx="A,B" dxTagEnd="A">
            <l dx="A,B" dxTagStart="B">Mindless of its just honours;</l>
        </seg>
        <seg dx="A,B" dxTagStart="A">
            <l dx="A,B" dxTagEnd="B">with this key </l>
        </seg>
        <seg dx="A,B" dxTagEnd="A">
            <l dx="A,B" dxTagStart="B">
                <dx:textGroup dx="A,B">
                    <dx:text dx="A">SHAKESPEARE</dx:text>
                    <dx:text dx="B">Shakespeare</dx:text>
                </dx:textGroup>
                unlocked his heart;</l>
        </seg>
        <seg dx="A,B" dxTagStart="A">
            <l dx="A,B" dxTagEnd="B">the melody </l>
        </seg>
        <seg dx="A,B" dxTagEnd="A">
            <l dx="A,B" dxTag="B">Of this small lute gave ease to Petrarch's wound.</l>
        </seg>
    </p>
</book>
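The difference between the two representations can be made concrete by counting fragments. The Python sketch below is only an illustration: it treats every element carrying a dxTagStart or dxTagEnd attribute (or dxTagMiddle, a short name assumed here by analogy) as one fragment, and it compares two abbreviated versions of the same lines in which the SHAKESPEARE text group is omitted for brevity. The function name count_fragments is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes that mark an element as only a piece of an original element.
FRAGMENT_ATTRS = ("dxTagStart", "dxTagMiddle", "dxTagEnd")


def count_fragments(xml_text):
    """Count the elements that are fragments of an original element."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for elem in root.iter()
        if any(attr in elem.attrib for attr in FRAGMENT_ATTRS)
    )


# Abbreviated excerpt in the style of the first representation.
REDUCED = """<p dx="A,B" dxTag="A">
  <seg dx="A,B" dxTag="A">
    <l dx="A,B" dxTagEnd="B">with this key </l>
    <l dx="A,B" dxTagStart="B">unlocked his heart;</l>
  </seg>
</p>"""

# The same excerpt with full fragmentation.
FULL = """<p dx="A,B" dxTag="A">
  <seg dx="A,B" dxTagStart="A">
    <l dx="A,B" dxTagEnd="B">with this key </l>
  </seg>
  <seg dx="A,B" dxTagEnd="A">
    <l dx="A,B" dxTagStart="B">unlocked his heart;</l>
  </seg>
</p>"""

print(count_fragments(REDUCED), count_fragments(FULL))  # prints "2 4"
```

A count of this kind is one possible measure that an algorithm could try to minimise when choosing between valid representations.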

The actual hierarchy of the overlapping elements can be determined by any criteria. One criterion might be to minimise the fragmentation. The result of generating the above automatically, by comparing the two documents and aligning them according to their text content, is shown below. In this example the attribute names are shown in full: the dx attribute appears as deltaxml:deltaV2, and its content indicates whether the two documents are equal ("A=B") or not equal ("A!=B"). The hierarchy is reconstructed to reduce fragmentation.

<book xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
    deltaxml:deltaV2="A!=B"
    deltaxml:version="2.1" deltaxml:content-type="full-context">
    <p deltaxml:deltaV2="A!=B" deltaxml:deltaTag="A">
        <seg deltaxml:deltaV2="A!=B" deltaxml:deltaTag="A">
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagStart="B" 
                >Scorn not the sonnet;</l>
        </seg>
        <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagMiddle="B"> </l>
        <seg deltaxml:deltaV2="A!=B" deltaxml:deltaTag="A">
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagEnd="B">critic,
            you have frowned,</l>
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagStart="B">Mindless
            of its just honours;</l>
        </seg>
        <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagMiddle="B"> </l>
        <seg deltaxml:deltaV2="A!=B" deltaxml:deltaTag="A">
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagEnd="B">with this key</l>
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagStart="B">
               <deltaxml:textGroup deltaxml:deltaV2="A!=B">
                   <deltaxml:text deltaxml:deltaV2="A"
                        >SHAKESPEARE</deltaxml:text>
                   <deltaxml:text deltaxml:deltaV2="B"
                        >Shakespeare</deltaxml:text>
               </deltaxml:textGroup>
               unlocked his heart;</l>
        </seg>
        <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagMiddle="B"> </l>
        <seg deltaxml:deltaV2="A!=B" deltaxml:deltaTag="A">
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTagEnd="B">the melody</l>
            <l deltaxml:deltaV2="A!=B" deltaxml:deltaTag="B">Of this
                small lute gave ease to Petrarch's wound.</l>
        </seg>
    </p>
</book>

In addition, several elements contain only white space, e.g. the second <l> element. This is because the A document contained a space between the two <seg> elements:

<seg>Scorn not the sonnet;</seg> <seg>critic, you have frowned,
Mindless of its just honours;</seg>

The B document had this space within the <l> element:

<l>Scorn not the sonnet; critic, you have frowned,</l>

As mentioned earlier, correct handling of white space is often very complicated because a careful distinction needs to be made between white space that can be ignored and white space that is part of the content. Element boundaries are not always word separators: elements that represent formatting are not considered word separators, whereas a new line would be. This distinction is often not clearly specified or represented in the XML schema.
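The point about word separators can be illustrated with a small Python sketch. It assumes a hypothetical set of formatting element names (b, i, em) whose boundaries do not separate words, and treats the boundary of every other element as a word separator. In practice this set would have to come from knowledge of the schema, which is exactly the information that is often missing.

```python
import xml.etree.ElementTree as ET

# Assumed set of formatting elements; their boundaries do not split words.
INLINE = frozenset({"b", "i", "em"})


def words(xml_text, inline=INLINE):
    """Return the words of a document, honouring which elements separate words."""
    parts = []

    def walk(elem):
        # A non-inline element contributes a separator at both boundaries.
        sep = "" if elem.tag in inline else " "
        parts.append(sep)
        parts.append(elem.text or "")
        for child in elem:
            walk(child)
            parts.append(child.tail or "")
        parts.append(sep)

    walk(ET.fromstring(xml_text))
    return "".join(parts).split()


doc = "<p><b>W</b>ord one<br/>two</p>"
print(words(doc))                      # ['Word', 'one', 'two']
print(words(doc, inline=frozenset()))  # ['W', 'ord', 'one', 'two']
```

The same markup yields different words depending on which elements are assumed to be formatting, which is why a comparison that aligns documents by text content needs this information to be stated explicitly.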

The overlapping hierarchy representation described here is therefore suited to a number of different situations.