Aggregation

The most challenging step in this data flow is to re-model the relational data into aggregated, hierarchical data. A relational database table becomes one (or more) XML document during the data export, which can be quite large, often several gigabytes. During the aggregation, we need to "​join​" these large "table documents".

We can use SAX, StAX or XSLT streaming mode to process large files, but the parallel processing of several large documents is problematic. Here comes the power of an XML database and XQuery. Configuring proper indexes is a must.

As I have described before, the complexity of the XQuery is not too bad, since we tend to reuse the same patterns multiple times.

This XQuery is specific to the database schema we work with, and cannot be written generically. This query script is the only part of the data pipeline which is custom made for each relational database schema; the other operations are generic and work with any database.

Most importantly, we need to have a decent knowledge of the relational database schema, and also the end user requirements about how this data will be searched and used.

Example 6. Aggregated records in hierarchical XML:

<travels>
    <travel id="t1">
        <date>2019-05-15</date>
        <reason>Customer meeting in Oslo</reason>
        <employee id="e1">
            <first-name>Ola</first-name>
            <second-name>Nordmann</second-name>
        </employee>
        <transactions>
            <transaction id="c1">
               <item-name>bus ticket</item-name>
               <cost currency="NOK">100</cost>
            </transaction>
            <transaction id="c2">
                <item-name>accommodation</item-name>
                <cost currency="NOK">1000</cost>
            </transaction>
        </transactions>
    </travel>
    <travel>...</travel>
    ...
</travels>

It's also possible that the data from one database is not aggregated into a single list of records, but into multiple lists of records using different XML schemas.