XML serialization

Serializing arbitrary data into XML has its challenges. First of all, DBUnit has to determine the type of each data field. Is it:

  • text

  • numeric

  • binary

If it's binary, it gets embedded into the XML as a Base64-encoded string. Every character this encoding uses (the 64-character alphabet plus the "=" padding character) is a valid XML character.
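The Base64 round trip can be illustrated with a minimal Python sketch. It is not DBUnit's own code; the element name "value" and the helper name are assumptions made for the example.

```python
import base64
import xml.etree.ElementTree as ET


def binary_value_element(data: bytes) -> ET.Element:
    """Wrap raw bytes in an element whose text is the Base64 form of the bytes."""
    elem = ET.Element("value")  # hypothetical element name
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


# Bytes that could never appear literally in an XML 1.0 document.
blob = bytes([0x00, 0x01, 0x1F, 0xFF])

xml_text = ET.tostring(binary_value_element(blob), encoding="unicode")

# Reading the element back and decoding restores the original bytes.
restored = base64.b64decode(ET.fromstring(xml_text).text)
```

The encoded text contains only letters, digits, "+", "/" and "=", so it needs no escaping inside the element.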

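Serializing numeric values is easy. However, text data can contain characters that are invalid in XML. For example, some legacy applications used "control characters", those below decimal 32 (the space character). Most of these are invalid in XML 1.0, which allows only tab (9), line feed (10), and carriage return (13) from that range. XML 1.1 expands the set of allowed characters, although it still excludes NUL and requires the other control characters to be written as character references.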

Of course, we could work around this by Base64 encoding all textual data, but this would make querying the data cumbersome.

In our projects, we stuck with XML 1.0. When we encountered illegal XML characters in the DBUnit output, we could usually remove them safely with a low-level, text-based process.
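A text-level cleanup of this kind might look like the following Python sketch. It is an illustration, not the process we actually ran: the function name is hypothetical, and the character class is simply the complement of the Char production in the XML 1.0 specification.

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml10(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document."""
    return _ILLEGAL_XML10.sub("", text)


# NUL and SOH are dropped; tab and line feed are legal and survive.
cleaned = strip_illegal_xml10("abc\x00\x01def\tghi\n")
```

The sketch only handles literal characters. If the serialized output also contains numeric character references to illegal code points, those would need a separate pass.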

Data types

Converting the data types correctly is a challenging part of the process. SQL implementations bring vendor-specific data types, and all of them have to be mapped accurately to XML Schema data types. This is done via configuration.
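As a rough idea of what such a configuration could contain, here is a Python sketch of a lookup table from SQL type names to XML Schema types. The table entries, the variable name TYPE_MAP and the function xsd_type_for are all assumptions for illustration; the real mapping depends on the database vendor and on the precision the project needs.

```python
import re

# Hypothetical configuration: SQL type name -> XML Schema built-in type.
TYPE_MAP = {
    "VARCHAR": "xs:string",
    "VARCHAR2": "xs:string",      # Oracle
    "INTEGER": "xs:int",
    "BIGINT": "xs:long",
    "DECIMAL": "xs:decimal",
    "NUMBER": "xs:decimal",       # Oracle
    "BOOLEAN": "xs:boolean",
    "DATE": "xs:date",
    "TIMESTAMP": "xs:dateTime",
    "DATETIME2": "xs:dateTime",   # SQL Server
    "BLOB": "xs:base64Binary",
}


def xsd_type_for(sql_type: str) -> str:
    """Look up the XML Schema type for a SQL type such as 'varchar2(30)'."""
    # Drop any length/precision suffix and normalize the case.
    base = re.sub(r"\(.*\)", "", sql_type).strip().upper()
    if base not in TYPE_MAP:
        # Fail loudly so that gaps in the configuration are noticed early.
        raise KeyError(f"no XML Schema mapping configured for {sql_type!r}")
    return TYPE_MAP[base]
```

Raising on an unknown type rather than defaulting to xs:string is a design choice in this sketch: a silent default would hide exactly the vendor-specific types that need attention.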

Row customizer

In our framework, we tried to keep the process as generic as possible, but we also added customization options.

Our row customizer is a custom XQuery that gets a single row as input and can filter or modify it.
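The contract of such a customizer (one row in, either a possibly modified row or nothing out) can be sketched in Python. This is only an analogue of the idea, not our XQuery code; the row layout, the column names ACTIVE and EMAIL, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Iterator, Optional

# A customizer receives one row and returns it (possibly modified),
# or None to filter the row out.
RowCustomizer = Callable[[ET.Element], Optional[ET.Element]]


def mask_email(row: ET.Element) -> Optional[ET.Element]:
    """Example customizer: drop inactive rows and mask the EMAIL column."""
    if row.findtext("ACTIVE") == "0":
        return None
    email = row.find("EMAIL")
    if email is not None:
        email.text = "***"
    return row


def apply_customizer(rows: Iterable[ET.Element],
                     customizer: RowCustomizer) -> Iterator[ET.Element]:
    """Run the customizer over every row, keeping only non-None results."""
    for row in rows:
        result = customizer(row)
        if result is not None:
            yield result


rows = ET.fromstring(
    "<rows>"
    "<row><ID>1</ID><EMAIL>a@example.org</EMAIL><ACTIVE>1</ACTIVE></row>"
    "<row><ID>2</ID><EMAIL>b@example.org</EMAIL><ACTIVE>0</ACTIVE></row>"
    "</rows>"
)
kept = list(apply_customizer(rows, mask_email))
```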

Format conversion

The serialization format DBUnit uses is not the best when we want to query the data set. It is worth writing generic code that turns XML with column names as text into XML with column names as element names.
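A generic conversion of this kind can be sketched in Python as follows. The input layout (table, column, row, value and null elements) follows DBUnit's XmlDataSet format as we understand it; the output layout, with one element per row named after the table, and the function name are assumptions of the sketch. It also assumes that table and column names are valid XML names.

```python
import xml.etree.ElementTree as ET


def to_element_per_column(dataset: ET.Element) -> ET.Element:
    """Turn positional <column>/<value> pairs into elements named by column."""
    out = ET.Element("dataset")
    for table in dataset.findall("table"):
        columns = [c.text for c in table.findall("column")]
        for row in table.findall("row"):
            record = ET.SubElement(out, table.get("name"))
            # Cells are positional: the n-th cell belongs to the n-th column.
            for name, cell in zip(columns, row):
                if cell.tag == "null":
                    continue  # represent SQL NULL by omitting the element
                ET.SubElement(record, name).text = cell.text
    return out


source = ET.fromstring(
    '<dataset><table name="PERSON">'
    "<column>ID</column><column>NAME</column>"
    "<row><value>1</value><value>Alice</value></row>"
    "<row><value>2</value><null/></row>"
    "</table></dataset>"
)
converted = ET.tostring(to_element_per_column(source), encoding="unicode")
```

After the conversion, a query can address a column directly by name (for example PERSON/NAME) instead of matching a value to its column by position.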

Example 5. XML data using the relational model - better semantics: