Preliminaries

Our initial investigations explored a number of available tools for source code conversion. The only one that looked at all promising was a commercial product, Tangible[6]. We bought a license to evaluate its capabilities, and the exercise taught us a lot about where the difficulties were going to arise. It was immediately apparent that we would have considerable difficulties with Java generics, with anonymous inner classes, and with our extensive use of the Java CharSequence interface, which has no direct equivalent in .NET. The exercise also taught us that Tangible, on its own, wasn't up to the job. (Having said that, the conversions performed by Tangible helped us greatly in defining our own rules.)

Our next step was to reduce our dependence on the constructs that were going to prove difficult to convert: especially generics, and the use of CharSequence. I have described in more detail how we achieved this in blog postings: [7] [8].

Generics are difficult because although Java and C# use superficially similar syntax (for example List<String>), the semantics are very different. In C#, instances are annotated at run time with the full expanded type, and one can therefore write run-time tests such as x is List<String>. Writing x is List will return false: List<String> is not a subtype of List. By contrast, in Java the type parameters are used only at compile time and are discarded at run time (a process called type erasure). This means that in Java, x instanceof List<String> is not allowed, while x instanceof List returns true.
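The Java half of this contrast can be observed directly. The following is a minimal sketch; the class name ErasureDemo and its methods are illustrative only and do not come from the Saxon code base.

```java
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {

    // After erasure both lists are plain ArrayList instances,
    // so they share a single run-time class.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    // Only the raw (or wildcard) form can be tested at run time.
    // "x instanceof List<String>" is rejected by the compiler when x is an Object.
    static boolean isList(Object x) {
        return x instanceof List;
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());                 // true
        System.out.println(isList(new ArrayList<Integer>()));   // true
        System.out.println(isList("not a list"));               // false
    }
}
```

A converter therefore cannot translate instanceof tests on generic types mechanically: the Java test succeeds for a list of any element type, whereas a C# test against List<String> does not.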

We decided to reduce the scale of the problem by dropping some of our use of generics from the product. In particular, in Saxon 9.9, two key interfaces, Sequence and SequenceIterator, were defined with type parameters restricting the type of items contained in an XDM sequence, and we dropped this in Saxon 10.0. The use of type parameters here had always been somewhat unsatisfactory, for two reasons:

Most of the time, the code has to deal with sequences-of-anything: we don't know statically, when we write the Saxon code, what type of input it is going to be dealing with (that depends on the user-written stylesheet). So providing the type parameter (Sequence<Item>) simply doesn't add any value.

The XDM model for sequences has the property that an item is a sequence of length one. So Item implements Sequence<Item>. Which means that a subclass of Item, such as DateTimeValue, implements Sequence<DateTimeValue>. Which, followed to its logical conclusion, means that a DateTimeValue is an Item<DateTimeValue>, and a generic item is therefore an Item<Item> (or is it an Item<Item<Item<...>>>?). Modelling the XDM structure accurately using Java generics proved very difficult, and in the end it introduced a whole load of complexity without adding much value. Getting rid of it was welcome.
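One facet of the difficulty can be sketched with simplified, hypothetical declarations. The interfaces below only borrow the names used above; they are an assumption made for illustration and are not Saxon's actual definitions.

```java
// Hypothetical, simplified declarations for illustration only.
interface Sequence<T extends Item> {
    T head();   // first item of the sequence
}

// An item is a sequence of length one, so it is its own head.
interface Item extends Sequence<Item> {
    @Override
    default Item head() {
        return this;
    }
}

// DateTimeValue inherits Sequence<Item> through Item. Java does not let a
// class inherit the same generic interface with two different type
// arguments, so adding "implements Sequence<DateTimeValue>" here would be
// a compile-time error unless Item itself became generic.
final class DateTimeValue implements Item {
}
```

In this sketch the only way to give DateTimeValue the more precise type is to parameterize Item, which leads to the self-referential Item<Item<...>> shape described above.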

As far as the CharSequence interface is concerned, we used this extensively in interfaces where strings are passed around, to enable us to use implementations of strings other than the Java String class. For example, the whitespace that often occurs between elements in an XML document is compressed using run-length encoding as a CompressedWhitespace object, which implements the CharSequence interface, and can therefore be substituted in many cases for a Java String.
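To illustrate the idea, here is a minimal sketch of a run-length-encoded CharSequence. The class RunLengthWhitespace is hypothetical: it is not Saxon's CompressedWhitespace, whose internal encoding is not described here, and it does not check that its input really is whitespace.

```java
// Hypothetical sketch: stores each run of identical characters once.
final class RunLengthWhitespace implements CharSequence {
    private final char[] chars;   // one character per run
    private final int[] counts;   // number of repetitions in each run
    private final int length;     // total expanded length

    private RunLengthWhitespace(char[] chars, int[] counts, int length) {
        this.chars = chars;
        this.counts = counts;
        this.length = length;
    }

    static RunLengthWhitespace compress(CharSequence in) {
        // First pass: count the runs so the arrays can be sized exactly.
        int runs = 0;
        for (int i = 0; i < in.length(); i++) {
            if (i == 0 || in.charAt(i) != in.charAt(i - 1)) {
                runs++;
            }
        }
        char[] chars = new char[runs];
        int[] counts = new int[runs];
        // Second pass: record each run's character and length.
        int r = -1;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (i == 0 || c != in.charAt(i - 1)) {
                chars[++r] = c;
            }
            counts[r]++;
        }
        return new RunLengthWhitespace(chars, counts, in.length());
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int remaining = index;
        for (int r = 0; r < chars.length; r++) {
            if (remaining < counts[r]) {
                return chars[r];
            }
            remaining -= counts[r];
        }
        throw new AssertionError("unreachable");
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().substring(start, end);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int r = 0; r < chars.length; r++) {
            for (int k = 0; k < counts[r]; k++) {
                sb.append(chars[r]);
            }
        }
        return sb.toString();
    }
}
```

Because the class implements CharSequence, it can be passed to any API that accepts one (for example StringBuilder.append) without first being expanded by the caller.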

The use of CharSequence isn't perfect for this purpose, however. Firstly, it has the same problem as a Java String in that it models a string as a sequence of 16-bit UTF-16 char values, using a surrogate pair to represent Unicode astral codepoints. In XPath, strings need to be codepoint-addressable (at least for the purposes of functions such as substring() and translate()), and neither String nor CharSequence meets this requirement. There are also issues concerning comparison across different implementations of the CharSequence interface, plus the fact that many commonly used methods in the standard Java class library require the CharSequence to be converted to a String, which generally involves copying the content. In addition, the CharSequence interface doesn't guarantee immutability. For these reasons, we had already introduced another string representation, the UnicodeString, which we were using in many corners of the code, notably when processing regular expressions.
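The addressing problem is easy to demonstrate with standard java.lang.String methods. In this small sketch the class name SurrogateDemo and the sample text are illustrative choices.

```java
class SurrogateDemo {
    // "a", then U+1D11E (MUSICAL SYMBOL G CLEF, an astral codepoint), then "b".
    static final String TEXT = "a\uD834\uDD1Eb";

    // Number of 16-bit char values: the astral codepoint counts twice.
    static int charLength() {
        return TEXT.length();
    }

    // Number of Unicode codepoints, which is what XPath string functions count.
    static int codePointLength() {
        return TEXT.codePointCount(0, TEXT.length());
    }

    // Finding the Nth codepoint needs a scan from the start of the string,
    // because the char offset depends on how many surrogate pairs precede it.
    static int codePointAtIndex(String s, int codePointIndex) {
        return s.codePointAt(s.offsetByCodePoints(0, codePointIndex));
    }

    public static void main(String[] args) {
        System.out.println(charLength());        // 4
        System.out.println(codePointLength());   // 3
        System.out.println(Character.isHighSurrogate(TEXT.charAt(1)));        // true
        System.out.println(Integer.toHexString(codePointAtIndex(TEXT, 1)));   // 1d11e
    }
}
```

The char at index 1 is only half of a character, which is why char-based indexing cannot implement substring() or translate() correctly without extra bookkeeping.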

C# has no direct equivalent of CharSequence: that is, an interface which is implemented by the standard String class, but which also allows for other implementations. The interface IEnumerable<Char> comes close, but that doesn't allow for direct addressing to get the Nth character in a string.

So we decided to scrap our extensive use of CharSequence throughout the product, and replace it with our own UnicodeString interface – which allows for direct codepoint addressing, rather than char addressing with surrogate pairs. There is a performance hit in doing this, because there's a lot of conversion between String and UnicodeString when data crosses the boundary between Saxon and third-party software (notably the XML parser, but also library routines such as upperCase() and lowerCase()). However, it's sufficiently small that most users won't notice the difference, and we can mitigate it – for example we have our own UTF-8 Writer used by the Saxon serializer, and it was easy to extend the UTF-8 Writer to accept a UnicodeString as input, bypassing the conversion of UnicodeString to String prior to UTF-8 encoding.
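As a rough illustration of these two ideas (direct codepoint addressing, and UTF-8 encoding without an intermediate String), consider the following sketch. The class CodepointString and all of its methods are hypothetical; Saxon's UnicodeString is an interface whose actual API and implementations are not shown here. The sketch assumes the input contains no unpaired surrogates.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch: one int per codepoint, so indexing is direct.
final class CodepointString {
    private final int[] codepoints;

    private CodepointString(int[] codepoints) {
        this.codepoints = codepoints;
    }

    // Conversion at the boundary with String-based code: this copies the content.
    static CodepointString of(String s) {
        return new CodepointString(s.codePoints().toArray());
    }

    int length() {
        return codepoints.length;
    }

    // Direct addressing: no scan for surrogate pairs is needed.
    int codePointAt(int index) {
        return codepoints[index];
    }

    CodepointString substring(int start, int end) {
        return new CodepointString(Arrays.copyOfRange(codepoints, start, end));
    }

    // Encodes straight from codepoints to UTF-8 bytes, with no String in between.
    byte[] toUtf8() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codepoints) {
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    @Override
    public String toString() {
        return new String(codepoints, 0, codepoints.length);
    }
}
```

An array of int is the simplest possible representation and is wasteful for mostly-ASCII text; it is used here only to keep the sketch short.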