Our initial investigations explored a number of available tools for source code conversion.
The only one that looked at all promising was a commercial product, Tangible [6]. We bought a license
to evaluate its capabilities, and the exercise taught us a lot about where the difficulties were
going to arise. It was immediately apparent that we would have considerable trouble with Java
generics, with anonymous inner classes, and with our extensive use of the Java CharSequence interface,
which has no direct equivalent in .NET. The exercise also taught us that Tangible, on its own, wasn't
up to the job. (Having said that, the conversions performed by Tangible helped us greatly in defining
our own rules.)
Our next step was to reduce our dependence on the constructs that were going to prove difficult
to convert: especially generics, and the use of CharSequence. I have described in more detail how we
achieved this in blog postings [7], [8].
Generics are difficult because although Java and C# use superficially similar syntax (for example
List<String>) the semantics are very different.
In C# instances are annotated at run-time with the full expanded type, and one can therefore write
run-time tests such as x is List<String>. Writing x is List will return false: List<String> is not
a subtype of List. By contrast, with Java, the type parameters are used only at compile time
and are discarded at run time (the process is called Type Erasure). This means that on Java,
x instanceof List<String> is not allowed, while x instanceof List returns true.
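
The Java half of this contrast can be seen in a few lines. The following is a minimal sketch, not code from Saxon; the class and method names (ErasureDemo, sameRuntimeClass, isList) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these names come from Saxon.
final class ErasureDemo {

    // After erasure both arguments are plain ArrayList instances, so their
    // run-time classes are identical whatever the type arguments were.
    static boolean sameRuntimeClass(List<?> a, List<?> b) {
        return a.getClass() == b.getClass();
    }

    // For an arbitrary Object, only the raw (or wildcard) type can be tested.
    // Writing "x instanceof List<String>" here is rejected by the compiler.
    static boolean isList(Object x) {
        return x instanceof List;
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> integers = new ArrayList<>();
        System.out.println(sameRuntimeClass(strings, integers)); // true
        System.out.println(isList(strings));                     // true
        System.out.println(isList("not a list"));                // false
    }
}
```

Any converted code that relies on a test like isList therefore needs a C# counterpart that does not depend on the type argument.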
We decided to reduce the scale of the problem by dropping some of our use of generics
from the product. In particular, in Saxon 9.9, two key interfaces, Sequence and SequenceIterator,
were defined with type parameters restricting the type of items contained in an XDM sequence,
and we dropped this in Saxon 10.0. The use of type parameters here had always been somewhat
unsatisfactory, for two reasons:
Most of the time, the code has to deal with sequences-of-anything: we don't know statically,
when we write the Saxon code, what type of input it is going to be dealing with (that depends on
the user-written stylesheet). So providing the type parameter (Sequence<Item>) simply doesn't
add any value.
The XDM model for sequences has the property that an item is a sequence of length one. So
Item implements Sequence<Item>, which means that a subclass of Item, such as DateTimeValue,
implements Sequence<DateTimeValue>. Followed to its logical conclusion, this means that a
DateTimeValue is an Item<DateTimeValue>, and a generic item is therefore an Item<Item>
(or is it an Item<Item<Item<...>>>?). Modelling the XDM structure accurately using
Java generics proved very difficult, and in the end, it introduced a whole load of complexity
without adding much value. Getting rid of it was welcome.
As far as the CharSequence interface is concerned, we used this extensively in interfaces
where strings are passed around, to enable us to use implementations of strings other than the
Java String class. For example, the whitespace that often occurs between elements in an XML document
is compressed using run-length encoding as a CompressedWhitespace object, which implements the
CharSequence interface, and can therefore be substituted in many cases for a Java String.
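
To illustrate the idea, here is a much simplified sketch of a run-length-encoded whitespace string that implements CharSequence. It is not Saxon's CompressedWhitespace class: the name RunLengthWhitespace, the constructor and the internal layout are all assumptions made for this example.

```java
// Hypothetical, simplified stand-in; NOT Saxon's CompressedWhitespace.
final class RunLengthWhitespace implements CharSequence {
    private final char[] chars;  // the character of each run, e.g. '\n', ' '
    private final int[] counts;  // how many times each run's character repeats
    private final int length;    // total number of chars when expanded

    RunLengthWhitespace(char[] chars, int[] counts) {
        if (chars.length != counts.length) {
            throw new IllegalArgumentException("chars and counts differ in length");
        }
        int total = 0;
        for (int count : counts) {
            if (count < 0) {
                throw new IllegalArgumentException("negative run length");
            }
            total += count;
        }
        this.chars = chars.clone();
        this.counts = counts.clone();
        this.length = total;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Walk the runs until the one containing the requested position.
        int remaining = index;
        for (int i = 0; i < chars.length; i++) {
            if (remaining < counts[i]) {
                return chars[i];
            }
            remaining -= counts[i];
        }
        throw new AssertionError("unreachable: index was range-checked");
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Simplest correct option: expand, then delegate to String.
        return toString().substring(start, end);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < chars.length; i++) {
            for (int j = 0; j < counts[i]; j++) {
                sb.append(chars[i]);
            }
        }
        return sb.toString();
    }
}
```

An instance built as new RunLengthWhitespace(new char[]{'\n', ' '}, new int[]{1, 4}) can be passed to any method that accepts a CharSequence, such as StringBuilder.append, without first being expanded into a String.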
The use of CharSequence isn't perfect for this purpose, however. Firstly, it has the same problem
as a Java String in that it models a string as a sequence of 16-bit UTF-16 char values, using a
surrogate pair to represent Unicode astral codepoints. In XPath, strings need to be codepoint-addressable
(at least for the purposes of functions such as substring() and translate()), and neither String nor
CharSequence meets this requirement. There are also issues concerning comparison across different
implementations of the CharSequence interface, plus the fact that many commonly used methods in the
standard Java class library require the CharSequence to be converted to a String, which generally
involves copying the content. In addition, the CharSequence interface doesn't guarantee immutability.
For these reasons, we had already introduced another string representation, the UnicodeString, which
we were using in many corners of the code, notably when processing regular expressions.
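
The addressing problem is easy to reproduce with the standard String class. The sketch below uses only java.lang.String methods; the class name SurrogateDemo and the helper codePointAtPosition are invented for this example.

```java
// Illustrative only: shows char addressing versus codepoint addressing.
final class SurrogateDemo {

    // 'a', then U+1D54F (an astral codepoint, stored as a surrogate pair), then 'b'.
    static final String SAMPLE = "a\uD835\uDD4Fb";

    // Returns the codepoint at a 0-based codepoint position.
    // This has to scan from the start of the string, so it is O(n), not O(1).
    static int codePointAtPosition(String s, int position) {
        int charIndex = s.offsetByCodePoints(0, position);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                           // 4 chars
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length())); // 3 codepoints
        // charAt(2) lands in the middle of the surrogate pair.
        System.out.println(Character.isLowSurrogate(SAMPLE.charAt(2))); // true
        // The third codepoint is 'b', which sits at char index 3.
        System.out.println((char) codePointAtPosition(SAMPLE, 2));      // b
    }
}
```

A function such as substring() that counts positions in codepoints cannot simply index into the char array; it either scans, as above, or needs a representation that is addressable by codepoint.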
C# has no direct equivalent of CharSequence: that is, an interface which is implemented by the
standard String class, but which also allows for other implementations. The interface IEnumerable<Char>
comes close, but that doesn't allow for direct addressing to get the Nth character in a string.
So we decided to scrap our extensive use of CharSequence throughout the product, and replace it with
our own UnicodeString interface, which allows for direct codepoint addressing, rather than char
addressing with surrogate pairs. There is a performance hit in doing this, because there's a lot
of conversion between String and UnicodeString when data crosses the boundary between Saxon and
third-party software (notably the XML parser, but also library routines such as upperCase() and
lowerCase()). However, it's sufficiently small that most users won't notice the difference, and
we can mitigate it: for example we have our own UTF-8 Writer used by the Saxon serializer, and
it was easy to extend the UTF-8 Writer to accept a UnicodeString as input, bypassing the conversion
of UnicodeString to String prior to UTF-8 encoding.
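
As a rough illustration of what codepoint addressing buys, and of where the conversion cost comes from, here is a minimal sketch of a codepoint-addressable string. It is not Saxon's UnicodeString: the class name CodePointString, its methods, and the choice of an int array as backing store are all assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical sketch; NOT Saxon's UnicodeString interface or any of its implementations.
final class CodePointString {
    private final int[] codePoints; // one array entry per Unicode codepoint

    private CodePointString(int[] codePoints) {
        this.codePoints = codePoints;
    }

    // Boundary conversion from String: copies and decodes every character.
    static CodePointString fromString(String s) {
        return new CodePointString(s.codePoints().toArray());
    }

    // Length in codepoints, not in UTF-16 chars.
    int length() {
        return codePoints.length;
    }

    // Direct O(1) addressing by codepoint position.
    int codePointAt(int index) {
        return codePoints[index];
    }

    // Substring with positions counted in codepoints.
    CodePointString substring(int start, int end) {
        if (start < 0 || end > codePoints.length || start > end) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        return new CodePointString(Arrays.copyOfRange(codePoints, start, end));
    }

    // Boundary conversion back to String: another full copy and re-encoding.
    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }
}
```

Both fromString and toString copy the whole content, which is the kind of boundary cost described above; a writer that consumes the codepoints directly avoids the second of those copies.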