Architecture of the Converter

The overall structure of the transpiler is shown below:

Let's explain this:

The first stage of conversion is to parse each Java module and generate an abstract syntax tree, which can be serialized as XML. For this purpose we use the open-source JavaParser product[9].

JavaParser generates the parse tree as a hierarchy of (not very well documented) Java objects. It also includes the capability to serialize this hierarchy as XML. We don't use its out-of-the-box serialization, however: we augment it with additional semantic information. JavaParser in fact has two parts (originally developed independently, and still showing evidence of the fact): the parser itself, which is exactly what it says, and the "symbol solver", which is a set of queries that can be executed on the parse tree to obtain additional information. For example, if the raw source code contains the expression new HashMap<>(), this will appear in the raw tree as:

<value nodeType="ObjectCreationExpr">
    <type nodeType="ClassOrInterfaceType">
       <name nodeType="SimpleName" identifier="HashMap"/>
       <typeArguments/>
    </type>
</value>

But with the aid of the symbol solver, it is straightforward to establish that the name HashMap refers to the class java.util.HashMap, and we output this as an additional attribute on the tree, thus:

<value nodeType="ObjectCreationExpr">
    <type nodeType="ClassOrInterfaceType" 
          RESOLVED_TYPE="java.util.HashMap">
       <name nodeType="SimpleName" identifier="HashMap"/>
       <typeArguments/>
    </type>
</value>
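
The decoration step can be sketched roughly as follows. This is a Python illustration, not the Java code used in the converter: the dictionary KNOWN_TYPES is a hypothetical stand-in for the symbol solver, and the function name decorate is invented for the example. Only the element and attribute names are taken from the XML above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the symbol solver: simple name -> qualified name.
KNOWN_TYPES = {"HashMap": "java.util.HashMap"}

RAW = """<value nodeType="ObjectCreationExpr">
    <type nodeType="ClassOrInterfaceType">
       <name nodeType="SimpleName" identifier="HashMap"/>
       <typeArguments/>
    </type>
</value>"""


def decorate(root):
    """Add RESOLVED_TYPE to every ClassOrInterfaceType whose name is known."""
    for node in root.iter():
        if node.get("nodeType") != "ClassOrInterfaceType":
            continue
        name = node.find("name")
        if name is None:
            continue
        # Names the lookup cannot resolve are simply left undecorated.
        resolved = KNOWN_TYPES.get(name.get("identifier"))
        if resolved is not None:
            node.set("RESOLVED_TYPE", resolved)
    return root


tree = decorate(ET.fromstring(RAW))
print(ET.tostring(tree, encoding="unicode"))
```

The real symbol solver does far more than a dictionary lookup (it takes imports, scopes, and the classpath into account), but the effect on the tree is the same: an extra attribute on the node whose meaning has been resolved.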

Similarly, the symbol solver is usually able to find the declaration corresponding to a variable reference or method call, and hence to establish the static type of the variable or of the method result. I say usually, because there are cases it gives up on. It struggles, for instance, with the types of the arguments to a lambda expression, such as the variable n in

search.setPredicate(n -> n.name.equals("John"))

Similarly it has difficulty with static wildcard imports:

import static org.w3c.dom.Node.*;

The other problem with the symbol solver is that it can do a lot of things that aren't mentioned in the documentation: we've found some of these by experiment, or by studying the source code. No doubt there are other gems that remain hidden.

The result of this process is that for each Java module in the product, we generate a corresponding XML file containing its decorated syntax tree.

In principle we could now write an XSLT transformation that serializes this syntax tree using C# syntax. But there's another step first. In some cases we can't generate the C# one file at a time: we need some global information. For example, Java methods can be overridden by default, whereas in C# a method that is to be overridden in a subclass needs to be flagged with the virtual modifier. Similarly, overriding methods need to be flagged as override. We therefore need to construct a map of the entire class hierarchy, working out which methods are overridden and which are overrides.

So the second phase of processing is to scan the entire collection of XML documents and generate a digest file (itself an XML document, naturally) which acts as an index of classes, interfaces, and methods, and which represents the class hierarchy of the application. Then (our third phase) we run a transformation on the digest file that augments it with decisions about which methods are overriding and which are virtual.
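
To make phases two and three concrete, here is a minimal Python sketch. It is not the XSLT used in the converter, and everything about the data is an assumption made for illustration: the input documents are reduced to hypothetical class and method elements, the overriding and overridden attribute names are invented, and methods are matched by name only, which ignores overloading.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-ins for the per-module XML documents.
MODULES = [
    '<class name="Shape"><method name="area"/><method name="describe"/></class>',
    '<class name="Circle" extends="Shape"><method name="area"/></class>',
    '<class name="UnitCircle" extends="Circle"><method name="area"/></class>',
]


def build_digest(modules):
    """Phase two: index every class, its superclass, and its methods."""
    digest = ET.Element("digest")
    for text in modules:
        cls = ET.fromstring(text)
        entry = ET.SubElement(digest, "class", name=cls.get("name"))
        if cls.get("extends"):
            entry.set("extends", cls.get("extends"))
        for m in cls.findall("method"):
            ET.SubElement(entry, "method", name=m.get("name"))
    return digest


def mark_overrides(digest):
    """Phase three: flag methods as overriding and/or overridden."""
    classes = {c.get("name"): c for c in digest.findall("class")}
    for cls in classes.values():
        for method in cls.findall("method"):
            # Walk up the superclass chain to the nearest same-named method.
            # Matching by name alone is a simplification (no overload handling).
            parent = classes.get(cls.get("extends"))
            while parent is not None:
                target = parent.find(f"method[@name='{method.get('name')}']")
                if target is not None:
                    method.set("overriding", "true")
                    target.set("overridden", "true")
                    break
                parent = classes.get(parent.get("extends"))
    return digest


digest = mark_overrides(build_digest(MODULES))
print(ET.tostring(digest, encoding="unicode"))
```

The point of the sketch is that neither flag can be computed from one class in isolation: whether Shape.area is overridden depends on Circle, which lives in a different module.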

Now finally we can perform the XML-to-C# phase, implemented as an XSLT transformation applied to each of the XML documents generated in phase one, but with the digest file available as additional information.
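
The way the digest feeds into code generation might look something like the following Python sketch. Again this is an illustration rather than the actual XSLT: the digest layout, the overriding and overridden attributes, and the function names csharp_modifier and method_header are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical digest fragment; the element and attribute names are assumptions.
DIGEST = ET.fromstring("""
<digest>
  <class name="Shape">
    <method name="area" overridden="true"/>
    <method name="describe"/>
  </class>
  <class name="Circle" extends="Shape">
    <method name="area" overriding="true" overridden="true"/>
  </class>
</digest>""")


def csharp_modifier(class_name, method_name):
    """Return the C# modifier implied by the digest, or '' if none is needed."""
    m = DIGEST.find(f"class[@name='{class_name}']/method[@name='{method_name}']")
    if m is None:
        return ""
    if m.get("overriding") == "true":
        # In C# an override can itself be overridden, so it is not also virtual.
        return "override"
    if m.get("overridden") == "true":
        return "virtual"
    return ""


def method_header(class_name, method_name, return_type="double"):
    """Assemble a C# method header, skipping the modifier when it is empty."""
    parts = ["public", csharp_modifier(class_name, method_name),
             return_type, f"{method_name}()"]
    return " ".join(p for p in parts if p)


print(method_header("Shape", "area"))      # public virtual double area()
print(method_header("Circle", "area"))     # public override double area()
print(method_header("Shape", "describe"))  # public double describe()
```

While generating code for one class, the generator only ever looks things up in the digest; it never needs to open the XML document for another module.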

The C# is then ready to be compiled and executed.