This was an opportunity to try out the new fn:element-to-map function.
I described the basic design for this function in a paper at Balisage in 2023[Kay 2023],
and we had an implementation in Saxon ready to test.
The function doesn't convert lexical XML to lexical JSON: rather, it converts the XDM representation of XML to the XDM representation of JSON (XDM being the XQuery and XPath Data Model that underpins XSLT and XQuery).
The idea of the function is that it converts each element type in one of a dozen or so different ways,
depending on the content model. The content model can be inferred either from a schema, or from examination of
a sample collection of input documents, or from an individual element instance. I didn't try to do a schema-aware
conversion because (a) no schema was available, and (b) generating one wouldn't be particularly useful, because
of the way the particular XML vocabulary works. Specifically, a typical fragment of the XML
(representing the Java code if (iter.next() != null) {iter.close(); return BooleanValue.FALSE;})
looks like this:
<statement nodeType="IfStmt">
   <condition nodeType="BinaryExpr" operator="NOT_EQUALS">
      <left nodeType="MethodCallExpr">
         <name nodeType="SimpleName" identifier="next"/>
         <scope nodeType="NameExpr">
            <name nodeType="SimpleName" identifier="iter"/>
         </scope>
      </left>
      <right nodeType="NullLiteralExpr"/>
   </condition>
   <thenStmt nodeType="BlockStmt">
      <statements>
         <statement nodeType="ExpressionStmt">
            <expression nodeType="MethodCallExpr">
               <name nodeType="SimpleName" identifier="close"/>
               <scope nodeType="NameExpr">
                  <name nodeType="SimpleName" identifier="iter"/>
               </scope>
            </expression>
         </statement>
         <statement nodeType="ReturnStmt">
            <expression nodeType="FieldAccessExpr">
               <name nodeType="SimpleName" identifier="FALSE"/>
               <scope nodeType="NameExpr">
                  <name nodeType="SimpleName" identifier="BooleanValue"/>
               </scope>
            </expression>
         </statement>
      </statements>
   </thenStmt>
</statement>
The element names here (left, right, condition, thenStmt) tell you nothing about what kind of thing the element is (and therefore what kind of structure its content has); rather, they tell you where it fits into the structure of the parent element. It's the nodeType attribute that tells you about the content model: if nodeType="BinaryExpr" then there will be children named left and right, while if nodeType="IfStmt" then there will be children named condition, thenStmt, and optionally elseStmt.
This design, as well as making it difficult to construct a schema, also makes it difficult for the element-to-map function to infer the right XML-to-JSON mapping to use for each element name.
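The difficulty can be seen by counting the distinct child sequences observed per element name and per nodeType value in the fragment above. The following is a minimal Python sketch written for illustration only; it is not part of the project described here, and the helper name child_shapes is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The XML fragment shown above, reproduced as test data.
SAMPLE = """<statement nodeType="IfStmt">
  <condition nodeType="BinaryExpr" operator="NOT_EQUALS">
    <left nodeType="MethodCallExpr">
      <name nodeType="SimpleName" identifier="next"/>
      <scope nodeType="NameExpr">
        <name nodeType="SimpleName" identifier="iter"/>
      </scope>
    </left>
    <right nodeType="NullLiteralExpr"/>
  </condition>
  <thenStmt nodeType="BlockStmt">
    <statements>
      <statement nodeType="ExpressionStmt">
        <expression nodeType="MethodCallExpr">
          <name nodeType="SimpleName" identifier="close"/>
          <scope nodeType="NameExpr">
            <name nodeType="SimpleName" identifier="iter"/>
          </scope>
        </expression>
      </statement>
      <statement nodeType="ReturnStmt">
        <expression nodeType="FieldAccessExpr">
          <name nodeType="SimpleName" identifier="FALSE"/>
          <scope nodeType="NameExpr">
            <name nodeType="SimpleName" identifier="BooleanValue"/>
          </scope>
        </expression>
      </statement>
    </statements>
  </thenStmt>
</statement>"""


def child_shapes(root, key):
    """Map each key value to the set of distinct child-name sequences seen for it.

    Hypothetical helper: `key` extracts the grouping value from an element.
    """
    shapes = defaultdict(set)
    for elem in root.iter():
        shapes[key(elem)].add(tuple(child.tag for child in elem))
    return shapes


root = ET.fromstring(SAMPLE)
by_name = child_shapes(root, lambda e: e.tag)
by_node_type = child_shapes(root, lambda e: e.get("nodeType"))

# Element names with more than one observed content shape.
ambiguous_names = sorted(n for n, s in by_name.items() if len(s) > 1)
# nodeType values with more than one observed content shape.
ambiguous_types = [t for t, s in by_node_type.items() if len(s) > 1]
print(ambiguous_names, ambiguous_types)
```

In this fragment the element name statement appears with two different content shapes (one for IfStmt, another for ExpressionStmt and ReturnStmt), whereas every nodeType value has exactly one. A mapping chosen per element name therefore has to cover several unrelated structures at once.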
Another consequence of the design is that most of the transpiler consists of rules of the form match="*[@nodeType='MethodCallExpr']": it is the nodeType attribute that drives the processing, not the element name.
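The same dispatch-on-nodeType pattern can be sketched outside XSLT. The Python below is an illustrative analogue only, not the transpiler's code: the rule decorator, the render function, and the OPERATORS table are hypothetical names, only a handful of node types are covered, and method calls are assumed to take no arguments. It renders the condition from the fragment above back as source text.

```python
import xml.etree.ElementTree as ET

HANDLERS = {}                      # nodeType value -> rendering function
OPERATORS = {"NOT_EQUALS": "!="}   # only the operator used in the sample


def rule(node_type):
    """Register a handler for one nodeType, much like a template rule's match pattern."""
    def register(fn):
        HANDLERS[node_type] = fn
        return fn
    return register


def render(elem):
    """Dispatch on the nodeType attribute; the element name plays no part."""
    node_type = elem.get("nodeType")
    if node_type not in HANDLERS:
        raise NotImplementedError(f"no rule for nodeType={node_type!r}")
    return HANDLERS[node_type](elem)


@rule("SimpleName")
def _simple_name(e):
    return e.get("identifier")


@rule("NameExpr")
def _name_expr(e):
    return render(e.find("name"))


@rule("NullLiteralExpr")
def _null_literal(e):
    return "null"


@rule("MethodCallExpr")
def _method_call(e):
    # Assumption: no arguments, as in the sample fragment.
    return f"{render(e.find('scope'))}.{render(e.find('name'))}()"


@rule("FieldAccessExpr")
def _field_access(e):
    return f"{render(e.find('scope'))}.{render(e.find('name'))}"


@rule("BinaryExpr")
def _binary(e):
    op = OPERATORS[e.get("operator")]
    return f"{render(e.find('left'))} {op} {render(e.find('right'))}"


CONDITION = """<condition nodeType="BinaryExpr" operator="NOT_EQUALS">
  <left nodeType="MethodCallExpr">
    <name nodeType="SimpleName" identifier="next"/>
    <scope nodeType="NameExpr">
      <name nodeType="SimpleName" identifier="iter"/>
    </scope>
  </left>
  <right nodeType="NullLiteralExpr"/>
</condition>"""

rendered = render(ET.fromstring(CONDITION))
print(rendered)  # iter.next() != null
```

Note that the same MethodCallExpr handler serves an element named left here and an element named expression elsewhere in the fragment, which is exactly why the element name is of little use for selecting a rule.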
The elements-to-maps() function, as it was specified and implemented at the time, had an option 'uniform':true() that caused the function to analyze the entire input and infer a mapping for each element name that took into account all the elements encountered with that name. Run against the entire collection of 2100 input files, it made quite reasonable decisions, so far as one could tell. However, for constructs that appeared only very rarely, it might have made a poor choice, and I probably wouldn't have noticed because, in the absence of a complete implementation of the transpiler, we didn't get as far as testing that we were generating correct C# code at the end of the process.
It also became clear that inferring the mapping from whatever structures happen to occur in the input to the function doesn't necessarily give the same answer if you run the same conversion on a different set of input files the following day. Because there is downstream code processing the JSON output, it is vital that tomorrow's output is consistent with today's.
This experience led to a decision to make the choices made by the processor more visible and open to scrutiny and adjustment. We split the function into two: element-to-map-plan() takes a corpus of XML documents and generates a conversion plan, specifically a so-called "layout" to be used for each element name. The second function, element-to-map(), accepts this plan as input, and uses it to guide a specific conversion. The plan is designed so it can itself be serialized as JSON, which means that (a) it can be modified by hand, and (b) the same plan can be used consistently every time a conversion is run, even though the original data is no longer available.
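The value of separating planning from conversion can be illustrated with a deliberately simplified Python sketch. It does not reproduce the actual functions or their layout vocabulary: make_plan and convert are hypothetical names, and only two invented layouts are used ("record" for an object keyed by child name, "list" for an array of children), whereas the real function chooses among a dozen or so. The "list" layout here also discards the element's own attributes, which is acceptable only because this is a sketch.

```python
import json
import xml.etree.ElementTree as ET


def make_plan(corpus):
    """Choose one layout per element name, looking at every instance in the corpus."""
    plan = {}
    for root in corpus:
        for elem in root.iter():
            names = [child.tag for child in elem]
            repeated = len(names) != len(set(names))
            # Once any instance has repeated children, the name keeps the list layout.
            if plan.get(elem.tag) != "list":
                plan[elem.tag] = "list" if repeated else "record"
    return plan


def convert(elem, plan, prefix="_"):
    """Convert one element using the plan; names missing from the plan default to 'record'."""
    if plan.get(elem.tag, "record") == "list":
        return [convert(child, plan, prefix) for child in elem]
    result = {prefix + name: value for name, value in elem.attrib.items()}
    for child in elem:
        result[child.tag] = convert(child, plan, prefix)
    return result


# Today's corpus: a block containing two statements.
today = ET.fromstring(
    '<thenStmt nodeType="BlockStmt"><statements>'
    '<statement nodeType="ExpressionStmt"/><statement nodeType="ReturnStmt"/>'
    '</statements></thenStmt>'
)
# Tomorrow's input: a block that happens to contain only one statement.
tomorrow = ET.fromstring(
    '<thenStmt nodeType="BlockStmt"><statements>'
    '<statement nodeType="ReturnStmt"/>'
    '</statements></thenStmt>'
)

saved_plan = json.dumps(make_plan([today]))   # the plan can be stored or edited as JSON
reloaded_plan = json.loads(saved_plan)

with_saved_plan = convert(tomorrow, reloaded_plan)
with_fresh_plan = convert(tomorrow, make_plan([tomorrow]))
print(json.dumps(with_saved_plan))
print(json.dumps(with_fresh_plan))
```

With the saved plan, statements is still an array even though tomorrow's document contains a single statement; re-inferring the plan from tomorrow's document alone yields a nested object instead, which would break downstream code expecting an array.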
The JSON version of the XML fragment shown above ends up looking like this:
{
  "_nodeType": "IfStmt",
  "condition": {
    "_nodeType": "BinaryExpr",
    "_operator": "NOT_EQUALS",
    "left": {
      "_nodeType": "MethodCallExpr",
      "name": { "_nodeType": "SimpleName", "_identifier": "next" },
      "scope": {
        "_nodeType": "NameExpr",
        "name": { "_nodeType": "SimpleName", "_identifier": "iter" }
      }
    },
    "right": { "_nodeType": "NullLiteralExpr" }
  },
  "thenStmt": {
    "_nodeType": "BlockStmt",
    "statements": [
      {
        "_nodeType": "ExpressionStmt",
        "expression": {
          "_nodeType": "MethodCallExpr",
          "name": { "_nodeType": "SimpleName", "_identifier": "close" },
          "scope": {
            "_nodeType": "NameExpr",
            "name": { "_nodeType": "SimpleName", "_identifier": "iter" }
          }
        }
      },
      {
        "_nodeType": "ReturnStmt",
        "expression": {
          "_nodeType": "FieldAccessExpr",
          "name": { "_nodeType": "SimpleName", "_identifier": "FALSE" },
          "scope": {
            "_nodeType": "NameExpr",
            "name": { "_nodeType": "SimpleName", "_identifier": "BooleanValue" }
          }
        }
      }
    ]
  }
}
I decided to use "_" rather than "@" as a prefix for JSON properties derived from XML attributes, on the grounds that the result is a valid NCName and can therefore be more easily selected using the XPath lookup operator, for example $node?_nodeType. The element-to-map function allows any prefix (or none) to be used.
The fragments shown above illustrate the XML and JSON representations of the raw Java parse tree. In practice the JavaParser product also has an option (the type solver) to decorate the parse tree with additional attributes containing the inferred types of various constructs, and their expanded names. For example, the left node of the first condition above has two additional properties, "_RETURN": "net.sf.saxon.value.AtomicValue" and "_RESOLVED_TYPE": "com.saxonica.functions.qt4.DuplicateValues.DuplicatesIterator", indicating that in the Java method call iter.next(), the type of iter is com.saxonica.functions.qt4.DuplicateValues.DuplicatesIterator, and the return type of the method call is net.sf.saxon.value.AtomicValue.