Converting the input XML to JSON

This was an opportunity to try out the new fn:element-to-map function. I described the basic design for this function in a paper at Balisage in 2023 [Kay 2023], and we had an implementation in Saxon ready to test.

The function doesn't convert lexical XML to lexical JSON: rather, it converts the XDM representation of XML to the XDM representation of JSON (XDM being the XQuery and XPath Data Model that underpins XSLT and XQuery).

The idea of the function is that it converts each element type in one of a dozen or so different ways, depending on the content model. The content model can be inferred from a schema, from examination of a sample collection of input documents, or from an individual element instance. I didn't try to do a schema-aware conversion because (a) no schema was available, and (b) generating one wouldn't be particularly useful, because of the way this particular XML vocabulary works. Specifically, a typical fragment of the XML (representing the Java code if (iter.next() != null) {iter.close(); return BooleanValue.FALSE;}) looks like this:

 <statement nodeType="IfStmt">
    <condition nodeType="BinaryExpr" operator="NOT_EQUALS">
       <left nodeType="MethodCallExpr" >
          <name nodeType="SimpleName" identifier="next"/>
          <scope nodeType="NameExpr">
             <name nodeType="SimpleName" identifier="iter"/>
          </scope>
       </left>
       <right nodeType="NullLiteralExpr"/>
    </condition>
    <thenStmt nodeType="BlockStmt">
       <statements>
          <statement nodeType="ExpressionStmt">
             <expression nodeType="MethodCallExpr" >
                <name nodeType="SimpleName" identifier="close"/>
                <scope nodeType="NameExpr">
                   <name nodeType="SimpleName" identifier="iter"/>
                </scope>
             </expression>
          </statement>
          <statement nodeType="ReturnStmt">
             <expression nodeType="FieldAccessExpr">
                <name nodeType="SimpleName" identifier="FALSE"/>
                <scope nodeType="NameExpr">
                   <name nodeType="SimpleName" 
                         identifier="BooleanValue"/>
                </scope>
             </expression>
          </statement>
       </statements>
    </thenStmt>
 </statement>

The element names here (left, right, condition, thenStmt) tell you nothing about what kind of thing the element is (and therefore what kind of structure its content has); rather, they tell you where the element fits into the structure of its parent. It's the nodeType attribute that tells you about the content model: if nodeType="BinaryExpr" then there will be children named left and right, while if nodeType="IfStmt" then there will be children named condition, thenStmt, and optionally elseStmt.

Besides making it difficult to construct a schema, this design makes it difficult for the element-to-map function to infer the right XML-to-JSON mapping to use for each element name, because the same element name can carry quite different content depending on its nodeType.

Another consequence of the design is that most of the transpiler consists of rules of the form match="*[@nodeType='MethodCallExpr']": it is the nodeType attribute that drives the processing, not the element name.
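The transpiler itself is written in XSLT, but the dispatch principle can be sketched in any language. The following Python fragment is only an illustration, not the transpiler's code: the handler registry, the function names, and the tiny operator table are all hypothetical, and method-call arguments are ignored. It shows handlers selected by the nodeType attribute while the element names (left, right, scope, name) are used only to navigate within a node.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: handlers are keyed on nodeType, not on element name.
HANDLERS = {}

def handles(node_type):
    """Decorator that registers a handler for one nodeType value."""
    def register(fn):
        HANDLERS[node_type] = fn
        return fn
    return register

@handles("NullLiteralExpr")
def null_literal(elem):
    return "null"

@handles("NameExpr")
def name_expr(elem):
    return elem.find("name").get("identifier")

@handles("MethodCallExpr")
def method_call(elem):
    # Arguments are ignored in this sketch.
    scope = elem.find("scope")
    target = render(scope) + "." if scope is not None else ""
    return target + elem.find("name").get("identifier") + "()"

@handles("BinaryExpr")
def binary_expr(elem):
    operators = {"NOT_EQUALS": "!=", "EQUALS": "=="}  # illustrative subset
    left = render(elem.find("left"))
    right = render(elem.find("right"))
    return f"{left} {operators[elem.get('operator')]} {right}"

def render(elem):
    """Dispatch on the nodeType attribute, whatever the element is called."""
    return HANDLERS[elem.get("nodeType")](elem)

condition = ET.fromstring("""
<condition nodeType="BinaryExpr" operator="NOT_EQUALS">
  <left nodeType="MethodCallExpr">
    <name nodeType="SimpleName" identifier="next"/>
    <scope nodeType="NameExpr">
      <name nodeType="SimpleName" identifier="iter"/>
    </scope>
  </left>
  <right nodeType="NullLiteralExpr"/>
</condition>
""")

java_text = render(condition)
```

Renaming the outer element from condition to anything else would not change the result, which is the point: the element name plays no part in choosing the handler.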

The elements-to-maps() function, as it was specified and implemented at the time, had an option 'uniform':true() that caused the function to analyze the entire input and infer a mapping for each element name that took into account all the elements encountered with that name. Run against the entire collection of 2100 input files, it made quite reasonable decisions, so far as one could tell. However, for constructs that appeared only very rarely, it might have made a poor choice, and I probably wouldn't have noticed because, in the absence of a complete implementation of the transpiler, we didn't get as far as testing that we were generating correct C# code at the end of the process.

It also became clear that examining all the structures that occur in the input to the function doesn't necessarily give the right answer if you run the same conversion on a different set of input files the following day. Because there is downstream code processing the JSON output, it is vital that tomorrow's output is consistent with today's.

This experience led to a decision to make the choices made by the processor more visible and open to scrutiny and adjustment. We split the function into two: element-to-map-plan() takes a corpus of XML documents and generates a conversion plan, specifically a so-called "layout" to be used for each element name. The second function, element-to-map(), accepts this plan as input, and uses it to guide a specific conversion. The plan is designed so it can itself be serialized as JSON, which means that (a) it can be modified by hand, and (b) the same plan can be used consistently every time a conversion is run, even though the original data is no longer available.
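The value of a separate, persistent plan can be illustrated with a deliberately simplified Python sketch. It is not the algorithm used by element-to-map-plan(): the two layouts ("record" and "list"), the inference rule, and the function names infer_plan and convert are assumptions made for the example, and attributes on "list" elements are simply ignored. The rule used here is that an element name gets the "list" layout if any instance in the corpus has repeated child names.

```python
import json
import xml.etree.ElementTree as ET

def infer_plan(roots):
    """Choose one layout per element name from every instance in the corpus."""
    plan = {}
    for root in roots:
        for elem in root.iter():
            names = [child.tag for child in elem]
            if len(names) != len(set(names)):
                # Any instance with repeated children forces the list layout.
                plan[elem.tag] = "list"
            else:
                plan.setdefault(elem.tag, "record")
    return plan

def convert(elem, plan):
    """Convert one element, consulting the plan instead of the instance."""
    if plan.get(elem.tag, "record") == "list":
        return [convert(child, plan) for child in elem]
    result = {"_" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        result[child.tag] = convert(child, plan)
    return result

# Yesterday's corpus contains a block with two statements ...
two = ET.fromstring(
    '<statements><statement nodeType="ExpressionStmt"/>'
    '<statement nodeType="ReturnStmt"/></statements>')
# ... while today's input only has a block with a single statement.
one = ET.fromstring(
    '<statements><statement nodeType="ReturnStmt"/></statements>')

saved_plan = json.dumps(infer_plan([two, one]))  # the plan is plain JSON
plan = json.loads(saved_plan)                    # reloaded on a later run

stable = convert(one, plan)                 # uses the saved plan
unstable = convert(one, infer_plan([one]))  # re-infers from today's data only
```

With the saved plan, the single-statement block still becomes a one-item array. Re-inferring from today's input alone produces an object instead, which is exactly the kind of inconsistency that would break downstream code.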

The JSON version of the XML fragment shown above ends up looking like this:

 {"_nodeType":"IfStmt",
   "condition":{"_nodeType":"BinaryExpr",
     "_operator":"NOT_EQUALS",
     "left":{"_nodeType":"MethodCallExpr",
       "name":{"_nodeType":"SimpleName",
         "_identifier":"next"
       },
       "scope":{"_nodeType":"NameExpr",
         "name":{"_nodeType":"SimpleName",
           "_identifier":"iter"
         }
       }
     },
     "right":{"_nodeType":"NullLiteralExpr"
     }
   },
   "thenStmt":{"_nodeType":"BlockStmt",
     "statements":[{"_nodeType":"ExpressionStmt",
         "expression":{"_nodeType":"MethodCallExpr",
           "name":{"_nodeType":"SimpleName",
             "_identifier":"close"
           },
           "scope":{"_nodeType":"NameExpr",
             "name":{"_nodeType":"SimpleName",
               "_identifier":"iter"
             }
           }
         }
       },
       
       {"_nodeType":"ReturnStmt",
         "expression":{"_nodeType":"FieldAccessExpr",
           "name":{"_nodeType":"SimpleName",
             "_identifier":"FALSE"
           },
           "scope":{"_nodeType":"NameExpr",
             "name":{"_nodeType":"SimpleName",
               "_identifier":"BooleanValue"
             }
           }
         }
       }
     ]
   }
 }

I decided to use "_" rather than "@" as a prefix for JSON properties derived from XML attributes, on the grounds that the result is a valid NCName and can therefore be more easily selected using the XPath lookup operator, for example $node?_nodeType. The element-to-map function allows any prefix (or none) to be used.
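The NCName argument can be checked mechanically. The Python sketch below uses an ASCII-only approximation of the NCName rule (the real production also admits many non-ASCII letters), and the helper names attribute_key and is_ncname are hypothetical. It shows that a "_" prefix yields a key usable as a bare name after the lookup operator, whereas an "@" prefix does not.

```python
import re

# ASCII-only approximation of the XML NCName production; the real rule
# also allows many non-ASCII characters, so treat this as a rough check.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def attribute_key(name, prefix="_"):
    """Build the JSON property name for an XML attribute (hypothetical helper)."""
    return prefix + name

def is_ncname(key):
    """True if the key could be written as a bare name after '?' in XPath."""
    return NCNAME.fullmatch(key) is not None

underscore_ok = is_ncname(attribute_key("nodeType"))    # key "_nodeType"
at_sign_ok = is_ncname(attribute_key("nodeType", "@"))  # key "@nodeType"
```

A key that fails this check remains usable, but it would have to be written as a quoted string in the lookup rather than as a bare name.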

The fragments shown above illustrate the XML and JSON representations of the raw Java parse tree. In practice the JavaParser product also has an option (the type solver) to decorate the parse tree with additional attributes containing the inferred types of various constructs, and their expanded names. For example, the left node of the first condition above has two additional properties: "_RETURN": "net.sf.saxon.value.AtomicValue", and "_RESOLVED_TYPE": "com.saxonica.functions.qt4.DuplicateValues.DuplicatesIterator", indicating that in the Java method call iter.next(), the type of iter is com.saxonica.functions.qt4.DuplicateValues.DuplicatesIterator, and the return type of the method call is net.sf.saxon.value.AtomicValue.