The Gaps Between XML and LaTeX



XML and LaTeX were designed to serve different purposes, but the most obvious difference is the syntax. TeX commands usually start with a backslash and may take optional arguments in square brackets and mandatory arguments in curly braces. A LaTeX document starts with a preamble that includes the document class, metadata and references to packages. The document body is placed between \begin{document} and \end{document}. A minimal example of a LaTeX document is shown below:

\documentclass{article}
\usepackage[british]{babel}
\title{my Markup UK paper}
\author{Martin Kraetke}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
This is a paragraph with \textit{italicized text}.
\end{document}

While the principles of this syntax are consistent, LaTeX has two main modes that differ syntactically: text mode is used for regular text and math mode is used for mathematical expressions.

In text mode, special characters such as the backslash, curly braces, and the percent sign have a specific meaning. If they are part of the regular text, they need to be escaped, in most cases with a preceding backslash (for example \% or \{). A literal backslash is written as \textbackslash, because \\ denotes a line break. There are various LaTeX macros to mark up the text. For example, \textsubscript{…} and \textsuperscript{…} are used for sub‐ and superscripted text.

Math mode has a slightly different syntax. Math mode can be entered using either LaTeX environments like \begin{equation}…\end{equation} or dollar signs ($ or $$). Furthermore, math mode introduces new commands and special characters for mathematical expressions. In contrast to text mode, the underscore symbol (_) indicates a subscript and the caret symbol (^) denotes a superscript.
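The differences between the two modes can be sketched in a short document. This is only an illustration under the assumption of a standard article class and a reasonably recent LaTeX kernel; the sample text and the formula are made up for the example.

```latex
\documentclass{article}
\begin{document}
% Text mode: reserved characters are printed by escaping them.
Escaped characters: \% \{ \} \_ \& \# \$

% A literal backslash needs its own macro, because \\ means a line break.
A backslash: \textbackslash

% Text-mode macros for subscripts and superscripts.
H\textsubscript{2}O and 10\textsuperscript{th}

% Math mode: underscore and caret mark subscripts and superscripts.
Inline math: $x_{i}^{2}$

\begin{equation}
  E = mc^{2}
\end{equation}
\end{document}
```

An XML to TeX converter therefore has to know which mode it is writing to before it decides how to escape a character or how to express a subscript.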

Another difference between XML and TeX is Unicode‐compatibility. LaTeX was developed before Unicode and originally had limited support for non‐ASCII characters. Support for Unicode in LaTeX depends on the TeX rendering engine being used. The traditional pdfTeX engine is not Unicode‐compatible by default. The input encoding needs to be set correctly and additional font packages need to be used, because the original default font encoding of TeX (OT1) was 7‐bit and used fonts that only had 128 glyphs. LuaTeX and XeTeX are newer TeX engines that are both designed to work with Unicode and OpenType fonts. In summary, when you convert XML to TeX, you need to know which TeX engine is used.
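One possible way to keep the generated LaTeX independent of the engine is to let the preamble test for it. The following sketch assumes that the iftex package is installed; the choice of T1 as the font encoding and the sample name are illustrative, not a recommendation from this paper.

```latex
\documentclass{article}
\usepackage{iftex}
\ifPDFTeX
  % pdfTeX: declare UTF-8 input and replace the 7-bit OT1 font
  % encoding with the 8-bit T1 encoding.
  \usepackage[utf8]{inputenc}
  \usepackage[T1]{fontenc}
\else
  % XeTeX or LuaTeX: Unicode input is native, OpenType fonts
  % are loaded through fontspec.
  \usepackage{fontspec}
\fi
\begin{document}
Ari Nordström
\end{document}
```

With such a preamble, the same converter output can be compiled with pdflatex, xelatex or lualatex, as long as the characters used are covered by the selected fonts.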

While XML has namespaces to add new semantics, LaTeX allows you to include additional functionality through document classes, packages and custom commands or macros. These mechanisms extend the capabilities of LaTeX by adding new features or modifying existing ones. The sheer number of available packages is one of the major advantages of LaTeX, but it can also become a challenge. Different packages often take different approaches to the same problem, which sometimes results in overlapping feature sets. For instance, if you just want to underline text, you can do this in several ways: as an alternative to the built‐in macro \underline{…}, the soul package provides \ul{…} and the ulem package provides \uline{…} for the same purpose. You can also define custom macros with \newcommand{\name}{definition}. The fact that TeX/LaTeX is a programming language in its own right and not just a pure markup language sometimes makes it challenging to process LaTeX files with other programming languages. Parsing XML and converting it to LaTeX is definitely easier than parsing arbitrary LaTeX and converting it to XML.

Even though a number of macros are built into LaTeX and can be considered standard, the choice of macros depends on the available packages, the use case and personal preferences. This aspect must be taken into account when implementing an XML to TeX transformation.
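The underline macros mentioned earlier can be compared side by side. The wrapper macro \xmlunderline in this sketch is hypothetical and not part of any package; it only illustrates how a converter could emit a single macro name and leave the choice of implementation to the preamble.

```latex
\documentclass{article}
\usepackage{soul}            % provides \ul
\usepackage[normalem]{ulem}  % provides \uline; normalem keeps \emph italic
% Hypothetical wrapper: the converter only ever writes \xmlunderline,
% so the underlying macro can be swapped in one place.
\newcommand{\xmlunderline}[1]{\uline{#1}}
\begin{document}
\underline{built-in}, \ul{soul}, \uline{ulem}, \xmlunderline{wrapper}
\end{document}
```

Changing the definition of \xmlunderline to use \ul or \underline switches the implementation without touching the converted document body.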

Many XML grammars include the CALS or the HTML table model, or offer different schema variants with these models. In LaTeX, the basic tabular environment is built in, while more advanced table features are provided by external packages like tabularx^[5]. Here is a brief example:

\begin{tabular}{|c|c|}
Firstname & Surname \\
\hline
Geert & Bormans \\
Ari & Nordström \\
Andrew & Sales \\
Rebecca & Shoob \\
\end{tabular}

This code creates a table with five rows (one header row and four data rows) and two columns. The first argument of tabular provides the column declaration. The letter “c” creates a centered column and the pipe symbol (|) indicates a vertical line. Table cells are separated with an ampersand symbol (&) and each row ends with a double backslash (\\). \hline draws a horizontal line.

If you want to insert a cell that spans several columns, you can use the \multicolumn instruction. Although column spans can be set this way in tabularx, row spans are not supported. For those, you need to add another package called multirow^[6]. Even though multirow can be used together with tabularx, it seems a bit odd to need different packages for a feature that the author would consider basic functionality. Of course, you also need additional packages for colored borders and cells, for tables that break across several pages, and for rotated tables.
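Column and row spans can be combined as in the following sketch. It uses the plain tabular environment for brevity and assumes that the multirow package is installed; the cell contents are placeholders.

```latex
\documentclass{article}
\usepackage{multirow}
\begin{document}
\begin{tabular}{|c|c|c|}
\hline
% One cell spanning all three columns.
\multicolumn{3}{|c|}{Example table} \\
\hline
Group & Firstname & Surname \\
\hline
% One cell spanning two rows; * keeps the natural cell width.
\multirow{2}{*}{Group 1} & A & B \\
                         & C & D \\
\hline
\end{tabular}
\end{document}
```

The second row of the spanned cell is left empty in the source, which is the kind of bookkeeping a converter has to perform when it maps CALS or HTML row spans to LaTeX.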

^[5] David Carlisle (2020): The tabularx package. Available at http://mirrors.ctan.org/macros/latex/required/tools/tabularx.pdf (Accessed: May 16, 2023)

^[6] Pieter van Oostrum et al. (2021): The multirow, bigstrut and bigdelim packages. Available at http://mirrors.ctan.org/macros/latex/contrib/multirow/multirow.pdf (Accessed: May 15, 2023)