Exploring Data

The author has found that having a large number of sample input documents significantly increases the chance that a project will succeed. Thousands, or tens of thousands, of journal articles, for example, selected across every journal and publisher involved in a project, will likely mean that almost every situation you are likely to meet actually occurs in the sample, and that differences between the schemas or DTDs and the actual data will be discovered.

Having the right schemas, DTDs, entity files, and input data, together with the right documentation, is essential.

If there is sample output, compare it to what you generate; having corresponding input and output pairs is a major help.
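For example, a quick check with diff; the directory layout and file names here are hypothetical:

```shell
# Compare one generated document against the corresponding supplied
# sample. The directories and file names are made up for this sketch.
mkdir -p expected generated
printf '<article/>\n' > expected/article1.xml
printf '<article/>\n' > generated/article1.xml
diff -u expected/article1.xml generated/article1.xml \
    && echo "article1 matches the sample output"
```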

If you are faced with exploring, say, five thousand documents, the first step is to validate them. Clearly you don’t want to load each document in turn into an XML editor and press Validate. But a simple XSLT stylesheet might work:

<xsl:template match="/" expand-text="yes">
  <xsl:for-each select="uri-collection('data/*.xml')">
    <xsl:try select="doc(.)">
      <xsl:catch>{.} failed</xsl:catch>
    </xsl:try>
  </xsl:for-each>
</xsl:template>

Exact details will vary depending on how the XSLT implementation handles collections; an alternative is a simple bash script that writes a list of files, together with a stylesheet that uses unparsed-text-lines() to read one URI at a time instead of using collection(). The script could be as simple as this:

ls data > file-list.txt

or something more elaborate, if special characters such as & need escaping.
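A sketch of the stylesheet side of that alternative, assuming XSLT 3.0 and the file-list.txt written above; encode-for-uri() takes care of file names containing special characters:

```xml
<!-- Sketch only: assumes file-list.txt was written by the shell
     step above, and an XSLT 3.0 processor. -->
<xsl:template match="/" name="main" expand-text="yes">
  <xsl:for-each select="unparsed-text-lines('file-list.txt')">
    <xsl:try select="doc('data/' || encode-for-uri(.))">
      <xsl:catch>{.} failed</xsl:catch>
    </xsl:try>
  </xsl:for-each>
</xsl:template>
```

With Saxon, naming the template lets you run it with -it:main, so no source document is needed.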

If validating this way is too slow, you could use xmllint instead of Saxon; it is available on most platforms, though you might need to install it. And since modern computers have multiple processor cores, you can validate groups of sample files in parallel, again perhaps with a script.
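One way to do that, sketched here with GNU xargs and assuming xmllint is installed:

```shell
# Run up to four xmllint processes at once, fifty files per batch.
# Assumes GNU xargs and xmllint; any failure messages end up in
# validation-errors.txt.
mkdir -p data
find data -name '*.xml' -print0 |
    xargs -0 -r -n 50 -P 4 xmllint --noout 2> validation-errors.txt
```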

When you write small scripts in this way, put a comment at the start of each saying how to use it, and make a Makefile or a runme.sh file that runs them all, so that it is easy to return to this task later.
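A minimal runme.sh along those lines; the validation script named in the comment is hypothetical:

```shell
#!/bin/sh
# runme.sh - re-run the whole exploration pipeline from scratch.
set -e
mkdir -p data                # sample documents live here
ls data > file-list.txt      # rebuild the list of files to process
# ./validate-all.sh          # hypothetical: validate every document
echo "file list rebuilt: $(wc -l < file-list.txt) entries"
```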

Validating files usually involves setting up an XML catalog; you might find it helpful to use strace as a wrapper to see which XML catalog files are being opened.

If you do this sort of basic exploration often, consider writing a simple tool. A half hour spent on it will more than repay itself on the second project, or later in the first. It could be a simple shell script taking options for whether to trace the files opened, which directory (folder) to search for data files, and which catalog to use.

This level of scripting is very easy and highly productive. Be very careful to remember the comments, though:

#! /bin/sh
# validate the XML documents named on the command line
# options: -trace  turn on catalog file tracing
TRACE=
if test "$1" = "-trace"
then
    TRACE="strace -e trace=open,openat"
    shift # remove the -trace option
fi
$TRACE xml-validate "$@"

There is no need for 1960s big-business-style comment blocks recording dates and who wrote what. If you need that history, keep your scripts in a git repository, for example on gitlab.com; git will tell you who added each line, and when.

Writing simple tools like this will increase your confidence and skills. Even if you are an experienced programmer with decades of scripting experience, the exercise will get you into the right head-space for working on the project. Think of it as doing lunge exercises before a race!