
Design and Performance of a Corpus Scanner

Liam Quin

Drudge
Delightful Computing

<liam@fromoldbooks.org>

Abstract

People working with large collections of XML documents often need to know specific characteristics of the documents in the collection in aggregate. For example, an attribute value that only occurs once in a million documents might warrant investigation; an element that was expected but that does not occur anywhere might similarly suggest a problem. People designing transformations or style sheets might find it useful to handle the most commonly occurring elements first.

FreqX is an XSLT-based tool that summarizes the various elements, attributes, attribute values, and other details in a collection of XML documents. It can produce several different report formats, including an HTML Web page, a CSV file for a spreadsheet, and of course XML data. It has been run on collections containing tens of thousands of documents, running into tens of gigabytes of XML.

Unfortunately, early versions of the tool used large amounts of memory, several times more than the scanned documents themselves occupied. This made the tool unsuitable for one of its design goals.

Recently, the FreqX tool has been improved so that it runs more quickly and uses much less memory.

This paper describes some of the design decisions that were made both in the creation of FreqX and in the subsequent revision, and also the process of making the tool support large amounts of data.

The tool is written in XSLT 3, and makes use of a number of XPath and XSLT features introduced in that version. Some of these are discussed in the paper. FreqX is publicly available, including full source code, with original development funded by Mulberry Technologies. Suggestions for additional features, as well as reports of problems, are welcomed: the tool is actively maintained.

The improvements reduced memory usage from over sixty gigabytes to less than three gigabytes when processing the Early English Books Online corpus of some fifty-three thousand TEI-based XML documents. Running time fell from almost six hours, ending in a crash, to between thirty and forty-five minutes with successful output. Both runs used Saxon 9 on a decade-old computer.

Table of Contents

Introduction

Tool Requirements and Features

Easy to configure and run
Convenient to provide inputs
Produce multiple forms of report
Combine multiple runs into a single report
Robust against parse errors
Extensible and Maintainable

Implementation

Memory Usage and Speed