Developing the markup vocabulary for CCCP

The previous section outlined a sample of the elements used to construct a cryptic crossword clue. In developing a markup vocabulary for CCCP, it has been necessary to consider how these elements (as well as others not discussed above for reasons of space) relate to each other, and which of them are required in order to give a comprehensive account of the structure of a clue. There are, of course, elements above the level of the clue that could also be captured, such as the source of the crossword, the name of the setter, the date on which the crossword appeared, and so on.

Making decisions about how much detail to include required some predictions about the questions that might be asked of the corpus data at a later stage of the project. Cleary (1996) suggested some possible avenues for research, including examining differences between setters, and the use of polysemous words both in clues and as targets. To this might be added comparison of the "house styles" of various newspapers, comparison of the language of clues at different points in time, and the use of words with ambiguous part of speech. Evidently, these questions require marking up not only the structural features of the clue itself, but also the name of the publication, the setter, the date, and the parts of speech of individual words. Since I am also interested in how various fragments of the target answers are commonly (or, indeed, uncommonly) represented in the subsidiary indications, there also needs to be a way to link each part of the clue to the corresponding part of the answer.

The following example shows a single clue and answer pair within the root <corpus> element:

<corpus>
    <publication pub="independent" type="anthology" ISBN="0550101756" pubdate="2005">
        <puzzle id="1" setter="Aelred">
            <item id="4">
                <clue>
                    <subsidiary>
                        <source class="definition" id="1">
                            <word pos="JJ">Feeble</word>
                            <word pos="NN">type</word>
                        </source>
                        <meta class="locator" type="concatenation">
                            <word pos="VVZ">adopts</word>
                        </meta>
                        <source class="translation" id="2">
                            <word pos="DT">the</word>
                        </source>
                        <meta class="operator" type="translation">
                            <word pos="JJ">French</word>
                        </meta>
                    </subsidiary>
                    <def type="hypernym">
                        <word pos="NN">headdress</word>
                    </def>
                </clue>
                <solution words="1" letters="6" text="wimple" pos="NN">
                    <unit id="1">WIMP</unit>
                    <unit id="2">LE</unit>
                </solution>
            </item>
        </puzzle>
    </publication>
</corpus>

(N.B. for ease of reading, part-of-speech tagging is omitted from all examples below).

The first element within the root, <publication>, captures the published source for the crosswords. In this case, the source is an anthology, so the ISBN and the publication date are also captured, as is the name of the publication (in this case, The Independent). Within this element, the first <puzzle> element has an id attribute (to indicate its place within the anthology) and a setter attribute. The setter attribute must often have the value "anon", because many anthologies fail to identify the specific setter of each puzzle. Each <puzzle> element contains a number of <item> elements; these contain the clue and answer pairs.
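As an illustration of how this hierarchy might later be queried, the following Python sketch walks from <publication> through <puzzle> to <item> and produces one flat record per clue and answer pair. It is not part of CCCP itself: the function name item_records and the abbreviated sample (with <word> elements left out) are assumptions made purely for the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example above; <word> elements are omitted.
SAMPLE = """
<corpus>
    <publication pub="independent" type="anthology" ISBN="0550101756" pubdate="2005">
        <puzzle id="1" setter="Aelred">
            <item id="4">
                <clue>
                    <subsidiary>
                        <source class="definition" id="1">Feeble type</source>
                        <meta class="locator" type="concatenation">adopts</meta>
                        <source class="translation" id="2">the</source>
                        <meta class="operator" type="translation">French</meta>
                    </subsidiary>
                    <def type="hypernym">headdress</def>
                </clue>
                <solution words="1" letters="6" text="wimple" pos="NN">
                    <unit id="1">WIMP</unit>
                    <unit id="2">LE</unit>
                </solution>
            </item>
        </puzzle>
    </publication>
</corpus>
"""


def item_records(root):
    """Yield one dict per <item>, carrying publication and puzzle metadata."""
    for publication in root.iter("publication"):
        for puzzle in publication.iter("puzzle"):
            for item in puzzle.iter("item"):
                solution = item.find("solution")
                yield {
                    "pub": publication.get("pub"),
                    "pubdate": publication.get("pubdate"),
                    # Hypothetical fallback: treat a missing setter as "anon".
                    "setter": puzzle.get("setter", "anon"),
                    "puzzle": puzzle.get("id"),
                    "item": item.get("id"),
                    "answer": solution.get("text"),
                }


records = list(item_records(ET.fromstring(SAMPLE)))
for record in records:
    print(record)
```

Records of this kind could then be grouped by setter, publication, or date to address the comparative questions raised earlier.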

The clue in the above example has the structure defined in section 2 above: a subsidiary indication and a definition. These are captured inside the <clue> element, as <subsidiary> and <def>. The attribute type on <def> indicates whether the definition element is a straight synonym, a phrase, a narrative definition, or some other kind of semantic relative of the target answer; in this case, the definition is a hypernym (a term that is more general than the target answer). This information will be useful for exploring questions of how different kinds of definition might affect clue difficulty.

The child elements of <subsidiary> are <source> (for the words that are manipulated to restructure the target answer) and <meta> (for metalanguage). The class attribute on <source> indicates how the word must be manipulated in order to produce an answer (or answer fragment). In this example, the first <source> is a definition of its corresponding fragment, and the second must be translated. The class attribute on <meta> indicates the function performed by the metalanguage. Here, the first <meta> is of the "locator" class and the "concatenation" type, which indicates that the result of the second <source> is placed after the first. The second <meta> is an operator, and its type attribute tells us that the operation required is translation.

Finally, each individual word in the clue is enclosed in a <word> element, with a pos attribute indicating part of speech. The part-of-speech tagset used is the Penn Treebank tagset as modified for use in the Sketch Engine corpus software (Sketch Engine).

The <solution> element contains basic information such as number of words, number of letters, and the plain text of the target answer, as well as its part of speech (if that can be determined). The target itself is then broken down into <unit> elements, each of which has an id attribute whose value corresponds to the id of the <source> element from which it was produced.

For a double-definition clue, the markup looks a little different:

<item id="13">
    <clue>
        <def type="synonym">Alert</def>
        <def type="descriptive">goalkeeper may dive thus</def>
    </clue>
    <solution words="3" letters="9" text="on the ball" pos="PHRASE">
        <unit>ON THE BALL</unit>
    </solution>
</item>

Here there are two <def> elements, and the single <unit> element is not given an id attribute, as it is assumed that definitions always point towards the answer.
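The correspondence between <source> and <unit> ids, and the convention that definitions point towards the whole answer, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names source_unit_pairs and definition_pairs are hypothetical, <word> elements are omitted from the samples, and the units are assumed not to contain other units.

```python
import xml.etree.ElementTree as ET

WIMPLE = """
<item id="4">
    <clue>
        <subsidiary>
            <source class="definition" id="1">Feeble type</source>
            <meta class="locator" type="concatenation">adopts</meta>
            <source class="translation" id="2">the</source>
            <meta class="operator" type="translation">French</meta>
        </subsidiary>
        <def type="hypernym">headdress</def>
    </clue>
    <solution words="1" letters="6" text="wimple" pos="NN">
        <unit id="1">WIMP</unit>
        <unit id="2">LE</unit>
    </solution>
</item>
"""

ON_THE_BALL = """
<item id="13">
    <clue>
        <def type="synonym">Alert</def>
        <def type="descriptive">goalkeeper may dive thus</def>
    </clue>
    <solution words="3" letters="9" text="on the ball" pos="PHRASE">
        <unit>ON THE BALL</unit>
    </solution>
</item>
"""


def clean(element):
    """Return all text inside an element with whitespace normalised."""
    return " ".join("".join(element.itertext()).split())


def source_unit_pairs(item):
    """Pair each <source> with the <unit> that shares its id."""
    # Units without an id (as in double definitions) are skipped here.
    units = {u.get("id"): clean(u) for u in item.iter("unit")
             if u.get("id") is not None}
    return [(s.get("class"), clean(s), units.get(s.get("id")))
            for s in item.iter("source")]


def definition_pairs(item):
    """Pair each <def> with the whole answer, which it is assumed to target."""
    answer = item.find("solution").get("text")
    return [(d.get("type"), clean(d), answer) for d in item.iter("def")]


wimple = ET.fromstring(WIMPLE)
on_the_ball = ET.fromstring(ON_THE_BALL)
print(source_unit_pairs(wimple))
print(definition_pairs(on_the_ball))
```

For the first item this pairs "Feeble type" with WIMP and "the" with LE; for the double-definition item there are no <source> elements, so only the definition pairs are returned.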

Clues whose metalanguage belongs to the "locator" class and the "container" type generally require one <unit> to be embedded within another in the answer. This is easily achieved by making it legal for one <unit> to have another as a child element:

<item id="6">
    <clue>
        <def type="descriptive">Irish leader</def>
        <subsidiary>
            <source class="anagram" id="1">is a cheat</source>
            <punct>,</punct>
            <meta class="operator" type="anagram">worried</meta>
            <meta class="locator" type="container">about</meta>
            <source class="abbreviation" id="2">nothing</source>
        </subsidiary>
    </clue>
    <solution words="1" letters="9" text="Taoiseach" pos="NP">
        <unit id="1">TA<unit id="2">O</unit>ISEACH</unit>
    </solution>
</item>
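One consequence of nesting is that the letters contributed by a containing <unit> are split around its child. The Python sketch below shows one way of recovering both the letters belonging to each unit alone and the complete answer, and of checking the latter against the text and letters attributes. The helper names own_letters and full_letters are assumptions for the example, not part of the CCCP vocabulary.

```python
import xml.etree.ElementTree as ET

SOLUTION = """
<solution words="1" letters="9" text="Taoiseach" pos="NP">
    <unit id="1">TA<unit id="2">O</unit>ISEACH</unit>
</solution>
"""


def own_letters(unit):
    """Letters contributed by this unit alone, excluding embedded units."""
    parts = [unit.text or ""]
    for child in unit:
        # Text following an embedded unit still belongs to the outer unit.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def full_letters(unit):
    """All letters inside this unit, including those of embedded units."""
    return "".join(unit.itertext()).strip()


solution = ET.fromstring(SOLUTION)

# iter() visits nested units as well as top-level ones.
fragments = {u.get("id"): own_letters(u) for u in solution.iter("unit")}
print(fragments)

# Only direct children of <solution> are joined to rebuild the answer.
assembled = "".join(full_letters(u) for u in solution.findall("unit"))
print(assembled)

# Simple consistency checks against the attributes on <solution>.
assert assembled == solution.get("text").upper()
assert len(assembled) == int(solution.get("letters"))
```

Here the outer unit contributes TAISEACH (the anagram of "is a cheat") and the inner unit contributes O, while the assembled string matches the nine-letter answer.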

Another distinctive markup requirement is found in The Sun, which publishes "Two-Speed Crosswords". The same grid of target answers is offered with two sets of clues: one quick, and one cryptic. On the principle that it is always better to capture too much information than too little, I decided to add the quick clues in an initial <quick> element for these crosswords:

<item id="2">
    <clue>
        <quick>On fire</quick>
        <subsidiary>
            <source class="literal" id="1">A</source>
            <source class="synonym" id="2">fair</source>
        </subsidiary>
        <def type="synonym">land</def>
    </clue>
    <solution words="1" letters="6" text="alight" pos="VV">
        <unit id="1">A</unit>
        <unit id="2">LIGHT</unit>
    </solution>
</item>
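Because <quick> sits inside <clue>, any routine that reconstructs the cryptic clue from its child elements needs to set the quick clue aside. The following Python sketch shows one possible way of doing so; the function name split_clue is hypothetical, and the spacing of <punct> elements is not handled.

```python
import xml.etree.ElementTree as ET

ITEM = """
<item id="2">
    <clue>
        <quick>On fire</quick>
        <subsidiary>
            <source class="literal" id="1">A</source>
            <source class="synonym" id="2">fair</source>
        </subsidiary>
        <def type="synonym">land</def>
    </clue>
    <solution words="1" letters="6" text="alight" pos="VV">
        <unit id="1">A</unit>
        <unit id="2">LIGHT</unit>
    </solution>
</item>
"""


def split_clue(item):
    """Return (quick clue or None, cryptic clue text) for one <item>."""
    clue = item.find("clue")
    quick = clue.find("quick")
    quick_text = None
    if quick is not None and quick.text:
        quick_text = quick.text.strip()
    words = []
    for child in clue:
        if child.tag == "quick":
            continue  # the quick clue is not part of the cryptic clue
        # Children are read in document order, so word order is preserved.
        words.extend("".join(child.itertext()).split())
    return quick_text, " ".join(words)


quick_clue, cryptic_clue = split_clue(ET.fromstring(ITEM))
print(quick_clue)
print(cryptic_clue)
```

For items from publications without quick clues, the first value is simply None, so the same routine could be applied across the whole corpus.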