The previous section of this paper outlined a sample of the elements that are used to construct a cryptic crossword clue. In developing a markup vocabulary for CCCP, it has been necessary to consider how these elements (as well as others not discussed above for reasons of space) relate to each other, and which of them are required in order to give a comprehensive account of the structure of a clue. There are, of course, elements above the level of the clue that could also be captured, such as the source of the crossword, the name of the setter, the date on which the crossword appeared, and so on.
Deciding how much detail to include required some predictions about the questions that might be asked of the corpus data at a later stage of the project. Cleary (1996) suggested some possible avenues for research, including examining differences between setters, and the use of polysemous words both in clues and as targets. To these might be added comparison of the "house styles" of various newspapers, comparison of the language of clues at different points in time, and the use of words with ambiguous part of speech. Evidently, these questions require marking up not only the structural features of the clue itself, but also the name of the publication, the setter, the date, and the parts of speech of individual words. Since I am also interested in how various fragments of the target answers are commonly (or, indeed, uncommonly) represented in the subsidiary indications, there also needs to be a way to link each part of the clue to the corresponding part of the answer.
The following example shows a single clue and answer pair within the root <corpus> element:
<corpus>
  <publication pub="independent" type="anthology" ISBN="0550101756" pubdate="2005">
    <puzzle id="1" setter="Aelred">
      <item id="4">
        <clue>
          <subsidiary>
            <source class="definition" id="1">
              <word pos="JJ">Feeble</word>
              <word pos="NN">type</word>
            </source>
            <meta class="locator" type="concatenation">
              <word pos="VVZ">adopts</word>
            </meta>
            <source class="translation" id="2">
              <word pos="DT">the</word>
            </source>
            <meta class="operator" type="translation">
              <word pos="JJ">French</word>
            </meta>
          </subsidiary>
          <def type="hypernym">
            <word pos="NN">headdress</word>
          </def>
        </clue>
        <solution words="1" letters="6" text="wimple" pos="NN">
          <unit id="1">WIMP</unit>
          <unit id="2">LE</unit>
        </solution>
      </item>
    </puzzle>
  </publication>
</corpus>
(N.B. for ease of reading, part-of-speech tagging is omitted from all examples below).
The first element within the root, <publication>, captures the published source for the crosswords. In this case, the source is an anthology, so the ISBN and the publication date are also captured, as is the name of the publication, in this case The Independent. Within this element, the first <puzzle> element has an id attribute (to indicate its place within the anthology) and a setter attribute. The setter attribute must often have the value "anon", because many anthologies fail to identify the specific setter of each puzzle. Each <puzzle> element contains a number of <item> elements; these contain the clue and answer pairs.
The clue in the above example has the structure defined in section 2 above: a subsidiary indication and a definition. These are captured inside the <clue> element, as <subsidiary> and <def>. The type attribute on <def> indicates whether the definition is a straight synonym, a phrase, a narrative definition, or some other kind of semantic relative of the target answer; in this case, the definition is a hypernym (a term that is more general than the target answer). This information will be useful for exploring questions of how different kinds of definition might affect clue difficulty.
The child elements of <subsidiary> are <source> (for the words that are manipulated to restructure the target answer) and <meta> (for metalanguage). The class attribute on <source> indicates how the word must be manipulated in order to produce an answer (or answer fragment). In this example, the first <source> is a definition of its corresponding fragment, and the second must be translated. The class attribute on <meta> indicates the function performed by the metalanguage. In this case, the first <meta> is of the "locator" class and the "concatenation" subtype, which indicates that the result of the second <source> is placed after the first. The second <meta> is an operator, and an additional type attribute tells us that the operation required is translation. Finally, each individual word in the clue is enclosed in a <word> element, with a pos attribute indicating part of speech. The part-of-speech tagset used is the Penn Treebank tagset as modified for use in the Sketch Engine corpus software (Sketch Engine).
The <solution> element contains basic information such as number of words, number of letters, and the plain text of the target answer, as well as its part of speech (if that can be determined). The target itself is then broken down into <unit> elements, each of which has an id attribute whose value corresponds to the id of the <source> element from which it was produced.
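The id-based link between <source> and <unit> can be followed mechanically. The sketch below is illustrative only and is not part of the CCCP tooling: it assumes Python's standard xml.etree.ElementTree module, an abridged copy of the item above with part-of-speech attributes omitted, and a hypothetical helper name, link_sources_to_units.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the WIMPLE item shown above (pos attributes omitted).
ITEM_XML = """
<item id="4">
  <clue>
    <subsidiary>
      <source class="definition" id="1"><word>Feeble</word> <word>type</word></source>
      <meta class="locator" type="concatenation"><word>adopts</word></meta>
      <source class="translation" id="2"><word>the</word></source>
      <meta class="operator" type="translation"><word>French</word></meta>
    </subsidiary>
    <def type="hypernym"><word>headdress</word></def>
  </clue>
  <solution words="1" letters="6" text="wimple" pos="NN">
    <unit id="1">WIMP</unit>
    <unit id="2">LE</unit>
  </solution>
</item>
"""


def link_sources_to_units(item):
    """Pair each <source> with the <unit> sharing its id (hypothetical helper).

    Returns a list of (clue fragment, source class, answer letters) tuples.
    Assumes flat (non-nested) units, as in this example.
    """
    units = {u.get("id"): u for u in item.iter("unit")}
    links = []
    for source in item.iter("source"):
        unit = units.get(source.get("id"))
        # Rebuild the clue fragment from the child <word> elements.
        fragment = " ".join(w.text for w in source.iter("word"))
        letters = unit.text if unit is not None else None
        links.append((fragment, source.get("class"), letters))
    return links


item = ET.fromstring(ITEM_XML)
links = link_sources_to_units(item)
for fragment, source_class, letters in links:
    print(f"{fragment!r} ({source_class}) -> {letters}")
# 'Feeble type' (definition) -> WIMP
# 'the' (translation) -> LE
```

A table of this kind, accumulated over the whole corpus, is one way of asking how a given answer fragment such as LE is typically represented in subsidiary indications.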
For a double-definition clue, the markup looks a little different:
<item id="13">
  <clue>
    <def type="synonym">Alert</def>
    <def type="descriptive">goalkeeper may dive thus</def>
  </clue>
  <solution words="3" letters="9" text="on the ball" pos="PHRASE">
    <unit>ON THE BALL</unit>
  </solution>
</item>
Here there are two <def> elements, and the single <unit> element is not given an id attribute, as it is assumed that definitions always point towards the answer.
Clues whose metalanguage is of the locator class and the "container" type generally require embedding one <unit> within another in the answer. This is easily achieved by making it legal for one <unit> to have another as a child element:
<item id="6">
  <clue>
    <def type="descriptive">Irish leader</def>
    <subsidiary>
      <source class="anagram" id="1">is a cheat</source>
      <punct>,</punct>
      <meta class="operator" type="anagram">worried</meta>
      <meta class="locator" type="container">about</meta>
      <source class="abbreviation" id="2">nothing</source>
    </subsidiary>
  </clue>
  <solution words="1" letters="9" text="Taoiseach" pos="NP">
    <unit id="1">TA<unit id="2">O</unit>ISEACH</unit>
  </solution>
</item>
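Nested <unit> elements produce mixed content, so recovering the letters that belong to each unit takes slightly more care than in the flat case. The following is a minimal sketch, again assuming Python's xml.etree.ElementTree; the helper own_letters is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# The <solution> element from the Taoiseach example above.
SOLUTION_XML = (
    '<solution words="1" letters="9" text="Taoiseach" pos="NP">'
    '<unit id="1">TA<unit id="2">O</unit>ISEACH</unit>'
    '</solution>'
)


def own_letters(unit):
    """Letters contributed by this unit alone, skipping embedded units."""
    # unit.text is the text before the first child element; each child's
    # .tail is the text that follows that child inside this unit.
    parts = [unit.text or ""]
    parts.extend(child.tail or "" for child in unit)
    return "".join(parts)


solution = ET.fromstring(SOLUTION_XML)

# The full answer is all text content in document order.
full_answer = "".join(solution.itertext())

# Map each unit id to the letters that unit itself supplies.
by_id = {u.get("id"): own_letters(u) for u in solution.iter("unit")}

print(full_answer)  # TAOISEACH
print(by_id)        # {'1': 'TAISEACH', '2': 'O'}
```

Here unit 1 yields TAISEACH, the anagram of "is a cheat", while unit 2 yields the embedded O, so the container relationship is preserved without duplicating any letters.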
Another distinctive markup requirement is found in The Sun, which publishes "Two-Speed Crosswords". The same grid of target answers is offered with two sets of clues: one quick, and one cryptic. On the principle that it is always better to capture too much information than too little, I decided to add the quick clues in an initial <quick> element for these crosswords:
<item id="2">
  <clue>
    <quick>On fire</quick>
    <subsidiary>
      <source class="literal" id="1">A</source>
      <source class="synonym" id="2">fair</source>
    </subsidiary>
    <def type="synonym">land</def>
  </clue>
  <solution words="1" letters="6" text="alight" pos="VV">
    <unit id="1">A</unit>
    <unit id="2">LIGHT</unit>
  </solution>
</item>
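Because the quick clue sits inside <clue> alongside the cryptic components, any process that reassembles the cryptic clue text needs to skip it. A minimal sketch of one way to separate the two, assuming Python's xml.etree.ElementTree and a hypothetical helper named split_clues:

```python
import xml.etree.ElementTree as ET

# The Two-Speed item shown above.
ITEM_XML = """
<item id="2">
  <clue>
    <quick>On fire</quick>
    <subsidiary>
      <source class="literal" id="1">A</source>
      <source class="synonym" id="2">fair</source>
    </subsidiary>
    <def type="synonym">land</def>
  </clue>
  <solution words="1" letters="6" text="alight" pos="VV">
    <unit id="1">A</unit>
    <unit id="2">LIGHT</unit>
  </solution>
</item>
"""


def split_clues(item):
    """Return (quick clue or None, cryptic clue text) for one <item>.

    Hypothetical helper: words are joined with single spaces, so the
    original spacing around any <punct> content is not reproduced.
    """
    clue = item.find("clue")
    quick = clue.findtext("quick")  # None when there is no <quick> child
    words = []
    for child in clue:
        if child.tag == "quick":
            continue  # the quick clue is not part of the cryptic clue
        words.extend("".join(child.itertext()).split())
    return quick, " ".join(words)


quick, cryptic = split_clues(ET.fromstring(ITEM_XML))
print(quick)    # On fire
print(cryptic)  # A fair land
```

For items from publications without quick clues, the first value is simply None and the cryptic text is assembled in the same way.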