A Notation for Character Collections for the WWW

“A Notation for Character Collections for the WWW” [9] (hereafter, the W3C notation) provides an XML syntax for describing subsets. Although it has neither become a W3C Recommendation nor been implemented, it contains a number of interesting ideas.

The W3C notation does not use regular expressions. Instead, it introduces two XML elements, range and enum, which represent ranges of code points and individually enumerated code points, respectively.
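The distinction between the two elements can be sketched in Python. This is only a model of the idea, not the notation's XML syntax: the helper names code_range and code_enum are hypothetical, and a subset is assumed to be a plain set of integer code points.

```python
def code_range(first: int, last: int) -> frozenset:
    """All code points from first to last, inclusive (models a range)."""
    if first > last:
        raise ValueError("first must not exceed last")
    return frozenset(range(first, last + 1))


def code_enum(*points: int) -> frozenset:
    """Individually listed code points (models an enum)."""
    return frozenset(points)


# Uppercase A-Z as one range, plus two separately listed code points.
subset = code_range(0x0041, 0x005A) | code_enum(0x00C5, 0x00D8)
```

A range is compact for contiguous runs, while an enumeration is convenient for scattered code points; a subset description can combine both.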

A distinctive feature of the W3C notation is its kernel and hull elements, which are used to define open collections, that is, collections whose repertoire may grow as characters are added later.
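One way to picture an open collection is as a pair of sets. The sketch below assumes that the kernel holds the code points that already belong to the collection and that the hull bounds everything the collection may ever contain; this reading, the class name OpenCollection, and the example values are assumptions made for illustration, not definitions taken from [9].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenCollection:
    kernel: frozenset  # assumed: code points that are members now
    hull: frozenset    # assumed: outer limit of any future growth

    def __post_init__(self):
        # Under this reading, every current member must lie inside the hull.
        if not self.kernel <= self.hull:
            raise ValueError("kernel must be a subset of hull")

    def status(self, cp: int) -> str:
        if cp in self.kernel:
            return "member"
        if cp in self.hull:
            return "possible"
        return "excluded"


# Illustrative values only: A-Z are members, the rest of ASCII may be added.
letters = OpenCollection(
    kernel=frozenset(range(0x0041, 0x005B)),
    hull=frozenset(range(0x0000, 0x0080)),
)
```

Under these assumptions, a receiver can accept kernel characters unconditionally and reject anything outside the hull, while treating the remainder as not yet decided.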

Unlike regular expressions, the W3C notation provides a mechanism for referencing other subset descriptions and well-known subsets (e.g., collections in ISO/IEC 10646). A subset can therefore be described by naming the subsets it builds on rather than by repeating their contents.
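The effect of such references can be sketched as name resolution against a registry. The registry layout, the resolve function, and the entries below are hypothetical; the two upper-case names merely stand in for well-known collections, and their ranges are given for illustration only.

```python
# A definition is either a concrete set of code points
# or a list of names of other definitions to merge.
REGISTRY = {
    "BASIC LATIN": frozenset(range(0x0020, 0x007F)),         # illustrative range
    "LATIN-1 SUPPLEMENT": frozenset(range(0x00A0, 0x0100)),  # illustrative range
    "western": ["BASIC LATIN", "LATIN-1 SUPPLEMENT"],
    "my-subset": ["western"],
}


def resolve(name, registry, _seen=()):
    """Expand a named subset into code points, following references."""
    if name in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (name,)))
    definition = registry[name]
    if isinstance(definition, frozenset):
        return definition
    result = frozenset()
    for ref in definition:
        result |= resolve(ref, registry, _seen + (name,))
    return result
```

Here my-subset contains no code points of its own; its content is determined entirely by the definitions it refers to.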

The W3C notation also provides set operations (union, inverse, difference, and intersection), so that a new subset can be derived by combining existing ones.
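The four operations correspond to ordinary set algebra, as the following sketch shows. It assumes that a subset is a set of integer code points and that inverse is taken relative to the full code space U+0000 to U+10FFFF; the universe that [9] uses for inverse is not stated here, so that choice is an assumption.

```python
# Assumption: the universe for inverse is every code point U+0000..U+10FFFF.
UNIVERSE = frozenset(range(0x110000))


def union(a, b):
    return a | b


def intersection(a, b):
    return a & b


def difference(a, b):
    return a - b


def inverse(a, universe=UNIVERSE):
    return universe - a


letters = frozenset(range(0x41, 0x5B)) | frozenset(range(0x61, 0x7B))
ascii_all = frozenset(range(0x80))

# ASCII code points that are not letters, written in two equivalent ways.
non_letters = difference(ascii_all, letters)
same_thing = intersection(inverse(letters), ascii_all)
```

Both expressions select the same code points, which illustrates that difference can be expressed through inverse and intersection.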

However, the W3C notation lacks mechanisms for describing grapheme clusters.
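A small sketch suggests why this matters. It assumes that membership is checked one code point at a time, which is how a set of code points would naturally be applied; the function name and the example strings are hypothetical.

```python
def all_code_points_allowed(text: str, subset: frozenset) -> bool:
    """Check each code point of text independently against the subset."""
    return all(ord(ch) in subset for ch in text)


# Goal: allow the cluster "g" + COMBINING DIAERESIS (U+0308),
# in a subset that also contains the plain letter "q".
subset = frozenset({ord("g"), ord("q"), 0x0308})

wanted = "g\u0308"    # the cluster we meant to allow
unwanted = "q\u0308"  # a cluster we did not mean to allow

wanted_ok = all_code_points_allowed(wanted, subset)
unwanted_ok = all_code_points_allowed(unwanted, subset)
```

Both checks succeed, because a set of code points records which characters are allowed but not which sequences of them form permitted clusters.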