Introduction

Which Unicode characters are you willing to accept? If you receive UTF-8 text from somebody, it might contain any of the 136,690 characters encoded in Unicode 10.

Accepting any Unicode character may lead to problems in the future. First, nobody can read all characters. Second, few fonts cover all characters. Third, some software such as document editors supports only a subset of Unicode.

Historically, legacy encodings have protected users from the proliferation of characters. For example, as long as you use Shift JIS, you only have to worry about roughly 7,000 characters. But UTF-8 now exposes almost 88,000 CJK[14] ideographic characters.

In the Unicode era, we need a language for describing which characters are allowed and for checking text against descriptions written in that language. ISO/IEC 19757-7 Character REPertoire Description Language (CREPDL) [3] is an attempt by ISO/IEC JTC1/SC34 to provide such a language. Although the first edition was restricted to individual code points, the second edition can also handle code point sequences, which represent grapheme clusters (“user-perceived characters”).
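To make the idea of checking text against a repertoire concrete, here is a minimal Python sketch. It is not CREPDL syntax and not the implementation discussed in this paper. The repertoire (Basic Latin printable characters plus the Hiragana block) and the names ALLOWED_RANGES, is_allowed and disallowed_characters are assumptions chosen purely for illustration. The sketch tests one code point at a time, so it corresponds only to the code-point level of description and does not handle code point sequences.

```python
from typing import List

# Hypothetical repertoire for illustration only: printable Basic Latin
# (U+0020..U+007E) and the Hiragana block (U+3040..U+309F),
# given as inclusive code point ranges.
ALLOWED_RANGES = [(0x0020, 0x007E), (0x3040, 0x309F)]


def is_allowed(ch: str) -> bool:
    """Return True if the single code point ch falls in an allowed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def disallowed_characters(text: str) -> List[str]:
    """Return the code points of text that lie outside the repertoire, in order."""
    return [ch for ch in text if not is_allowed(ch)]
```

With this sketch, a string of Latin letters and Hiragana passes with an empty result, while a string containing a CJK ideograph such as U+6F22 is reported as containing a disallowed character.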

The rest of this paper is organized as follows. The section called “Subsets in Unicode” and the section called “Subsets in ISO/IEC 10646” study subsets in Unicode and ISO/IEC 10646, respectively. The section called “Existing Machine-readable Notations for Describing Subsets” studies two existing machine-readable notations for describing subsets: regular expressions [7] and a W3C notation [9]. In particular, we show that regular expressions have performance problems for huge subsets and cannot directly capture subsets defined in terms of other subsets. The section called “Design and Implementation of CREPDL” gives a brief overview of the design and implementation of CREPDL and shows how it overcomes the limitations of the existing notations.



[14] Chinese, Japanese and Korean