CREPDL: Protect Yourself from the Proliferation of Unicode Characters

Makoto Murata

Keio University and JEPA


This paper studies machine-readable notations for describing subsets of Unicode or ISO/IEC 10646. Unicode regular expressions can describe any subset, but they perform poorly on very large subsets and cannot directly express a subset that is defined in terms of other subsets. The upcoming second edition of ISO/IEC 19757-7, Character Repertoire Description Language (CREPDL), overcomes these problems by allowing references to well-known subsets and to external CREPDL scripts.

Table of Contents

Subsets in Unicode
Subsets in ISO/IEC 10646
Code Points and Ranges
Open Collections and Fixed Collections
References to Other Collections
Grapheme Clusters
User-defined Subsets
Existing Machine-readable Notations for Describing Subsets
Unicode Regular Expressions
Referencing other subsets
A Notation for Character Collections for the WWW
Design and Implementation of CREPDL
Language Design
Concluding Remarks and Future Works