CREPDL: Protect Yourself from the Proliferation of Unicode Characters

Makoto Murata

Keio University and JEPA

Abstract

This paper studies machine-readable notations for describing subsets of Unicode or ISO/IEC 10646. Unicode regular expressions can describe any subset, but they perform poorly on very large subsets and cannot directly express a subset defined in terms of other subsets. The upcoming second edition of ISO/IEC 19757-7, Character Repertoire Description Language (CREPDL), overcomes both problems by allowing references to well-known subsets and to external CREPDL scripts.


Table of Contents

Introduction
Subsets in Unicode
Subsets in ISO/IEC 10646
Collections
Code Points and Ranges
Open Collections and Fixed Collections
References to Other Collections
Grapheme Clusters
User-defined Subsets
Existing Machine-readable Notations for Describing Subsets
Unicode Regular Expressions
Referencing Other Subsets
Performance
A Notation for Character Collections for the WWW
Design and Implementation of CREPDL
Language Design
Implementation
Concluding Remarks and Future Work
Bibliography