Language Design

CREPDL is intended to combine the best parts of Unicode regular expressions and the W3C notation. Unlike regular expressions, CREPDL can easily handle large collections of characters. Unlike the W3C notation, CREPDL can handle grapheme clusters.

First, CREPDL allows Unicode regular expressions to be used as atomic expressions, through the char element of CREPDL. Because a regular expression can match a sequence of code points, it can also describe grapheme clusters, which are represented as such sequences.
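As an illustration of a regular expression acting as an atomic expression, the following minimal sketch tests whether a whole unit (a single code point or a code point sequence) matches one pattern. It is not CREPDL syntax: Python's re module stands in for a Unicode regular expression engine, and the pattern, the name ATOM, and the function matches_atom are hypothetical.

```python
import re

# Hypothetical atomic expression: a Latin small letter, optionally followed
# by COMBINING ACUTE ACCENT (U+0301). Python's `re` is only a stand-in for
# a Unicode regular expression engine; it is not what CREPDL prescribes.
ATOM = re.compile(r"[a-z]\u0301?")


def matches_atom(unit: str) -> bool:
    """Return True if the entire unit matches the atomic expression."""
    return ATOM.fullmatch(unit) is not None
```

Under these assumptions, the two-code-point sequence "e" followed by U+0301 is accepted as one unit, while U+0301 on its own is rejected.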

Second, CREPDL borrows mechanisms from the W3C notation, with some modifications.

The CREPDL processor has two working modes: character and graphemeCluster. If the mode is character, the CREPDL processor examines each code point in the input text stream. If the mode is graphemeCluster, the CREPDL processor extracts grapheme clusters from the text stream by applying the algorithm defined in [6], and then validates each grapheme cluster.
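The difference between the two modes can be sketched as follows. This is a simplified illustration, not the CREPDL processor: the function names are hypothetical, and simple_clusters only attaches combining marks to the preceding code point, which is a rough stand-in for the segmentation algorithm of [6] rather than an implementation of it.

```python
import unicodedata
from typing import Callable, Iterator, List


def code_points(text: str) -> Iterator[str]:
    """character mode: every code point is a unit of validation."""
    yield from text


def simple_clusters(text: str) -> Iterator[str]:
    """graphemeCluster mode (simplified): a code point plus the combining
    marks that follow it. Assumption: this approximates, but does not
    implement, the segmentation algorithm referenced as [6]."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining code point starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


def invalid_units(text: str, mode: str,
                  is_allowed: Callable[[str], bool]) -> List[str]:
    """Return the units of `text` that the repertoire does not allow."""
    if mode == "character":
        units = code_points(text)
    elif mode == "graphemeCluster":
        units = simple_clusters(text)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [unit for unit in units if not is_allowed(unit)]
```

For example, with a repertoire that allows only "a" and the sequence "e" plus U+0301, the text "e", U+0301, "a" is fully valid in graphemeCluster mode, whereas character mode reports "e" and U+0301 as two separate invalid units.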

Huge well-known collections referenced by <repertoire> can be implemented as hash-based sets, so that membership of a code point can be tested in constant time on average. Thus, the CREPDL processor can handle such collections very efficiently.
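A hash-based set for a collection might look like the following minimal sketch. The code point ranges are an illustrative example (printable Basic Latin and the Hiragana letters), not a registered collection, and the names build_repertoire, EXAMPLE_COLLECTION, and in_repertoire are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple


def build_repertoire(ranges: Iterable[Tuple[int, int]]) -> FrozenSet[int]:
    """Expand inclusive code point ranges into a hash-based set."""
    members = set()
    for first, last in ranges:
        members.update(range(first, last + 1))
    return frozenset(members)


# Illustrative ranges only: U+0020..U+007E and U+3041..U+3096.
EXAMPLE_COLLECTION = build_repertoire([(0x0020, 0x007E), (0x3041, 0x3096)])


def in_repertoire(ch: str, repertoire: FrozenSet[int]) -> bool:
    """Average-case constant-time membership test for one code point."""
    return ord(ch) in repertoire
```

The set is built once, at the cost of memory proportional to the size of the collection; each subsequent lookup is then independent of how many code points the collection contains.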

This paper does not cover the details of the CREPDL language. Interested readers are encouraged to consult the Committee Draft (CD) or the upcoming Draft International Standard (DIS) of ISO/IEC 19757-7.