Unicode Regular Expressions

Unicode regular expressions [7] can be used for representing Unicode subsets. In fact, in the Unicode Common Locale Data Repository [1], subsets for each locale are represented by regular expressions. Subsubsection 5.3.3 (Unicode Sets) of Unicode Technical Standard #35 [8] describes the use of regular expressions for subsets and demonstrates the use of code points, ranges, code point sequences, and set operations (union, inverse, difference, and intersection).
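The set operations mentioned above can be illustrated with a minimal sketch that models a subset as a plain Python set of code points. This only shows what the operations mean; it is not how CLDR or any regular expression engine implements them. The helper name code_points and the example subsets (Latin letters, vowels, digits) are assumptions chosen for illustration, and the bracket notation in the comments follows the style of UTS #35.

```python
def code_points(first, last):
    """Return the set of code points in the inclusive range first..last."""
    return set(range(first, last + 1))

# Illustrative subsets (assumptions, not taken from CLDR data).
letters = code_points(0x61, 0x7A)        # roughly [a-z]
vowels = {ord(c) for c in "aeiou"}       # roughly [aeiou]
digits = code_points(0x30, 0x39)         # roughly [0-9]

# Union: members of either subset, roughly [[a-z][0-9]].
union = letters | digits

# Difference: letters that are not vowels, roughly [[a-z]-[aeiou]].
difference = letters - vowels

# Intersection: members of both subsets, roughly [[a-z]&[aeiou]].
intersection = letters & vowels

# Inverse: every code point U+0000..U+10FFFF not in the subset, roughly [^a-z].
universe = code_points(0x0, 0x10FFFF)
inverse = universe - letters
```

Materializing the inverse as an explicit set is affordable here but already yields more than a million members, which hints at why compact range-based notation matters for large collections.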

Any collection defined in ISO/IEC 10646 can be represented by a Unicode regular expression. In particular, code point sequences representing grapheme clusters (e.g., <5289,E0101>) can be represented by regular expressions (e.g., {\u5289\U000E0101}).
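As a rough sketch of how a subset containing a code point sequence might be tested in practice, the following uses Python's standard re module. Python's re does not support the brace notation for sequences shown above, so the sequence is written here as an alternative in an alternation; that translation is an assumption of this sketch. The range U+4E00..U+4E0F and the function name in_subset are hypothetical, chosen only to keep the example small.

```python
import re

# Hypothetical subset: the range U+4E00..U+4E0F plus the sequence <5289,E0101>.
# The sequence is a separate alternative because a character class can only
# hold single code points.
ITEM = r"(?:\u5289\U000E0101|[\u4E00-\u4E0F])"
SUBSET_TEXT = re.compile(ITEM + r"*")

def in_subset(text):
    """Return True if text consists only of items belonging to the subset."""
    return SUBSET_TEXT.fullmatch(text) is not None
```

With these definitions, in_subset("\u4E00\u5289\U000E0101") holds, while in_subset("\u5289") does not, because U+5289 belongs to this subset only when followed by the variation selector U+E0101.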

However, this approach has two problems. They are insignificant for small collections but become significant for large ones, such as the CJK collections.