Grapheme Clusters

A grapheme cluster [6] is a sequence of code points that represents a single “user-perceived character”. A simple example is a base character followed by a combining character.
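The base-plus-combining case can be sketched in Python as follows. This is only a simplified approximation, not the full segmentation algorithm: it assumes that any code point with a non-zero canonical combining class attaches to the preceding character. The function name `simple_clusters` is hypothetical and introduced here for illustration.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified assumption: a code point whose canonical combining class
    is non-zero belongs to the preceding cluster. Real grapheme cluster
    segmentation has more rules than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the cluster being built.
            clusters[-1] += ch
        else:
            # Base character (or a leading mark): start a new cluster.
            clusters.append(ch)
    return clusters


# "e" followed by COMBINING ACUTE ACCENT (U+0301), then a plain "a".
sample = "e\u0301a"
sample_clusters = simple_clusters(sample)
```

Here `sample` holds three code points, but `sample_clusters` holds two entries, one per user-perceived character.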

CONTEMPORARY LITHUANIAN LETTERS (collection 284) is the first collection to contain grapheme clusters, such as <004A, 0303> and <0069, 0307, 0301>. Note that 0303 is allowed to follow some code points (e.g., 004A), but is not allowed to follow others (e.g., 004B).
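Because a combining character is permitted only after particular base characters, membership in such a collection cannot be decided one code point at a time. A minimal sketch, assuming the collection is modeled as a set of single code points plus a set of explicitly listed sequences, might look like this. The names `ALLOWED_SINGLE`, `ALLOWED_CLUSTERS`, and `is_allowed` are hypothetical, and the data is a toy subset built from the examples above, not the actual content of collection 284.

```python
# Toy subset for illustration only; NOT the real content of collection 284.
ALLOWED_SINGLE = {0x004A, 0x004B, 0x0069}
ALLOWED_CLUSTERS = {
    (0x004A, 0x0303),
    (0x0069, 0x0307, 0x0301),
}


def is_allowed(sequence):
    """Return True if the code point sequence is in the toy collection.

    A one-element sequence is checked against the single code points;
    a longer sequence must match a listed grapheme cluster exactly.
    """
    sequence = tuple(sequence)
    if len(sequence) == 1:
        return sequence[0] in ALLOWED_SINGLE
    return sequence in ALLOWED_CLUSTERS


j_with_tilde = is_allowed([0x004A, 0x0303])  # listed cluster
k_with_tilde = is_allowed([0x004B, 0x0303])  # 0303 may not follow 004B
k_alone = is_allowed([0x004B])               # single code point
```

Listing whole sequences, rather than listing 0303 as a member on its own, is what lets the collection accept <004A, 0303> while rejecting <004B, 0303>.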

MOJI-JOHO-KIBAN IDEOGRAPHS-2016 (collection 390) is a collection applicable to personal names in Japanese public service. This collection contains grapheme clusters such as <5289, E0101> and <5351, FE00>, where E0101 is an ideographic variation selector and FE00 is a variation selector. Although E0101 is allowed to follow 5289, it is not allowed to follow certain other characters (for example, 5288).
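The two kinds of selector mentioned here can be told apart by their code point ranges, but whether a given pair is permitted still has to be looked up. The sketch below assumes the ranges FE00 to FE0F and E0100 to E01EF for the two kinds of selector, and uses a toy pair list taken from the examples above rather than the real content of collection 390. The names `selector_kind`, `ALLOWED_PAIRS`, and `is_allowed_pair` are hypothetical.

```python
def selector_kind(code_point):
    """Classify a code point as a kind of variation selector, or None."""
    if 0xFE00 <= code_point <= 0xFE0F:
        return "variation selector"
    if 0xE0100 <= code_point <= 0xE01EF:
        # Supplementary selectors, used in ideographic variation sequences.
        return "ideographic variation selector"
    return None


# Toy subset for illustration only; NOT the real content of collection 390.
ALLOWED_PAIRS = {
    (0x5289, 0xE0101),
    (0x5351, 0xFE00),
}


def is_allowed_pair(base, selector):
    """Return True only if this exact base/selector pair is listed."""
    return (base, selector) in ALLOWED_PAIRS


kind_e0101 = selector_kind(0xE0101)
kind_fe00 = selector_kind(0xFE00)
pair_5289 = is_allowed_pair(0x5289, 0xE0101)  # listed pair
pair_5288 = is_allowed_pair(0x5288, 0xE0101)  # E0101 may not follow 5288
```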

CONTEMPORARY LITHUANIAN LETTERS is much smaller than MOJI-JOHO-KIBAN IDEOGRAPHS-2016. CONTEMPORARY LITHUANIAN LETTERS (collection 284) contains fewer than 100 code points and grapheme clusters, whereas MOJI-JOHO-KIBAN IDEOGRAPHS-2016 contains more than 52000 code points and more than 10000 grapheme clusters.