Grapheme clusters – YUM!

When you hear the words “grapheme clusters”, the first thing that springs to mind is a breakfast cereal. Now a grapheme is the “smallest unit of a writing system of any given language”. So a cluster is, well, a group of them. Makes sense, right?

Now in the *old* days, a character set had 128 characters (think ASCII) in it, which isn’t exactly a lot. Well, it was a lot at the time. Now everyone wants extra characters, and we need characters in different languages, or mathematical characters. Enter Unicode. But Unicode brings a wrinkle: what a reader sees as one character is not always one code point. Take ü. It can be stored as a single precomposed scalar value, “LATIN SMALL LETTER U WITH DIAERESIS” (U+00FC), but it can also be stored in decomposed form as two scalar values, “LATIN SMALL LETTER U” (U+0075) followed by “COMBINING DIAERESIS” (U+0308). Either way the reader sees one letter, and that user-perceived character is the grapheme cluster.
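To see the two forms of ü side by side, here is a small sketch in Python using the standard unicodedata module. The helper name describe() is just something made up for this illustration; the normalization forms NFD (decomposed) and NFC (composed) are the standard Unicode ones.

```python
import unicodedata

def describe(text):
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

precomposed = "\u00fc"                                   # ü as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # u + combining diaeresis

print(describe(precomposed))  # ['U+00FC LATIN SMALL LETTER U WITH DIAERESIS']
print(describe(decomposed))   # ['U+0075 LATIN SMALL LETTER U', 'U+0308 COMBINING DIAERESIS']

# The two strings look identical on screen but differ in length and compare unequal.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Recomposing with NFC gets back the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So the same visible letter can have a length of 1 or 2 depending on how it was stored, which is exactly why counting code points is not the same as counting characters.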

Some languages, such as Julia, provide a function such as graphemes() to help iterate over the grapheme clusters in a string. But one has to ask whether this just complicates languages too much. Apart from being able to output these characters, which is obviously nice, do we need Unicode variable names like in Julia? There, using δ instead of the word delta is as easy as typing the LaTeX code \delta and hitting the <tab> key. But is this practical? Admittedly, Julia lets you write either the word pi or the symbol π for the same value, so that is nice.
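For a rough feel of what a function like graphemes() has to do, here is a deliberately simplified sketch in Python. The function simple_graphemes() is hypothetical and is not how Julia does it: it only glues combining marks onto the preceding character, whereas real grapheme segmentation follows the much longer rule set in the Unicode text segmentation annex and also has to cope with things like emoji sequences and Hangul syllables.

```python
import unicodedata

def simple_graphemes(text):
    """Approximate grapheme clusters: attach combining marks to the previous character.

    Simplified assumption: a cluster is a base character plus any following
    code points with a non-zero canonical combining class. This is NOT the
    full Unicode segmentation algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark joins the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

word = "u\u0308ber"                  # "über" with a decomposed ü
print(len(word))                     # 5 code points
print(len(simple_graphemes(word)))   # 4 clusters
```

Even this toy version shows the gap between what the string type counts (5) and what a reader counts (4).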

Programming languages should sometimes be simpler than they are (like the good old days), and I have to think that adding extra stuff often makes them more complex than they have to be. Maybe that’s why people still like the simplicity of C – despite its idiosyncrasies.