
What is the right way to get a grapheme?

Why does this print a U and not a Ü?

#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

while ( $string =~ /(\X)/g ) {
        say $1;
}

# Output: U
asked Feb 24 '12 10:02 by sid_com



2 Answers

Your code is correct.

You really do need to play these things by the numbers; don't trust what a terminal displays. Pipe the output through the uniquote program, probably with -x or -v, and see which code points it really contains.

Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you. Normalization shouldn't matter: \X matches a whole grapheme cluster whether the string is composed or decomposed.
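If you want to check by the numbers from inside Perl itself, something like the following sketch prints the code points of each \X match as plain ASCII, so the terminal's rendering no longer matters. The helper name cluster_codepoints is made up for illustration; it is not part of any module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.014;
use charnames qw(:full);

# Hypothetical helper: returns one string per grapheme cluster (\X match),
# listing that cluster's code points in U+XXXX notation.
sub cluster_codepoints {
    my ($string) = @_;
    my @clusters;
    while ( $string =~ /(\X)/g ) {
        push @clusters,
            join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $1;
    }
    return @clusters;
}

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

# One output line per cluster; the output is ASCII only.
say for cluster_codepoints($string);    # prints: U+0055 U+0308
```

A single output line containing both code points means \X captured the base letter and the combining mark together, whatever the terminal chooses to draw.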

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
cr\x{E8}me br\x{FB}l\x{E9}e
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' 
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
cre\x{300}me bru\x{302}le\x{301}e

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' 
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
\x{E9}el\x{302}urb em\x{300}erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
e\x{301}el\x{302}urb em\x{300}erc
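
The reversals above work code point by code point, which is why the combining marks end up on the wrong letters. For comparison, here is a minimal sketch that reverses by grapheme cluster instead, using the same \X as in the question. The helper name reverse_graphemes is made up for illustration, and the file is assumed to be saved as UTF-8 because of use utf8.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.014;
use utf8;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: reverse a string one grapheme cluster at a time,
# so each combining mark stays attached to its own base letter.
sub reverse_graphemes {
    my ($string) = @_;
    # In list context, /\X/g returns every cluster as a separate element.
    return join '', reverse( $string =~ /\X/g );
}

binmode STDOUT, ':utf8';

# Decompose first to show that the marks survive the reversal,
# then recompose for output.
say NFC( reverse_graphemes( NFD("crème brûlée") ) );
```

Piping this through uniquote -x is still the reliable way to confirm what was actually written.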
answered Oct 04 '22 01:10 by tchrist


This works for me, though I have an older version of Perl (5.12) on Ubuntu. My only change to your script is: use 5.012;

$ perl so.pl 
Ü
answered Oct 04 '22 01:10 by beerbajay