
How to match Unicode vowels?

What character class or Unicode property will match any Unicode vowel in Perl?

Wrong answer: [aeiouAEIOU]. (sermon here, item #24 in the laundry list)

perluniprops mentions vowels only for Hangul and Indic scripts.

Let's set aside the question of what a vowel is. Yes, "i" may not be a vowel in some contexts, so any character that can be a vowel will do.

n.r. asked Aug 05 '16 15:08



2 Answers

There's no such property.

$ uniprops --all a
U+0061 <a> \N{LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
       ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
       IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
       POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
       X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
    Age=1.1 Age=V1_1 Block=Basic_Latin Bidi_Class=L Bidi_Class=Left_To_Right BC=L
       Bidi_Paired_Bracket_Type=None Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR
       Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na
       Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA Indic_Positional_Category=NA InPC=NA
       Indic_Syllabic_Category=Other InSC=Other Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL
       Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
       Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
       Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
       Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
       Present_In=6.1 IN=6.1 Present_In=6.2 IN=6.2 Present_In=6.3 IN=6.3 Present_In=7.0 IN=7.0
       Present_In=8.0 IN=8.0 SC=Latn Script=Latn Script_Extensions=Latin Scx=Latn
       Script_Extensions=Latn Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE
       Word_Break=LE

The most important thing when dealing with i18n is to think about what you actually need, yet you didn't even mention what you are trying to accomplish.

Find vowels? That can't be what you are actually trying to do. I could see a use for identifying vowel sounds in a word, but those are often formed from multiple letters (such as "oo" in English, and "in", "an"/"en", "ou", "ai", "au"/"eau", "eu" in French), and it would be language-specific.

As it stands, you're asking for a global solution but you're defining the problem in local terms. You first need to start by defining the actual problem you are trying to solve.
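To make the point about multi-letter vowel sounds concrete, here is a rough sketch for one language only. The pattern name `$french_vowel_group` and the list of letter groups are assumptions taken from the French examples above. It matches spellings, not sounds, so it will misfire on words where "in" or "en" is not a single vowel sound.

```perl
use v5.10;
use utf8;
use open qw(:std :utf8);

# Hypothetical, French-only pattern: multi-letter groups are listed first
# so that "eau" is tried before the single-letter fallback class.
my $french_vowel_group = qr/eau|au|ou|ai|eu|in|an|en|[aeiouy]/i;

# Collect every vowel group found in the word, left to right.
my @groups = 'beaucoup' =~ /($french_vowel_group)/g;

say "@groups";    # prints "eau ou"
```

A character class alone would report five separate vowels for this word, which is why the problem needs to be stated per language before a pattern can be chosen.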

ikegami answered Oct 23 '22 23:10


Setting aside the definition of a vowel and the obvious problem that different languages share symbols but use them differently, you can define your own property for use in a Perl pattern.

Define a subroutine whose name starts with In or Is and have it return a string listing the characters in the property. The simplest form is one hexadecimal code point per line, or a range written as two code points separated by horizontal whitespace:

#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);

sub InSpecial {
    return <<"HERE";
00A7
00B6
2295\t229C
HERE
}


$_ = "ABC\x{00A7}";

say $_;
say /\p{InSpecial}/ ? 'Matched' : 'Missed';
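
Applied to the original question, the same mechanism could look like the sketch below. The property name `IsLatinVowel` and the letter list are illustrative assumptions, not anything Perl or Unicode defines. The list is deliberately incomplete (it leaves out "y", for instance), and you would have to extend it for the languages you care about. The file must be saved as UTF-8 because of `use utf8`.

```perl
#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);

# Hypothetical property: an illustrative, incomplete list of Latin-script
# letters that can act as vowels. Extend it for your own languages.
sub IsLatinVowel {
    my $letters = 'aeiouAEIOUàáâäèéêëìíîïòóôöùúûüÀÉÈÖÜ';

    # Emit one hexadecimal code point per line, the simplest form
    # a user-defined property subroutine can return.
    return join '', map { sprintf("%04X\n", ord($_)) } split //, $letters;
}

my $word   = 'Übergrößenträger';
my @vowels = $word =~ /\p{IsLatinVowel}/g;

say scalar(@vowels), " vowels: @vowels";
```

Building the list from a string of literal characters keeps it readable. The subroutine only turns each character into the hex code point format shown above.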

brian d foy answered Oct 24 '22 01:10